<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kate Vu</title>
    <description>The latest articles on DEV Community by Kate Vu (@katevu).</description>
    <link>https://dev.to/katevu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F938569%2F581196c1-bae6-40b3-9a86-bd6e71d67e27.png</url>
      <title>DEV Community: Kate Vu</title>
      <link>https://dev.to/katevu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/katevu"/>
    <language>en</language>
    <item>
      <title>Building a Serverless LLM Pipeline with Amazon Bedrock and SageMaker Fine-Tuning using AWS CDK</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Fri, 27 Feb 2026 04:21:02 +0000</pubDate>
      <link>https://dev.to/katevu/building-a-serverless-llm-pipeline-with-amazon-bedrock-and-sagemaker-fine-tuning-using-aws-cdk-4125</link>
      <guid>https://dev.to/katevu/building-a-serverless-llm-pipeline-with-amazon-bedrock-and-sagemaker-fine-tuning-using-aws-cdk-4125</guid>
      <description>&lt;p&gt;Large-language models (LLMs) can support a wide range of use cases such as classification, summaries, etc. However they can require additional customization to incorporate domain-specific knowledge and up-to-date information.&lt;br&gt;
In this blog we will build serverless pipelines that fine-tuning LLM Models using Amazon SageMaker, and deploying these models. Using AWS CDK as infrastructure as code, the solution separates training workflow from inference workflow, ensuring the production workloads remain stable and unaffected during model training and update. Additionally, leveraging Amazon AppConfig allows dynamic configuration updates without requiring redeployment.&lt;br&gt;
The app is built using Kiro🔥&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is composed of two main pipelines, as shown in the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3hyqz5m67drtm9p33ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3hyqz5m67drtm9p33ip.png" alt=" " width="800" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Training/Fine-tuning pipeline: responsible for data preparation, model fine-tuning, evaluation, and approval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inference pipeline: responsible for serving production requests using the approved model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  1. Training pipeline
&lt;/h3&gt;

&lt;p&gt;The training process is responsible for fine-tuning LLM models using AWS resources. While we rely on AWS services to do the heavy lifting, the workflow is initiated manually.&lt;br&gt;
Data preparation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training datasets are downloaded from one of three sources: Hugging Face, Amazon public datasets, or synthetic data.&lt;/li&gt;
&lt;li&gt;The data is formatted and split into three sets: a training set, a validation set, and a test set.&lt;/li&gt;
&lt;li&gt;The datasets are then uploaded to an S3 bucket, ready for the training process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model fine-tuning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning is executed on Amazon SageMaker and triggered by a Python script.&lt;/li&gt;
&lt;li&gt;The script supports both full training and LoRA, with LoRA as the default.&lt;/li&gt;
&lt;li&gt;After the training job completes, evaluation metrics are generated. If you are satisfied with the results, register the model in the SageMaker Model Registry and wait for approval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automated deployment trigger: once the model is approved, a Lambda function is triggered automatically to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new SageMaker endpoint.&lt;/li&gt;
&lt;li&gt;Update AWS Systems Manager Parameter Store with the new endpoint name.&lt;/li&gt;
&lt;/ul&gt;
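&lt;p&gt;The registration and approval step can also be scripted. A minimal sketch with boto3 (the helper names are ours and the model-package ARN is a placeholder; approving the package is what fires the EventBridge rule):&lt;/p&gt;

```python
def is_model_package_arn(arn: str) -> bool:
    """Loose sanity check for a SageMaker model-package ARN."""
    parts = arn.split(":")
    return (
        len(parts) == 6
        and parts[0] == "arn"
        and parts[2] == "sagemaker"
        and parts[5].startswith("model-package/")
    )


def approve_model_package(arn: str) -> None:
    """Mark a registered model package as Approved, which triggers
    the endpoint-deployment Lambda via EventBridge."""
    if not is_model_package_arn(arn):
        raise ValueError("not a model-package ARN: " + arn)
    import boto3  # deferred import; only needed when actually calling AWS

    sagemaker = boto3.client("sagemaker")
    sagemaker.update_model_package(
        ModelPackageArn=arn,
        ModelApprovalStatus="Approved",
    )
```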
&lt;h3&gt;
  
  
  2. Inference pipeline
&lt;/h3&gt;

&lt;p&gt;This pipeline is responsible for handling real-time review-summary requests from users. Incoming requests are received via API Gateway, which invokes a Lambda function to process them. The generated summaries are stored in an S3 bucket for later use, such as auditing, monitoring, or analytics.&lt;br&gt;
To enable comparison between the foundation LLM and the fine-tuned model, the Lambda function first invokes a foundation model. It then invokes the Amazon SageMaker endpoint created by the training pipeline above.&lt;br&gt;
AWS AppConfig is used to manage runtime settings, such as which model to invoke. This approach enables dynamic model switching without redeploying the whole application.&lt;/p&gt;
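&lt;p&gt;As a sketch of this flow, the Lambda function might build the Bedrock request from the AppConfig settings like this (function names are illustrative, not the actual handler):&lt;/p&gt;

```python
import json


def build_bedrock_body(review_text: str, bedrock_cfg: dict) -> str:
    """Build the request body for a Claude model on Amazon Bedrock
    (Messages API), using the model settings held in AppConfig."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": bedrock_cfg["maxTokens"],
        "temperature": bedrock_cfg["temperature"],
        "top_p": bedrock_cfg["topP"],
        "messages": [
            {"role": "user", "content": "Summarize this review:\n\n" + review_text}
        ],
    })


def summarize_with_foundation_model(review_text: str, app_cfg: dict) -> str:
    """Invoke the configured Claude model on Bedrock and return its text.
    (In the real Lambda a second call goes to the SageMaker endpoint so
    both summaries can be compared.)"""
    import boto3  # deferred; only needed when running inside Lambda

    client = boto3.client("bedrock-runtime")
    resp = client.invoke_model(
        modelId=app_cfg["bedrock"]["modelId"],
        body=build_bedrock_body(review_text, app_cfg["bedrock"]),
    )
    payload = json.loads(resp["body"].read())
    return payload["content"][0]["text"]
```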


&lt;h2&gt;
  
  
  AWS Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon SageMaker&lt;/li&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon API Gateway&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS AppConfig&lt;/li&gt;
&lt;li&gt;AWS Systems Manager Parameter Store&lt;/li&gt;
&lt;li&gt;Amazon EventBridge&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;li&gt;AWS Identity and Access Management (IAM)&lt;/li&gt;
&lt;li&gt;Amazon CloudWatch&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;An AWS account that has been bootstrapped for AWS CDK.&lt;br&gt;
Environment setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;AWS CDK Toolkit&lt;/li&gt;
&lt;li&gt;Docker (used for bundling Lambda functions when deploying)&lt;/li&gt;
&lt;li&gt;AWS credentials: keep them handy so you can deploy the stacks&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Building the app
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. AppConfig stack
&lt;/h3&gt;

&lt;p&gt;This stack leverages AWS AppConfig to store the runtime configuration. First we define the JSON for each environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "bedrock": {
    "modelId": "anthropic.claude-3-haiku-20240307-v1:0",
    "maxTokens": 200,
    "temperature": 0.5,
    "topP": 0.9
  },
  "sagemaker": {
    "enabled": true,
    "timeout": 30000,
    "models": {
      "stable": {
        "endpointName": "endpoint-kate",
        "description": "Kate's development model",
        "weight": 100
      }
    },
    "strategy": "weighted"
  },
  "rag": {
    "enabled": false,
    "topK": 3
  },
  "features": {
    "sentimentAnalysis": true,
    "caching": false,
    "useNewSummarizationPrompt": false,
    "enableAdvancedRAG": false,
    "useMultiModelEnsemble": false
  },
  "abTesting": {
    "enabled": false,
    "rules": []
  },
  "monitoring": {
    "logABTestAssignments": true,
    "trackModelPerformance": true,
    "metricsNamespace": "LLMPipeline/Kate"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
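&lt;p&gt;To illustrate how the &lt;code&gt;sagemaker.models&lt;/code&gt; section could be consumed at runtime, here is a small sketch of the weighted endpoint selection (the function name is ours; the real Lambda logic may differ):&lt;/p&gt;

```python
import random


def pick_endpoint(sagemaker_cfg: dict, rng=None) -> str:
    """Pick an endpoint name from the 'models' map using the 'weighted'
    strategy: each entry is chosen in proportion to its 'weight'."""
    rng = rng or random.Random()
    models = sagemaker_cfg["models"]
    names = list(models)
    weights = [models[name]["weight"] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return models[chosen]["endpointName"]
```

&lt;p&gt;With a single &lt;code&gt;stable&lt;/code&gt; entry at weight 100 this always returns that endpoint; adding a second entry (for example a canary at weight 10) would split traffic without any redeployment.&lt;/p&gt;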



&lt;p&gt;Then we create the stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as appconfig from 'aws-cdk-lib/aws-appconfig';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as fs from 'fs';
import * as path from 'path';
import { EnvironmentConfig } from './utils';

export interface AppConfigStackProps extends cdk.StackProps {
  config: EnvironmentConfig;
}

export class AppConfigStack extends cdk.Stack {
  public readonly application: appconfig.CfnApplication;
  public readonly appConfigEnvironment: appconfig.CfnEnvironment;
  public readonly configurationProfile: appconfig.CfnConfigurationProfile;

  constructor(scope: Construct, id: string, props: AppConfigStackProps) {
    super(scope, id, props);

    const { config } = props;

    // Create AppConfig Application
    this.application = new appconfig.CfnApplication(this, 'Application', {
      name: `llm-pipeline-${config.environmentName}`,
      description: 'Configuration for LLM Pipeline',
    });

    // Create AppConfig Environment
    this.appConfigEnvironment = new appconfig.CfnEnvironment(this, 'Environment', {
      applicationId: this.application.ref,
      name: config.environmentName,
      description: `${config.environmentName} environment`,
    });

    // Create Configuration Profile
    this.configurationProfile = new appconfig.CfnConfigurationProfile(this, 'ConfigProfile', {
      applicationId: this.application.ref,
      name: 'runtime-config',
      description: 'Runtime configuration for Lambda functions',
      locationUri: 'hosted',
      type: 'AWS.Freeform',
    });

    // Initial configuration with A/B testing support
    // These are RUNTIME settings that can be updated without redeployment
    // Loaded from config/appconfig-{environment}.json
    const configPath = path.join(__dirname, `../config/appconfig-${config.environmentName}.json`);

    let configContent: string;
    if (!fs.existsSync(configPath)) {
      throw new Error(
        `\n========================================\n` +
        `ERROR: AppConfig file missing for environment "${config.environmentName}"\n` +
        `========================================\n` +
        `Expected file: config/appconfig-${config.environmentName}.json\n` +
        `Full path: ${configPath}\n\n` +
        `Please create this file with runtime configuration.\n` +
        `You can copy from an existing environment:\n` +
        `  cp config/appconfig-kate.json config/appconfig-${config.environmentName}.json\n` +
        `========================================\n`
      );
    }

    try {
      configContent = fs.readFileSync(configPath, 'utf8');
      // Validate it's valid JSON
      JSON.parse(configContent);
      console.log(`✓ Loaded AppConfig for "${config.environmentName}" from: ${configPath}`);
    } catch (error) {
      throw new Error(
        `\n========================================\n` +
        `ERROR: Invalid AppConfig JSON for environment "${config.environmentName}"\n` +
        `========================================\n` +
        `File: config/appconfig-${config.environmentName}.json\n` +
        `Error: ${error instanceof Error ? error.message : String(error)}\n\n` +
        `Please ensure the file contains valid JSON.\n` +
        `Check for:\n` +
        `  - Missing commas\n` +
        `  - Trailing commas\n` +
        `  - Unquoted keys\n` +
        `  - Invalid escape sequences\n` +
        `========================================\n`
      );
    }

    // Create deployment strategy (immediate deployment)
    const deploymentStrategy = new appconfig.CfnDeploymentStrategy(this, 'DeploymentStrategy', {
      name: `immediate-${config.environmentName}`,
      deploymentDurationInMinutes: 0,
      growthFactor: 100,
      replicateTo: 'NONE',
      finalBakeTimeInMinutes: 0,
    });

    // Create hosted configuration version
    const configVersion = new appconfig.CfnHostedConfigurationVersion(this, 'ConfigVersion', {
      applicationId: this.application.ref,
      configurationProfileId: this.configurationProfile.ref,
      content: configContent,
      contentType: 'application/json',
      description: 'Initial configuration',
    });

    // Automatically deploy the configuration
    new appconfig.CfnDeployment(this, 'Deployment', {
      applicationId: this.application.ref,
      environmentId: this.appConfigEnvironment.ref,
      deploymentStrategyId: deploymentStrategy.ref,
      configurationProfileId: this.configurationProfile.ref,
      configurationVersion: configVersion.ref,
      description: 'Automatic deployment from CDK',
    });

    // Outputs
    new cdk.CfnOutput(this, 'ApplicationId', {
      value: this.application.ref,
      description: 'AppConfig Application ID',
      exportName: `${config.environmentName}-appconfig-app-id`,
    });

    new cdk.CfnOutput(this, 'EnvironmentId', {
      value: this.appConfigEnvironment.ref,
      description: 'AppConfig Environment ID',
      exportName: `${config.environmentName}-appconfig-env-id`,
    });

    new cdk.CfnOutput(this, 'ConfigurationProfileId', {
      value: this.configurationProfile.ref,
      description: 'AppConfig Configuration Profile ID',
      exportName: `${config.environmentName}-appconfig-profile-id`,
    });
  }

  /**
   * Grant Lambda function permission to read AppConfig
   */
  public grantRead(grantee: iam.IGrantable): void {
    grantee.grantPrincipal.addToPrincipalPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'appconfig:GetConfiguration',
          'appconfig:GetLatestConfiguration',
          'appconfig:StartConfigurationSession',
        ],
        resources: ['*'],
      })
    );
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
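&lt;p&gt;On the Lambda side, this hosted configuration can be fetched through the AWS AppConfig Lambda extension, which serves it on a local HTTP endpoint (port 2772). A minimal sketch, assuming the application, environment, and profile names created by the stack above (error handling omitted):&lt;/p&gt;

```python
import json
import urllib.request


def appconfig_url(app: str, env: str, profile: str) -> str:
    """URL served by the AWS AppConfig Lambda extension sidecar."""
    return (
        "http://localhost:2772/applications/" + app
        + "/environments/" + env
        + "/configurations/" + profile
    )


def load_runtime_config(app: str, env: str, profile: str) -> dict:
    """Fetch and parse the hosted configuration. Inside Lambda the
    extension handles caching and polling AppConfig for new versions."""
    with urllib.request.urlopen(appconfig_url(app, env, profile)) as resp:
        return json.loads(resp.read())
```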



&lt;h3&gt;
  
  
  2. Fine-Tuning Model Pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.1 Create the pipeline
&lt;/h4&gt;

&lt;p&gt;This pipeline will create the following AWS resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 buckets: a training-data bucket and a model-artifacts bucket&lt;/li&gt;
&lt;li&gt;SageMaker IAM role used for training jobs&lt;/li&gt;
&lt;li&gt;SSM Parameter Store entry to hold the active endpoint name&lt;/li&gt;
&lt;li&gt;EventBridge rule to trigger deployment when a model is approved&lt;/li&gt;
&lt;li&gt;Lambda function to deploy the approved models&lt;/li&gt;
&lt;li&gt;CloudWatch log groups&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { PythonFunction } from '@aws-cdk/aws-lambda-python-alpha';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as ssm from 'aws-cdk-lib/aws-ssm';
import * as logs from 'aws-cdk-lib/aws-logs';
import { EnvironmentConfig } from './utils';

export interface TrainingPipelineStackProps extends cdk.StackProps {
  config: EnvironmentConfig;
}

export class TrainingPipelineStack extends cdk.Stack {
  public readonly trainingBucket: s3.Bucket;
  public readonly modelBucket: s3.Bucket;
  public readonly endpointParameter: ssm.StringParameter;

  constructor(scope: Construct, id: string, props: TrainingPipelineStackProps) {
    super(scope, id, props);

    const { config } = props;

    // ========================================
    // S3 Buckets for Training
    // ========================================

    // NOTE: Using DESTROY for cost-saving during development
    // For production, change to RETAIN to preserve training data and models
    this.trainingBucket = new s3.Bucket(this, 'TrainingDataBucket', {
      bucketName: `training-data-${config.environmentName}-${cdk.Aws.ACCOUNT_ID}`,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      lifecycleRules: [
        {
          id: 'DeleteOldVersions',
          noncurrentVersionExpiration: cdk.Duration.days(90),
        },
      ],
    });

    // NOTE: Using DESTROY for cost-saving during development
    // For production, change to RETAIN to preserve model artifacts
    this.modelBucket = new s3.Bucket(this, 'ModelArtifactsBucket', {
      bucketName: `model-artifacts-${config.environmentName}-${cdk.Aws.ACCOUNT_ID}`,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
    });

    // ========================================
    // Parameter Store for Active Endpoint
    // ========================================

    this.endpointParameter = new ssm.StringParameter(this, 'ActiveEndpointParameter', {
      parameterName: `/summarizer/${config.environmentName}/active-endpoint`,
      stringValue: 'none',
      description: 'Active SageMaker endpoint name for inference',
      tier: ssm.ParameterTier.STANDARD,
    });

    // ========================================
    // IAM Role for SageMaker Training
    // ========================================

    const sagemakerRole = new iam.Role(this, 'SageMakerTrainingRole', {
      assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSageMakerFullAccess'),
      ],
    });

    this.trainingBucket.grantReadWrite(sagemakerRole);
    this.modelBucket.grantReadWrite(sagemakerRole);

    // ========================================
    // Lambda: Update Endpoint on Model Approval
    // ========================================

    const updateEndpointLogGroup = new logs.LogGroup(this, 'UpdateEndpointLogGroup', {
      logGroupName: `/aws/lambda/update-endpoint-${config.environmentName}`,
      retention: logs.RetentionDays.ONE_WEEK,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    const updateEndpointFn = new PythonFunction(this, 'UpdateEndpointFunction', {
      functionName: `update-endpoint-${config.environmentName}`,
      entry: 'src/lambdas/update-endpoint',
      runtime: lambda.Runtime.PYTHON_3_11,
      index: 'handler.py',
      handler: 'handler',
      description: `Update SageMaker endpoint for ${config.environmentName}`,
      timeout: cdk.Duration.minutes(5),
      memorySize: 256,
      environment: {
        PARAMETER_NAME: this.endpointParameter.parameterName,
        ENVIRONMENT: config.environmentName,
        SAGEMAKER_ROLE_ARN: sagemakerRole.roleArn,
      },
      logGroup: updateEndpointLogGroup,
    });

    // Grant permissions
    this.endpointParameter.grantRead(updateEndpointFn);
    this.endpointParameter.grantWrite(updateEndpointFn);

    updateEndpointFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: [
        'sagemaker:DescribeModelPackage',
        'sagemaker:CreateModel',
        'sagemaker:CreateEndpoint',
        'sagemaker:CreateEndpointConfig',
        'sagemaker:UpdateEndpoint',
        'sagemaker:DescribeEndpoint',
      ],
      resources: ['*'],
    }));

    // Grant permission to pass the SageMaker execution role
    updateEndpointFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: ['iam:PassRole'],
      resources: [sagemakerRole.roleArn],
    }));

    // ========================================
    // EventBridge: Trigger on Model Approval
    // ========================================

    const modelApprovalRule = new events.Rule(this, 'ModelApprovalRule', {
      ruleName: `model-approval-${config.environmentName}`,
      description: 'Trigger endpoint update when SageMaker model is approved',
      eventPattern: {
        source: ['aws.sagemaker'],
        detailType: ['SageMaker Model Package State Change'],
        detail: {
          ModelApprovalStatus: ['Approved'],
        },
      },
    });

    modelApprovalRule.addTarget(new targets.LambdaFunction(updateEndpointFn));

    // ========================================
    // Outputs
    // ========================================

    new cdk.CfnOutput(this, 'TrainingBucketName', {
      value: this.trainingBucket.bucketName,
      description: 'S3 bucket for training data',
      exportName: `${config.environmentName}-training-bucket`,
    });

    new cdk.CfnOutput(this, 'ModelBucketName', {
      value: this.modelBucket.bucketName,
      description: 'S3 bucket for model artifacts',
      exportName: `${config.environmentName}-model-bucket`,
    });

    new cdk.CfnOutput(this, 'EndpointParameterName', {
      value: this.endpointParameter.parameterName,
      description: 'Parameter Store key for active endpoint',
      exportName: `${config.environmentName}-endpoint-parameter`,
    });

    new cdk.CfnOutput(this, 'SageMakerRoleArn', {
      value: sagemakerRole.roleArn,
      description: 'IAM role for SageMaker training jobs',
      exportName: `${config.environmentName}-sagemaker-role`,
    });
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
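&lt;p&gt;For reference, the &lt;code&gt;update-endpoint&lt;/code&gt; handler's event parsing might look like this sketch (field names follow the SageMaker model-package state-change event; the boto3 deployment calls are elided):&lt;/p&gt;

```python
def parse_approval_event(event: dict) -> str:
    """Extract the approved model-package ARN from a
    'SageMaker Model Package State Change' EventBridge event."""
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        raise ValueError("event is not a model approval")
    return detail["ModelPackageArn"]


def handler(event, context):
    arn = parse_approval_event(event)
    # From here the function would, via boto3:
    #   1. sagemaker.create_model(...) from the approved package
    #   2. sagemaker.create_endpoint_config(...) and create/update the endpoint
    #   3. ssm.put_parameter(...) with the new endpoint name
    return {"modelPackageArn": arn}
```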



&lt;h4&gt;
  
  
  2.2 Create the scripts:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prepare training datasets&lt;/strong&gt;&lt;br&gt;
Create a Python script to download the datasets. Data can be downloaded from Hugging Face or the Amazon reviews dataset, or generated synthetically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python3
"""
Download and prepare training data from public datasets

This script downloads customer review data and formats it for SageMaker training.
It supports multiple sources:
1. Hugging Face Datasets (recommended - easy and reliable)
2. Amazon Customer Reviews (real data from AWS Open Data Registry)
3. Synthetic data (generated for testing)

Output: training_data/ folder with train.jsonl, validation.jsonl, test.jsonl

Usage:
    # Download from Hugging Face (recommended)
    python scripts/download_training_data.py --source huggingface --dataset amazon_polarity --num-samples 5000

    # Generate synthetic data for testing
    python scripts/download_training_data.py --source synthetic --num-samples 1000

    # Download real Amazon reviews
    python scripts/download_training_data.py --source amazon --max-samples 5000
"""

import os
import json
import gzip
import argparse
import urllib.request
import ssl
from pathlib import Path
from typing import List, Dict
import random

# Fix SSL certificate verification issue on macOS
ssl._create_default_https_context = ssl._create_unverified_context


def download_huggingface_dataset(
    output_dir: Path, dataset_name: str = "amazon_polarity", max_samples: int = 5000
):
    """
    Download dataset from Hugging Face
    Source: https://huggingface.co/datasets

    Popular datasets:
    - amazon_polarity: Amazon reviews (positive/negative) - NO SUMMARIES
    - yelp_review_full: Yelp reviews with 1-5 star ratings - NO SUMMARIES
    - imdb: Movie reviews - NO SUMMARIES
    - rotten_tomatoes: Movie reviews - NO SUMMARIES
    - app_reviews: Mobile app reviews - NO SUMMARIES
    - cnn_dailymail: News articles WITH SUMMARIES (recommended for summarization)
    - xsum: News WITH SUMMARIES (extreme summarization)
    - samsum: Dialogues WITH SUMMARIES
    """
    print(f"\n📦 Downloading from Hugging Face: {dataset_name}")
    print(f"This may take a few minutes...")

    try:
        from datasets import load_dataset
    except ImportError:
        print("\n❌ Error: 'datasets' library not installed")
        print("Install it with: pip install datasets")
        return []

    try:
        # Load dataset with config if needed
        print(f"Loading dataset '{dataset_name}'...")

        # Datasets that need config versions
        if dataset_name == 'cnn_dailymail':
            dataset = load_dataset(dataset_name, '3.0.0')
        elif dataset_name == 'xsum':
            dataset = load_dataset(dataset_name)
        elif dataset_name == 'samsum':
            dataset = load_dataset(dataset_name)
        else:
            # Regular datasets (reviews)
            dataset = load_dataset(dataset_name)

        # Get train split
        train_data = dataset["train"]

        # Process samples
        reviews = []
        count = 0

        print(f"Processing samples...")
        for item in train_data:
            if count &amp;gt;= max_samples:
                break

            # Handle summarization datasets differently
            if dataset_name == 'cnn_dailymail':
                text = item.get('article', '')
                summary = item.get('highlights', '')
                sentiment = 'neutral'
            elif dataset_name == 'xsum':
                text = item.get('document', '')
                summary = item.get('summary', '')
                sentiment = 'neutral'
            elif dataset_name == 'samsum':
                text = item.get('dialogue', '')
                summary = item.get('summary', '')
                sentiment = 'neutral'
            else:
                # Review datasets - extract text and label
                text = None
                label = None

                # Try common field names
                if "content" in item:
                    text = item["content"]
                elif "text" in item:
                    text = item["text"]
                elif "review" in item:
                    text = item["review"]

                if "label" in item:
                    label = item["label"]
                elif "sentiment" in item:
                    label = item["sentiment"]
                elif "stars" in item:
                    label = item["stars"]

                if not text:
                    continue

                # Skip very short reviews
                if len(text) &amp;lt; 50:
                    continue

                # Determine sentiment from label
                sentiment = "neutral"
                if isinstance(label, int):
                    if label &amp;gt;= 4 or label == 1:  # 5-star or positive binary
                        sentiment = "positive"
                    elif label &amp;lt;= 2 or label == 0:  # 1-2 star or negative binary
                        sentiment = "negative"
                    else:
                        sentiment = "neutral"
                elif isinstance(label, str):
                    sentiment = label.lower()

                # Create summary (first 150 chars or extract key points)
                # NOTE: This is NOT a real summary, just for demo purposes
                summary = create_summary_from_text(text)

            # Skip if no text or summary
            if not text or not summary or len(text) &amp;lt; 50:
                continue

            reviews.append(
                {
                    "text": text,
                    "summary": summary,
                    "sentiment": sentiment,
                    "source": dataset_name,
                }
            )

            count += 1
            if count % 500 == 0:
                print(f"Processed {count} samples...")

        print(f"✅ Processed {len(reviews)} samples from Hugging Face dataset")
        return reviews

    except Exception as e:
        print(f"\n⚠️  Error loading dataset: {str(e)}")
        print(f"\nAvailable datasets:")
        print("  Summarization (recommended):")
        print("    - cnn_dailymail (news articles with summaries)")
        print("    - xsum (news with one-sentence summaries)")
        print("    - samsum (dialogues with summaries)")
        print("  Reviews (no real summaries):")
        print("    - amazon_polarity")
        print("    - yelp_review_full")
        print("    - imdb")
        print("    - rotten_tomatoes")
        print("    - app_reviews")
        print(
            "\nTry: python scripts/download_training_data.py --source huggingface --dataset cnn_dailymail"
        )
        return []


def create_summary_from_text(text: str, max_length: int = 150) -&amp;gt; str:
    """
    Create a simple summary from review text
    Takes first sentence or first N characters
    """
    # Try to get first sentence
    sentences = text.split(".")
    if sentences and len(sentences[0]) &amp;gt; 20:
        summary = sentences[0].strip() + "."
        if len(summary) &amp;lt;= max_length:
            return summary

    # Otherwise, take first N characters
    if len(text) &amp;lt;= max_length:
        return text

    return text[:max_length].rsplit(" ", 1)[0] + "..."


def download_file(url: str, output_path: str):
    """Download file from URL with progress"""
    print(f"Downloading from {url}...")

    def progress_hook(count, block_size, total_size):
        percent = int(count * block_size * 100 / total_size)
        print(f"\rProgress: {percent}%", end="", flush=True)

    urllib.request.urlretrieve(url, output_path, progress_hook)
    print("\nDownload complete!")


def download_amazon_reviews(
    output_dir: Path, category: str = "Electronics", max_samples: int = 10000
):
    """
    Download Amazon Customer Reviews dataset
    Source: https://registry.opendata.aws/amazon-reviews/
    """
    print(f"\n📦 Downloading Amazon Reviews - {category} category")
    print(f"This may take a few minutes...")

    # Amazon Reviews Open Data URLs
    base_url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv"
    filename = f"amazon_reviews_us_{category}_v1_00.tsv.gz"
    url = f"{base_url}/{filename}"

    # Download
    temp_file = output_dir / filename

    try:
        download_file(url, str(temp_file))
    except Exception as e:
        print(f"\n⚠️  Download failed: {str(e)}")
        print(f"\nTrying alternative method using AWS CLI...")

        # Try using AWS CLI as fallback
        import subprocess

        try:
            result = subprocess.run(
                [
                    "aws",
                    "s3",
                    "cp",
                    f"s3://amazon-reviews-pds/tsv/{filename}",
                    str(temp_file),
                ],
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                print(f"AWS CLI also failed: {result.stderr}")
                print(f"\n💡 Tip: You can manually download from:")
                print(f"   {url}")
                print(f"   Save to: {temp_file}")
                return []
        except FileNotFoundError:
            print(f"AWS CLI not found. Please install it or download manually from:")
            print(f"   {url}")
            return []

    # Parse and convert to JSONL
    print(f"\nProcessing reviews...")
    reviews = []

    with gzip.open(temp_file, "rt", encoding="utf-8") as f:
        # Skip header
        header = f.readline().strip().split("\t")

        # Find column indices
        try:
            review_idx = header.index("review_body")
            headline_idx = header.index("review_headline")
            rating_idx = header.index("star_rating")
        except ValueError:
            print("Error: Could not find required columns in dataset")
            return []

        count = 0
        for line in f:
            if count &amp;gt;= max_samples:
                break

            try:
                fields = line.strip().split("\t")
                if len(fields) &amp;lt;= max(review_idx, headline_idx, rating_idx):
                    continue

                review_text = fields[review_idx]
                headline = fields[headline_idx]
                rating = int(fields[rating_idx])

                # Skip empty reviews
                if not review_text or len(review_text) &amp;lt; 50:
                    continue

                # Determine sentiment from rating
                if rating &amp;gt;= 4:
                    sentiment = "positive"
                elif rating &amp;lt;= 2:
                    sentiment = "negative"
                else:
                    sentiment = "neutral"

                # Use headline as summary (not perfect but works for training)
                # In production, you'd want human-written summaries
                summary = headline if headline else review_text[:100]

                reviews.append(
                    {
                        "text": review_text,
                        "summary": summary,
                        "sentiment": sentiment,
                        "rating": rating,
                    }
                )

                count += 1
                if count % 1000 == 0:
                    print(f"Processed {count} reviews...")

            except Exception:
                # Skip malformed rows (e.g. non-numeric star_rating)
                continue

    # Clean up temp file
    temp_file.unlink()

    print(f"✅ Processed {len(reviews)} reviews from Amazon dataset")
    return reviews


def create_synthetic_data(num_samples: int = 1000) -&amp;gt; List[Dict]:
    """
    Create synthetic training data for testing
    Use this if you can't download real data
    """
    print(f"\n🔧 Generating {num_samples} synthetic reviews...")

    templates = {
        "positive": [
            (
                "This product is absolutely amazing! {feature1} and {feature2}. Highly recommend to anyone looking for quality.",
                "Excellent product with great {feature1} and {feature2}. Highly recommended.",
            ),
            (
                "I'm very impressed with this purchase. The {feature1} exceeded my expectations and {feature2}. Worth every penny!",
                "Very satisfied with {feature1} and {feature2}. Great value.",
            ),
            (
                "Outstanding quality! {feature1} is incredible and {feature2}. Best purchase I've made this year.",
                "Outstanding {feature1} and {feature2}. Excellent purchase.",
            ),
        ],
        "negative": [
            (
                "Very disappointed with this product. {issue1} and {issue2}. Would not recommend.",
                "Poor quality with {issue1} and {issue2}. Not recommended.",
            ),
            (
                "This is a waste of money. {issue1} after just a few days and {issue2}. Terrible experience.",
                "Product failed quickly with {issue1} and {issue2}. Waste of money.",
            ),
            (
                "Do not buy this! {issue1} and {issue2}. Customer service was unhelpful too.",
                "Major issues with {issue1} and {issue2}. Poor support.",
            ),
        ],
        "neutral": [
            (
                "It's okay for the price. {aspect1} but {aspect2}. Nothing special.",
                "Average product. {aspect1} but {aspect2}.",
            ),
            (
                "Does what it's supposed to do. {aspect1} though {aspect2}. Acceptable.",
                "Functional product. {aspect1} with {aspect2}.",
            ),
            (
                "Mixed feelings about this. {aspect1} but {aspect2}. Could be better.",
                "Mixed quality. {aspect1} but {aspect2}.",
            ),
        ],
    }

    features = [
        "The battery life is excellent",
        "The build quality feels premium",
        "The performance is outstanding",
        "The design is beautiful",
        "The screen quality is amazing",
        "The sound quality is superb",
        "The camera takes great photos",
        "The speed is impressive",
    ]

    issues = [
        "It stopped working",
        "The battery drains quickly",
        "The build quality is poor",
        "It feels cheap and flimsy",
        "The performance is sluggish",
        "It overheats constantly",
        "The screen is dim",
        "The sound quality is terrible",
    ]

    aspects = [
        "The price is reasonable",
        "It works as advertised",
        "The design is acceptable",
        "The features are basic",
        "The quality is average",
        "The performance is adequate",
    ]

    reviews = []
    sentiments = ["positive", "negative", "neutral"]

    for i in range(num_samples):
        sentiment = random.choice(sentiments)
        template_text, template_summary = random.choice(templates[sentiment])

        if sentiment == "positive":
            text = template_text.format(
                feature1=random.choice(features), feature2=random.choice(features)
            )
            summary = template_summary.format(
                feature1=random.choice(features).lower(),
                feature2=random.choice(features).lower(),
            )
        elif sentiment == "negative":
            text = template_text.format(
                issue1=random.choice(issues), issue2=random.choice(issues)
            )
            summary = template_summary.format(
                issue1=random.choice(issues).lower(),
                issue2=random.choice(issues).lower(),
            )
        else:
            text = template_text.format(
                aspect1=random.choice(aspects), aspect2=random.choice(aspects)
            )
            summary = template_summary.format(
                aspect1=random.choice(aspects).lower(),
                aspect2=random.choice(aspects).lower(),
            )

        reviews.append({"text": text, "summary": summary, "sentiment": sentiment})

    print(f"✅ Generated {len(reviews)} synthetic reviews")
    return reviews


def split_and_save_data(
    reviews: List[Dict], output_dir: Path, train_ratio=0.8, val_ratio=0.1
):
    """Split data into train/val/test and save as JSONL"""

    # Shuffle
    random.shuffle(reviews)

    # Calculate splits
    total = len(reviews)
    train_size = int(total * train_ratio)
    val_size = int(total * val_ratio)

    train_data = reviews[:train_size]
    val_data = reviews[train_size : train_size + val_size]
    test_data = reviews[train_size + val_size :]

    # Save files
    output_dir.mkdir(parents=True, exist_ok=True)

    def save_jsonl(data, filename):
        filepath = output_dir / filename
        with open(filepath, "w") as f:
            for item in data:
                f.write(json.dumps(item) + "\n")
        print(f"  ✓ {filename}: {len(data)} samples")

    print(f"\n💾 Saving data to {output_dir}/")
    save_jsonl(train_data, "train.jsonl")
    save_jsonl(val_data, "validation.jsonl")
    save_jsonl(test_data, "test.jsonl")

    print(f"\n📊 Data split:")
    print(f"  Training:   {len(train_data)} samples ({train_ratio*100:.0f}%)")
    print(f"  Validation: {len(val_data)} samples ({val_ratio*100:.0f}%)")
    print(
        f"  Test:       {len(test_data)} samples ({(1-train_ratio-val_ratio)*100:.0f}%)"
    )


def main():
    parser = argparse.ArgumentParser(
        description="Download and prepare training data for review summarization",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Download from Hugging Face (recommended)
  python scripts/download_training_data.py --source huggingface --dataset amazon_polarity --num-samples 5000

  # Download different Hugging Face dataset
  python scripts/download_training_data.py --source huggingface --dataset yelp_review_full --num-samples 3000

  # Download real Amazon reviews (Electronics)
  python scripts/download_training_data.py --source amazon --max-samples 5000

  # Generate synthetic data for testing
  python scripts/download_training_data.py --source synthetic --num-samples 1000

  # Custom output directory
  python scripts/download_training_data.py --source huggingface --dataset imdb --output-dir my_data/
        """,
    )

    parser.add_argument(
        "--source",
        type=str,
        default="huggingface",
        choices=["huggingface", "amazon", "synthetic"],
        help="Data source (default: huggingface)",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="amazon_polarity",
        help="Hugging Face dataset name (default: amazon_polarity)",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="training_data",
        help="Output directory (default: training_data)",
    )
    parser.add_argument(
        "--max-samples",
        type=int,
        default=5000,
        help="Max samples to download from Amazon (default: 5000)",
    )
    parser.add_argument(
        "--num-samples",
        type=int,
        default=5000,
        help="Number of samples to generate/download (default: 5000)",
    )
    parser.add_argument(
        "--category",
        type=str,
        default="Electronics",
        help="Amazon reviews category (default: Electronics)",
    )

    args = parser.parse_args()

    output_dir = Path(args.output_dir)

    print("=" * 60)
    print("📚 Training Data Preparation")
    print("=" * 60)

    # Get data based on source
    if args.source == "huggingface":
        reviews = download_huggingface_dataset(
            output_dir=output_dir,
            dataset_name=args.dataset,
            max_samples=args.num_samples,
        )
        if not reviews:
            print("\n❌ Failed to download from Hugging Face.")
            print("Please check your internet connection or try a different dataset.")
            return
    elif args.source == "amazon":
        reviews = download_amazon_reviews(
            output_dir=output_dir, category=args.category, max_samples=args.max_samples
        )
        if not reviews:
            print("\n❌ Failed to download Amazon reviews.")
            print("Please check your internet connection or AWS CLI configuration.")
            return
    else:
        reviews = create_synthetic_data(args.num_samples)

    # Split and save
    if reviews:
        split_and_save_data(reviews, output_dir)

        print("\n" + "=" * 60)
        print("✅ Data preparation complete!")
        print("=" * 60)
        print(f"\nNext steps:")
        print(f"1. Review the data in {output_dir}/")
        print(f"2. Upload to S3:")
        print(f"   python scripts/upload_training_data.py")
        print(f"3. Start training:")
        print(f"   python scripts/start_training.py")
    else:
        print("\n❌ No data was generated")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset is downloaded from the source specified when running the script; if no source is given, it defaults to Hugging Face.&lt;br&gt;
Since we use instruction fine-tuning, each record is formatted as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                {
                    "text": text,
                    "summary": summary,
                    "sentiment": sentiment,
                    "source": dataset_name,
                }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
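A record in this shape still has to be rendered into a prompt/completion pair before instruction fine-tuning. Here is a minimal sketch of one possible rendering; the prompt template is an assumption for illustration, not the exact one used by the training script:

```python
# Sketch: turn a stored review record into an instruction-style training pair.
# The wording of the prompt is a hypothetical template for illustration.
import json

def to_instruction_pair(record: dict) -> dict:
    prompt = (
        "Summarize the following product review and state its sentiment.\n\n"
        f"Review: {record['text']}"
    )
    completion = f"Summary: {record['summary']}\nSentiment: {record['sentiment']}"
    return {"prompt": prompt, "completion": completion}

record = {
    "text": "The battery life is excellent and the build quality feels premium.",
    "summary": "Excellent battery life and premium build.",
    "sentiment": "positive",
}
print(json.dumps(to_instruction_pair(record), indent=2))
```

Whatever template you settle on, keep it identical between training and inference so the model sees the same framing at both stages.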



&lt;p&gt;before being split into training, validation, and test sets.&lt;br&gt;
&lt;strong&gt;Upload to the S3 bucket&lt;/strong&gt;&lt;br&gt;
Let’s create a script to upload the datasets to the S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python3
"""
Upload training data to S3 training bucket

This script uploads your prepared training data to the S3 bucket created by
the training pipeline stack. It automatically finds the correct bucket name
from CloudFormation outputs.

Prerequisites:
    1. Deploy training pipeline: cdk deploy TrainingPipeline
    2. Prepare data: python scripts/download_training_data.py

Usage:
    # Upload data for kate environment
    python scripts/upload_training_data.py

    # Upload for different environment
    python scripts/upload_training_data.py --environment dev
"""

import boto3
import argparse
from pathlib import Path
import os


def get_training_bucket(environment='kate'):
    """Get training bucket name from CloudFormation stack"""
    cfn = boto3.client('cloudformation')
    stack_name = f'training-pipeline-{environment}'

    try:
        response = cfn.describe_stacks(StackName=stack_name)
        outputs = response['Stacks'][0]['Outputs']

        for output in outputs:
            if output['OutputKey'] == 'TrainingBucketName':
                return output['OutputValue']

        print("❌ Error: Could not find TrainingBucketName in stack outputs")
        return None

    except Exception as e:
        print(f"❌ Error: Could not find stack '{stack_name}': {e}")
        print("Make sure you've deployed the training pipeline first:")
        print("  cdk deploy TrainingPipeline")
        return None


def upload_directory(local_dir: Path, bucket_name: str, s3_prefix: str = ''):
    """Upload directory contents to S3"""
    s3 = boto3.client('s3')

    if not local_dir.exists():
        print(f"❌ Error: Directory not found: {local_dir}")
        print(f"\nRun this first to download training data:")
        print(f"  python scripts/download_training_data.py")
        return False

    # Get list of files
    files = list(local_dir.glob('*.jsonl'))

    if not files:
        print(f"❌ Error: No .jsonl files found in {local_dir}")
        print(f"\nExpected files:")
        print(f"  - train.jsonl")
        print(f"  - validation.jsonl")
        print(f"  - test.jsonl")
        return False

    print(f"\n📤 Uploading {len(files)} files to s3://{bucket_name}/{s3_prefix}")
    print("=" * 60)

    uploaded = 0
    for file_path in files:
        s3_key = f"{s3_prefix}{file_path.name}" if s3_prefix else file_path.name

        try:
            # Get file size
            file_size = file_path.stat().st_size
            file_size_mb = file_size / (1024 * 1024)

            print(f"  Uploading {file_path.name} ({file_size_mb:.2f} MB)...", end='', flush=True)

            # Upload the file (no progress callback needed for small files)
            s3.upload_file(str(file_path), bucket_name, s3_key)

            print(" ✓")
            uploaded += 1

        except Exception as e:
            print(f" ✗")
            print(f"    Error: {str(e)}")

    print("=" * 60)
    print(f"✅ Uploaded {uploaded}/{len(files)} files successfully")

    return uploaded == len(files)


def verify_upload(bucket_name: str, s3_prefix: str = ''):
    """Verify files were uploaded correctly"""
    s3 = boto3.client('s3')

    print(f"\n🔍 Verifying upload...")

    try:
        response = s3.list_objects_v2(
            Bucket=bucket_name,
            Prefix=s3_prefix
        )

        if 'Contents' not in response:
            print("❌ No files found in bucket")
            return False

        print(f"\n📁 Files in s3://{bucket_name}/{s3_prefix}")
        print("=" * 60)

        total_size = 0
        for obj in response['Contents']:
            key = obj['Key']
            size = obj['Size']
            size_mb = size / (1024 * 1024)
            total_size += size
            print(f"  ✓ {key} ({size_mb:.2f} MB)")

        total_size_mb = total_size / (1024 * 1024)
        print("=" * 60)
        print(f"Total: {len(response['Contents'])} files, {total_size_mb:.2f} MB")

        return True

    except Exception as e:
        print(f"❌ Error verifying upload: {str(e)}")
        return False


def main():
    parser = argparse.ArgumentParser(
        description='Upload training data to S3',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Upload data for kate environment
  python scripts/upload_training_data.py

  # Upload for different environment
  python scripts/upload_training_data.py --environment dev

  # Upload from custom directory
  python scripts/upload_training_data.py --data-dir my_data/

  # Upload to specific S3 prefix
  python scripts/upload_training_data.py --s3-prefix data/v1/
        """
    )

    parser.add_argument('--environment', type=str, default='kate',
                        help='Environment name (default: kate)')
    parser.add_argument('--data-dir', type=str, default='training_data',
                        help='Local data directory (default: training_data)')
    parser.add_argument('--s3-prefix', type=str, default='',
                        help='S3 prefix/folder (default: root)')

    args = parser.parse_args()

    print("=" * 60)
    print("📤 Upload Training Data to S3")
    print("=" * 60)

    # Get training bucket
    print(f"\n🔍 Looking up training bucket for environment: {args.environment}")
    bucket_name = get_training_bucket(args.environment)

    if not bucket_name:
        return

    print(f"✓ Found bucket: {bucket_name}")

    # Upload files
    local_dir = Path(args.data_dir)
    success = upload_directory(local_dir, bucket_name, args.s3_prefix)

    if not success:
        return

    # Verify upload
    verify_upload(bucket_name, args.s3_prefix)

    print("\n" + "=" * 60)
    print("✅ Upload complete!")
    print("=" * 60)
    print(f"\nNext steps:")
    print(f"1. Start training job:")
    print(f"   python scripts/start_training.py --environment {args.environment}")
    print(f"\n2. Monitor training:")
    print(f"   - AWS Console: https://console.aws.amazon.com/sagemaker/home#/jobs")
    print(f"   - CLI: aws sagemaker list-training-jobs --sort-by CreationTime --sort-order Descending")


if __name__ == '__main__':
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
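The upload script assumes the JSONL files are well-formed. A stdlib-only validator can catch malformed records before they reach S3; the required keys below match the records produced by the data-preparation script:

```python
# Sketch: validate a JSONL file before upload. Every line must parse as
# JSON and carry the keys the training script expects.
import json
from pathlib import Path

REQUIRED_KEYS = {"text", "summary", "sentiment"}

def validate_jsonl(filepath: Path) -> int:
    """Return the number of valid records; raise on the first bad line."""
    count = 0
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            count += 1
    return count
```

Running this over train.jsonl, validation.jsonl, and test.jsonl before calling the upload script is cheap insurance against a training job failing an hour in.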



&lt;p&gt;&lt;strong&gt;Training script&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python3
"""
Start a SageMaker training job for fine-tuning review summarization model

This script starts a SageMaker training job that fine-tunes a T5 or DistilBERT
model on your review data. It automatically configures the job using resources
from your deployed training pipeline stack.

Prerequisites:
    1. Deploy training pipeline: cdk deploy TrainingPipeline
    2. Prepare data: python scripts/download_training_data.py
    3. Upload data: python scripts/upload_training_data.py

Usage:
    # Start training with defaults (t5-small, 3 epochs, ml.g4dn.xlarge GPU)
    python scripts/start_training.py

    # Custom hyperparameters
    python scripts/start_training.py --epochs 5 --batch-size 16 --learning-rate 3e-5

    # Use a larger GPU instance
    python scripts/start_training.py --instance-type ml.p3.2xlarge
"""

import boto3
import argparse
from datetime import datetime
import os

# Get region from environment or use default
REGION = os.environ.get('AWS_REGION') or os.environ.get('AWS_DEFAULT_REGION') or 'ap-southeast-2'

sagemaker_client = boto3.client('sagemaker', region_name=REGION)
cfn = boto3.client('cloudformation', region_name=REGION)
s3 = boto3.client('s3', region_name=REGION)
sts = boto3.client('sts', region_name=REGION)


def get_stack_outputs(stack_name):
    """Get outputs from CloudFormation stack"""
    response = cfn.describe_stacks(StackName=stack_name)
    outputs = {}
    for output in response['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs


def upload_training_code(model_bucket):
    """Upload training script to S3"""
    import tarfile
    import tempfile
    import os

    # Create a temporary tar.gz file with the training code
    with tempfile.NamedTemporaryFile(suffix='.tar.gz', delete=False) as tmp:
        tmp_path = tmp.name

    try:
        with tarfile.open(tmp_path, 'w:gz') as tar:
            tar.add('sagemaker/train.py', arcname='train.py')
            tar.add('sagemaker/requirements.txt', arcname='requirements.txt')

        # Upload to S3
        timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
        s3_key = f'code/sourcedir-{timestamp}.tar.gz'
        s3.upload_file(tmp_path, model_bucket, s3_key)

        return f's3://{model_bucket}/{s3_key}'
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


def get_training_image():
    """Get the PyTorch training container image for the current region"""
    region = boto3.session.Session().region_name

    # PyTorch 2.0 training image
    pytorch_version = '2.0.1'
    python_version = 'py310'

    # ECR image URI format
    image_uri = f'763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:{pytorch_version}-gpu-{python_version}-cu118-ubuntu20.04-sagemaker'

    return image_uri


def start_training_job(
    environment='kate',
    model_name='t5-small',
    epochs=3,
    batch_size=8,
    learning_rate=2e-5,
    instance_type='ml.g4dn.xlarge',
    use_lora=True,
    lora_r=8,
    lora_alpha=32,
    lora_dropout=0.1
):
    """Start a SageMaker training job"""

    # Get stack outputs
    stack_name = f'training-pipeline-{environment}'
    print(f"Getting outputs from stack: {stack_name}")

    try:
        outputs = get_stack_outputs(stack_name)
    except Exception:
        print(f"Error: Could not find stack '{stack_name}'")
        print("Make sure you've deployed the training pipeline first:")
        print("  cdk deploy TrainingPipeline")
        return

    training_bucket = outputs['TrainingBucketName']
    model_bucket = outputs['ModelBucketName']
    sagemaker_role = outputs['SageMakerRoleArn']

    print(f"Training bucket: {training_bucket}")
    print(f"Model bucket: {model_bucket}")
    print(f"SageMaker role: {sagemaker_role}")

    # Upload training code to S3
    print("\nUploading training code to S3...")
    source_code_uri = upload_training_code(model_bucket)
    print(f"Training code uploaded to: {source_code_uri}")

    # Generate job name with timestamp
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    job_name = f'review-summarizer-{environment}-{timestamp}'

    # Training job configuration
    training_config = {
        'TrainingJobName': job_name,
        'RoleArn': sagemaker_role,
        'AlgorithmSpecification': {
            'TrainingImage': get_training_image(),
            'TrainingInputMode': 'File',
        },
        'InputDataConfig': [
            {
                'ChannelName': 'training',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': f's3://{training_bucket}/',
                        'S3DataDistributionType': 'FullyReplicated',
                    }
                },
                'ContentType': 'application/json',
                'CompressionType': 'None',
            }
        ],
        'OutputDataConfig': {
            'S3OutputPath': f's3://{model_bucket}/models/',
        },
        'ResourceConfig': {
            'InstanceType': instance_type,
            'InstanceCount': 1,
            'VolumeSizeInGB': 30,
        },
        'StoppingCondition': {
            'MaxRuntimeInSeconds': 86400,  # 24 hours
        },
        'HyperParameters': {
            'sagemaker_program': 'train.py',
            'sagemaker_submit_directory': source_code_uri,
            'epochs': str(epochs),
            'batch_size': str(batch_size),
            'learning_rate': str(learning_rate),
            'model_name': model_name,
            'use_lora': str(use_lora).lower(),
            'lora_r': str(lora_r),
            'lora_alpha': str(lora_alpha),
            'lora_dropout': str(lora_dropout),
        },
        'Tags': [
            {'Key': 'Environment', 'Value': environment},
            {'Key': 'Project', 'Value': 'review-summarizer'},
        ],
    }

    print(f"\nStarting training job: {job_name}")
    print(f"Model: {model_name}")
    print(f"Instance: {instance_type}")
    print(f"Training method: {'LoRA (Parameter-Efficient)' if use_lora else 'Full Fine-Tuning'}")
    print(f"Hyperparameters:")
    print(f"  - Epochs: {epochs}")
    print(f"  - Batch size: {batch_size}")
    print(f"  - Learning rate: {learning_rate}")
    if use_lora:
        print(f"  - LoRA rank: {lora_r}")
        print(f"  - LoRA alpha: {lora_alpha}")
        print(f"  - LoRA dropout: {lora_dropout}")

    try:
        response = sagemaker_client.create_training_job(**training_config)
        print(f"\n✅ Training job started successfully!")
        print(f"Job ARN: {response['TrainingJobArn']}")
        print(f"\nMonitor progress:")
        region = boto3.session.Session().region_name
        print(f"  - AWS Console: https://{region}.console.aws.amazon.com/sagemaker/home?region={region}#/jobs/{job_name}")
        print(f"  - CLI: aws sagemaker describe-training-job --training-job-name {job_name}")
        print(f"\nView logs:")
        print(f"  aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix {job_name}")

    except Exception as e:
        print(f"\n❌ Error starting training job: {str(e)}")
        print("\nTroubleshooting:")
        print(f"1. Make sure training data exists in s3://{training_bucket}/")
        print("2. Check IAM role has necessary permissions")
        print("3. Verify the training image is available in your region")
        print("4. Check training script exists: sagemaker/train.py")


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Start SageMaker training job for review summarization',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Start training with defaults
  python scripts/start_training.py

  # Custom hyperparameters
  python scripts/start_training.py --epochs 5 --batch-size 16

  # Use larger instance
  python scripts/start_training.py --instance-type ml.p3.2xlarge

  # Different environment
  python scripts/start_training.py --environment dev
        """
    )

    parser.add_argument('--environment', type=str, default='kate',
                        help='Environment name (default: kate)')
    parser.add_argument('--model-name', type=str, default='t5-small',
                        help='Base model to fine-tune (default: t5-small)')
    parser.add_argument('--epochs', type=int, default=3,
                        help='Number of training epochs (default: 3)')
    parser.add_argument('--batch-size', type=int, default=8,
                        help='Training batch size (default: 8)')
    parser.add_argument('--learning-rate', type=float, default=2e-5,
                        help='Learning rate (default: 2e-5)')
    parser.add_argument('--instance-type', type=str, default='ml.g4dn.xlarge',
                        help='SageMaker instance type (default: ml.g4dn.xlarge)')
    parser.add_argument('--use-lora', action='store_true', default=True,
                        help='Enable LoRA fine-tuning (default: True)')
    parser.add_argument('--no-lora', dest='use_lora', action='store_false',
                        help='Disable LoRA and use full fine-tuning')
    parser.add_argument('--lora-r', type=int, default=8,
                        help='LoRA rank (default: 8)')
    parser.add_argument('--lora-alpha', type=int, default=32,
                        help='LoRA alpha scaling (default: 32)')
    parser.add_argument('--lora-dropout', type=float, default=0.1,
                        help='LoRA dropout (default: 0.1)')

    args = parser.parse_args()

    start_training_job(
        environment=args.environment,
        model_name=args.model_name,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        instance_type=args.instance_type,
        use_lora=args.use_lora,
        lora_r=args.lora_r,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
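SageMaker delivers the HyperParameters above to the training container as command-line strings, so train.py has to convert each one back to its real type. A minimal sketch of the receiving side; the exact argument set in sagemaker/train.py may differ:

```python
# Sketch: parsing SageMaker hyperparameters inside the training container.
# Everything arrives as a string (e.g. --use_lora true), so each argument
# needs an explicit type conversion.
import argparse

def str2bool(value: str) -> bool:
    return value.lower() in ("true", "1", "yes")

def parse_hyperparameters(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--model_name", type=str, default="t5-small")
    parser.add_argument("--use_lora", type=str2bool, default=True)
    parser.add_argument("--lora_r", type=int, default=8)
    parser.add_argument("--lora_alpha", type=int, default=32)
    parser.add_argument("--lora_dropout", type=float, default=0.1)
    return parser.parse_args(argv)
```

Note that booleans need the custom converter: argparse's plain `type=bool` would treat any non-empty string, including "false", as True.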



&lt;h3&gt;
  
  
  3. Inference Pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  3.1 Create the pipeline
&lt;/h4&gt;

&lt;p&gt;This pipeline creates the following AWS resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 result bucket&lt;/li&gt;
&lt;li&gt;API Gateway&lt;/li&gt;
&lt;li&gt;Lambda function&lt;/li&gt;
&lt;li&gt;IAM Roles&lt;/li&gt;
&lt;li&gt;CloudWatch logs
&lt;/li&gt;
&lt;/ul&gt;
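Once the stack is deployed, the POST /summarize endpoint can be exercised with a short client sketch. The URL and request schema here are placeholders; check the Lambda handler in src/lambdas/summarizer for the exact fields it expects:

```python
# Sketch: calling the deployed summarization API with only the stdlib.
# API_URL and the {"text": ...} payload shape are hypothetical placeholders.
import json
import urllib.request

API_URL = "https://example.execute-api.ap-southeast-2.amazonaws.com/kate"  # placeholder

def build_request(review_text: str) -> urllib.request.Request:
    payload = json.dumps({"text": review_text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/summarize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually call the API once deployed:
# with urllib.request.urlopen(build_request("Great battery life!")) as resp:
#     print(json.load(resp))
```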

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
 * Inference Pipeline Stack
 * 
 * This stack creates the infrastructure for online review summarization.
 * It implements a multi-stage processing pipeline:
 * 
 * 1. API Gateway - REST API endpoint for incoming requests
 * 2. Lambda Orchestrator - Coordinates the summarization pipeline
 * 3. Amazon Bedrock - Generates fast, general-purpose summaries
 * 4. Amazon OpenSearch - Retrieves relevant context via RAG (optional)
 * 5. SageMaker Endpoint - Refines summary with fine-tuned model (optional)
 * 6. S3 Results Bucket - Stores final summaries and metadata
 * 
 * Request Flow:
 * POST /summarize → Lambda → Bedrock → OpenSearch → SageMaker → S3 → Response
 * 
 * The Lambda function reads the active SageMaker endpoint from Parameter Store,
 * enabling zero-downtime model updates when new versions are deployed.
 */

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { PythonFunction } from '@aws-cdk/aws-lambda-python-alpha';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as ssm from 'aws-cdk-lib/aws-ssm';
import { EnvironmentConfig } from './utils';

export interface InferencePipelineStackProps extends cdk.StackProps {
  config: EnvironmentConfig;
  endpointParameterName: string;
  appConfigApplicationId?: string;
  appConfigEnvironmentId?: string;
  appConfigProfileId?: string;
}

export class InferencePipelineStack extends cdk.Stack {
  public readonly api: apigateway.RestApi;
  public readonly resultsBucket: s3.Bucket;
  public readonly summarizerFunction: lambda.Function;

  constructor(scope: Construct, id: string, props: InferencePipelineStackProps) {
    super(scope, id, props);

    const { config, endpointParameterName, appConfigApplicationId, appConfigEnvironmentId, appConfigProfileId } = props;

    // ========================================
    // S3 Bucket for Results
    // ========================================

    // NOTE: Using DESTROY for cost-saving during development
    // Results are temporary and can be safely deleted
    this.resultsBucket = new s3.Bucket(this, 'ResultsBucket', {
      bucketName: `summarizer-results-${config.environmentName}-${cdk.Aws.ACCOUNT_ID}`,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      lifecycleRules: [
        {
          id: 'DeleteOldResults',
          expiration: cdk.Duration.days(30),
        },
      ],
    });

    // ========================================
    // Lambda: Main Summarizer Function
    // ========================================

    const summarizerLogGroup = new logs.LogGroup(this, 'SummarizerLogGroup', {
      logGroupName: `/aws/lambda/summarizer-${config.environmentName}`,
      retention: logs.RetentionDays.ONE_WEEK,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    const summarizerFn = new PythonFunction(this, 'SummarizerFunction', {
      functionName: `summarizer-${config.environmentName}`,
      entry: 'src/lambdas/summarizer',
      runtime: lambda.Runtime.PYTHON_3_11,
      index: 'handler.py',
      handler: 'handler',
      description: `Review summarization function for ${config.environmentName}`,
      timeout: cdk.Duration.seconds(120),
      memorySize: 1024,
      environment: {
        RESULTS_BUCKET: this.resultsBucket.bucketName,
        ENDPOINT_PARAMETER: endpointParameterName,
        ENVIRONMENT: config.environmentName,
        OPENSEARCH_ENDPOINT: process.env.OPENSEARCH_ENDPOINT || 'none',
        // AppConfig IDs (if provided)
        ...(appConfigApplicationId &amp;amp;&amp;amp; { APPCONFIG_APPLICATION_ID: appConfigApplicationId }),
        ...(appConfigEnvironmentId &amp;amp;&amp;amp; { APPCONFIG_ENVIRONMENT_ID: appConfigEnvironmentId }),
        ...(appConfigProfileId &amp;amp;&amp;amp; { APPCONFIG_CONFIGURATION_PROFILE_ID: appConfigProfileId }),
      },
      logGroup: summarizerLogGroup,
    });

    // Expose Lambda function for AppConfig permissions
    this.summarizerFunction = summarizerFn;

    // Grant permissions
    this.resultsBucket.grantWrite(summarizerFn);

    summarizerFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: ['bedrock:InvokeModel'],
      resources: ['*'],
    }));

    summarizerFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: ['sagemaker:InvokeEndpoint'],
      resources: ['*'],
    }));

    summarizerFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: ['ssm:GetParameter'],
      resources: [
        `arn:aws:ssm:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:parameter${endpointParameterName}`,
      ],
    }));

    // OpenSearch permissions (if using)
    summarizerFn.addToRolePolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: [
        'aoss:APIAccessAll',
        'es:ESHttpGet',
        'es:ESHttpPost',
      ],
      resources: ['*'],
    }));

    // ========================================
    // API Gateway
    // ========================================

    if (config.enableApiGateway) {
      this.api = new apigateway.RestApi(this, 'SummarizerAPI', {
        restApiName: `review-summarizer-${config.environmentName}`,
        description: 'API for review summarization with RAG',
        deployOptions: {
          stageName: config.environmentName,
          loggingLevel: apigateway.MethodLoggingLevel.INFO,
          dataTraceEnabled: true,
          metricsEnabled: true,
        },
        defaultCorsPreflightOptions: {
          allowOrigins: apigateway.Cors.ALL_ORIGINS,
          allowMethods: apigateway.Cors.ALL_METHODS,
        },
      });

      // POST /summarize endpoint
      const summarize = this.api.root.addResource('summarize');
      summarize.addMethod('POST', new apigateway.LambdaIntegration(summarizerFn), {
        apiKeyRequired: false,
        requestValidator: new apigateway.RequestValidator(this, 'RequestValidator', {
          restApi: this.api,
          validateRequestBody: true,
        }),
      });

      // GET /health endpoint
      const health = this.api.root.addResource('health');
      health.addMethod('GET', new apigateway.MockIntegration({
        integrationResponses: [{
          statusCode: '200',
          responseTemplates: {
            'application/json': '{"status": "healthy"}',
          },
        }],
        requestTemplates: {
          'application/json': '{"statusCode": 200}',
        },
      }), {
        methodResponses: [{ statusCode: '200' }],
      });

      new cdk.CfnOutput(this, 'ApiUrl', {
        value: this.api.url,
        description: 'API Gateway URL',
        exportName: `${config.environmentName}-api-url`,
      });
    }

    // ========================================
    // Outputs
    // ========================================

    new cdk.CfnOutput(this, 'ResultsBucketName', {
      value: this.resultsBucket.bucketName,
      description: 'S3 bucket for summarization results',
      exportName: `${config.environmentName}-results-bucket`,
    });

    new cdk.CfnOutput(this, 'LambdaFunctionName', {
      value: summarizerFn.functionName,
      description: 'Lambda function for summarization',
      exportName: `${config.environmentName}-summarizer-function`,
    });
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3.2 Create the scripts
&lt;/h4&gt;

&lt;p&gt;Lambda function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
Main Lambda function for review summarization pipeline

This function orchestrates a multi-stage summarization process with A/B testing support:

Stage 1: Amazon Bedrock
    - Generates fast, general-purpose summary
    - Uses Claude or other foundation models
    - Always runs (provides baseline summary)

Stage 2: RAG Retrieval (Optional)
    - Queries OpenSearch vector index for relevant context
    - Grounds summary in factual knowledge
    - Only runs if OpenSearch is configured

Stage 3: SageMaker Refinement (Optional with A/B Testing)
    - Selects model based on A/B testing rules
    - Calls fine-tuned model for domain-specific refinement
    - Extracts sentiment and confidence scores
    - Supports gradual rollouts and canary deployments

Stage 4: Storage
    - Saves results to S3 for audit trail
    - Returns JSON response to API Gateway

The function uses AWS AppConfig for dynamic A/B testing configuration,
enabling gradual model rollouts without code changes.
"""

import json
import os
import boto3
from datetime import datetime
import traceback
from appconfig_helper import (
    select_model_for_request,
    get_bedrock_config,
    log_ab_test_assignment
)
from bedrock_client import summarize_review

# Initialize AWS clients
bedrock_runtime = boto3.client('bedrock-runtime', region_name=os.environ.get('AWS_REGION', 'us-east-1'))
sagemaker_runtime = boto3.client('sagemaker-runtime')
s3_client = boto3.client('s3')
ssm_client = boto3.client('ssm')

RESULTS_BUCKET = os.environ['RESULTS_BUCKET']
ENDPOINT_PARAMETER = os.environ['ENDPOINT_PARAMETER']
BEDROCK_MODEL_ID = os.environ.get('BEDROCK_MODEL_ID', 'anthropic.claude-v2')
OPENSEARCH_ENDPOINT = os.environ.get('OPENSEARCH_ENDPOINT', 'none')


def handler(event, context):
    """
    Main handler for summarization requests

    Expected input:
    {
        "text": "Review text here...",
        "options": {
            "include_sentiment": true,
            "use_rag": true
        }
    }
    """
    try:
        # Parse input
        if 'body' in event:
            body = json.loads(event['body'])
        else:
            body = event

        text = body.get('text', '')
        options = body.get('options', {})

        # Extract request context for A/B testing
        request_context = {
            'category': body.get('category', 'general'),
            'userTier': body.get('userTier', 'standard'),
            'textLength': len(text),
            'userId': body.get('userId'),
            'requestId': context.aws_request_id if hasattr(context, 'aws_request_id') else datetime.now().isoformat(),
        }

        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'Missing required field: text'})
            }

        request_id = request_context['requestId']

        # Step 1: Get initial summary from Bedrock using Converse API
        print(f"[{request_id}] Step 1: Calling Bedrock for initial summary")
        bedrock_config = get_bedrock_config()

        bedrock_response = summarize_review(
            text=text,
            model_id=bedrock_config.get('modelId', BEDROCK_MODEL_ID),
            max_tokens=bedrock_config.get('maxTokens', 200),
            temperature=bedrock_config.get('temperature', 0.5)
        )

        initial_summary = bedrock_response['text']

        # Log token usage
        usage = bedrock_response['usage']
        print(f"Bedrock usage - Input: {usage['inputTokens']}, Output: {usage['outputTokens']}")

        # Step 2: RAG retrieval (optional)
        context_text = ""
        if options.get('use_rag', False) and OPENSEARCH_ENDPOINT != 'none':
            print(f"[{request_id}] Step 2: Retrieving context from OpenSearch")
            context_text = retrieve_context(text)
        else:
            print(f"[{request_id}] Step 2: Skipping RAG (disabled or not configured)")

        # Step 3: Select model using A/B testing
        print(f"[{request_id}] Step 3: Selecting model via A/B testing")
        endpoint_name = select_model_for_request(request_context)

        # Log A/B test assignment
        log_ab_test_assignment(request_id, endpoint_name)

        # Step 4: Refine with fine-tuned model (if endpoint exists)
        final_summary = initial_summary
        sentiment = "neutral"
        confidence = 0.0

        if endpoint_name and endpoint_name != 'none' and endpoint_name != 'ensemble':
            print(f"[{request_id}] Step 4: Refining with SageMaker endpoint: {endpoint_name}")
            refinement = refine_with_sagemaker(
                endpoint_name=endpoint_name,
                summary=initial_summary,
                context=context_text,
                original_text=text
            )
            final_summary = refinement.get('summary', initial_summary)
            sentiment = refinement.get('sentiment', 'neutral')
            confidence = refinement.get('confidence', 0.0)
        elif endpoint_name == 'ensemble':
            print(f"[{request_id}] Step 4: Using multi-model ensemble")
            # TODO: Implement ensemble logic
            final_summary = initial_summary
            sentiment = "neutral"
            confidence = 0.0
        else:
            print(f"[{request_id}] Step 4: Skipping SageMaker refinement (no endpoint configured)")

        # Step 5: Store results
        result = {
            'request_id': request_id,
            'timestamp': datetime.now().isoformat(),
            'initial_summary': initial_summary,
            'final_summary': final_summary,
            'sentiment': sentiment,
            'confidence': confidence,
            'used_rag': options.get('use_rag', False) and OPENSEARCH_ENDPOINT != 'none',
            'model_endpoint': endpoint_name,
            'request_context': request_context,
        }

        # Save to S3
        s3_key = f"results/{datetime.now().strftime('%Y/%m/%d')}/{request_id}.json"
        s3_client.put_object(
            Bucket=RESULTS_BUCKET,
            Key=s3_key,
            Body=json.dumps(result, indent=2),
            ContentType='application/json'
        )

        print(f"[{request_id}] Complete. Results saved to s3://{RESULTS_BUCKET}/{s3_key}")

        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*'
            },
            'body': json.dumps(result)
        }

    except Exception as e:
        print(f"Error: {str(e)}")
        print(traceback.format_exc())
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e),
                'traceback': traceback.format_exc()
            })
        }


def retrieve_context(query: str, top_k: int = 3) -&amp;gt; str:
    """
    Retrieve relevant context from OpenSearch
    TODO: Implement OpenSearch vector search
    """
    # Placeholder - implement OpenSearch integration
    return ""


def refine_with_sagemaker(endpoint_name: str, summary: str, context: str, original_text: str) -&amp;gt; dict:
    """
    Refine summary and extract sentiment using fine-tuned SageMaker model
    """
    try:
        # Send original text to the model for summarization
        payload = {
            "inputs": original_text
        }

        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/json',
            Body=json.dumps(payload)
        )

        result = json.loads(response['Body'].read().decode())

        # Extract the summary from the model's response
        refined_summary = result.get('summary', summary)

        return {
            'summary': refined_summary,
            'sentiment': 'neutral',  # TODO: Add sentiment analysis
            'confidence': 0.0
        }

    except Exception as e:
        print(f"SageMaker error: {str(e)}")
        return {
            'summary': summary,
            'sentiment': 'neutral',
            'confidence': 0.0
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
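&lt;p&gt;The handler imports &lt;code&gt;select_model_for_request&lt;/code&gt; from an &lt;code&gt;appconfig_helper&lt;/code&gt; module that is not shown here. One way to implement the assignment piece is deterministic hash bucketing, sketched below; the JSON shape and variant names are assumptions, and a real helper would fetch this document through the AWS AppConfig data-plane API rather than hard-coding it:&lt;/p&gt;

```python
import hashlib
import json

def assign_variant(request_id: str, variants: list) -> str:
    """Deterministically bucket a request into a weighted variant.

    variants: list of {"endpoint": str, "weight": int}, weights summing
    to 100. The same request_id always lands in the same bucket, which
    keeps A/B assignments sticky without any stored state.
    """
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform value in 0..99
    cumulative = 0
    for variant in variants:
        cumulative += variant['weight']
        if bucket < cumulative:
            return variant['endpoint']
    return variants[-1]['endpoint']  # guard against rounding gaps


# Example configuration as it might be stored in AppConfig (hypothetical):
config = json.loads('{"variants": [{"endpoint": "endpoint-kate", "weight": 90},'
                    ' {"endpoint": "endpoint-kate-canary", "weight": 10}]}')

choice = assign_variant('2026-02-15T10:33:26.300629', config['variants'])
print(choice)
```

&lt;p&gt;Because the bucket is derived from the request (or user) identifier, rollout percentages can be changed in AppConfig without redeploying the Lambda.&lt;/p&gt;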



&lt;p&gt;&lt;strong&gt;Script to test the endpoint&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
# Test script for the news summarization API

set -e

# Get API URL from CloudFormation stack
STACK_NAME="${1:-inference-pipeline-kate}"

echo "Getting API URL from stack: $STACK_NAME"
API_URL=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query 'Stacks[0].Outputs[?OutputKey==`ApiUrl`].OutputValue' \
  --output text)

if [ -z "$API_URL" ]; then
  echo "Error: Could not find API URL in stack outputs"
  exit 1
fi

echo "API URL: $API_URL"
echo ""

# Test 1: Health check
echo "Test 1: Health Check"
echo "===================="
curl -s "${API_URL}health" | jq .
echo -e "\n"

# Test 2: Technology news article
echo "Test 2: Technology News Article"
echo "================================"
curl -s -X POST "${API_URL}summarize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Apple Inc. announced today the launch of its latest iPhone model, featuring significant improvements in camera technology and battery life. The new device includes a 48-megapixel main camera, up from the previous 12-megapixel sensor, and promises up to 20 hours of video playback. The company also introduced new AI-powered features for photo editing and enhanced security measures. Pre-orders begin next Friday, with the device hitting stores two weeks later. Industry analysts predict strong sales, particularly in the premium smartphone segment. The starting price is set at $999 for the base model.",
    "options": {
      "use_rag": false
    }
  }' | jq .
echo -e "\n"

# Test 3: Political news article
echo "Test 3: Political News Article"
echo "==============================="
curl -s -X POST "${API_URL}summarize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The Senate voted 65-35 today to pass a comprehensive infrastructure bill worth $1.2 trillion. The bipartisan legislation includes funding for roads, bridges, public transit, and broadband internet expansion. Supporters argue the bill will create millions of jobs and modernize aging infrastructure. Critics express concerns about the cost and potential impact on the federal deficit. The bill now moves to the House of Representatives for consideration. President Biden praised the Senate vote, calling it a historic investment in America future. The legislation has been in negotiation for months.",
    "options": {
      "use_rag": false
    }
  }' | jq .
echo -e "\n"

# Test 4: Business news article
echo "Test 4: Business News Article"
echo "=============================="
curl -s -X POST "${API_URL}summarize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Tesla reported record quarterly earnings today, beating Wall Street expectations. The electric vehicle maker delivered 250,000 vehicles in the quarter, a 40% increase from the same period last year. Revenue reached $13.8 billion, up from $10.4 billion a year ago. CEO Elon Musk attributed the strong performance to increased production capacity and growing demand for electric vehicles globally. The company also announced plans to build two new manufacturing facilities in Europe and Asia. Tesla stock rose 8% in after-hours trading following the earnings announcement.",
    "options": {
      "use_rag": false
    }
  }' | jq .
echo -e "\n"

# Test 5: Sports news article
echo "Test 5: Sports News Article"
echo "============================"
curl -s -X POST "${API_URL}summarize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "In a thrilling championship game, the Lakers defeated the Celtics 108-105 to win their 18th NBA title. LeBron James led the team with 32 points, 11 rebounds, and 8 assists in what many are calling one of the greatest performances in Finals history. The victory came after the Lakers trailed by 15 points in the third quarter. Anthony Davis contributed 28 points and played crucial defense in the final minutes. This marks the Lakers first championship in over a decade. Head coach Frank Vogel praised the team resilience and determination throughout the playoffs.",
    "options": {
      "use_rag": false
    }
  }' | jq .
echo -e "\n"

echo "All tests completed!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deploy the app
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deploy the resource on AWS
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;cdk deploy --all&lt;/code&gt;&lt;br&gt;
This deploys three stacks: AppConfigStack, TrainingPipelineStack, and InferencePipelineStack.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Download the training data
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 scripts/download_training_data.py \
  --source huggingface \
  --dataset cnn_dailymail \
  --num-samples 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Upload the datasets to S3 bucket
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 scripts/upload_training_data.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Start training job
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 scripts/start_training.py \        
  --model-name t5-base \
  --epochs 5 \
  --batch-size 4 \
  --instance-type ml.g4dn.xlarge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The script uses LoRA for fine-tuning by default. If you want full fine-tuning instead, say so explicitly in the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 scripts/start_training.py --no-lora
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Get the metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Download metrics from S3
# Get job name from previous step
JOB_NAME="review-summarizer-kate-xxxxxx-xxxxxx"


MODEL_BUCKET=$(aws cloudformation describe-stacks \
 --stack-name training-pipeline-kate \
 --query 'Stacks[0].Outputs[?OutputKey==`ModelBucketName`].OutputValue' \
 --output text)


aws s3 cp s3://$MODEL_BUCKET/models/$JOB_NAME/output/output.tar.gz .
tar -xzf output.tar.gz


# View metrics
cat metrics.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "validation_rouge_l": 0.2694268479883026,
  "test_rouge_l": 0.27888068965217705,
  "final_train_loss": 0.8831174189448356,
  "use_lora": true,
  "trainable_params": 884736,
  "model_name": "t5-base",
  "epochs": 5,
  "batch_size": 4,
  "learning_rate": 3e-05,
  "lora_config": {
    "r": 8,
    "alpha": 32,
    "dropout": 0.1
  } 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the results, you can adjust the training script's parameters to improve them, for example by changing the pre-trained model or using more training data.&lt;/p&gt;
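&lt;p&gt;The &lt;code&gt;rouge_l&lt;/code&gt; scores in &lt;code&gt;metrics.json&lt;/code&gt; are based on the longest common subsequence between generated and reference summaries. A minimal sketch of the F1 computation (real evaluations typically use the &lt;code&gt;rouge-score&lt;/code&gt; package, which also applies stemming and tokenization rules):&lt;/p&gt;

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat sat on the mat"))  # -> 1.0
```

&lt;p&gt;A score around 0.27-0.28, as in the metrics above, is in the typical range for T5-sized models on CNN/DailyMail-style summarization.&lt;/p&gt;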

&lt;h3&gt;
  
  
  6. Register Model
&lt;/h3&gt;

&lt;p&gt;If you are happy with the results, register the model and wait for approval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create model package
aws sagemaker create-model-package \
 --model-package-group-name "review-summarizer" \
 --model-package-description "Fine-tuned T5 for review summarization" \
 --inference-specification '{
   "Containers": [{
     "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.0.1-gpu-py310",
     "ModelDataUrl": "s3://'"$MODEL_BUCKET"'/models/'"$JOB_NAME"'/output/model.tar.gz"
   }],
   "SupportedContentTypes": ["application/json"],
   "SupportedResponseMIMETypes": ["application/json"]
 }' \
 --model-approval-status "PendingManualApproval"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Approve the model package
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get model package ARN from previous step
aws sagemaker list-model-packages --model-package-group-name "review-summarizer"


MODEL_PACKAGE_ARN="arn:aws:sagemaker:ap-southeast-2:123456789012:model-package/review-summarizer/1"


# Approve model
aws sagemaker update-model-package \
 --model-package-arn $MODEL_PACKAGE_ARN \
 --model-approval-status "Approved"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This automatically triggers a Lambda function that creates the SageMaker endpoint and updates Parameter Store with the new endpoint name.&lt;/p&gt;
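&lt;p&gt;A sketch of what that approval-triggered Lambda might do. The function, resource names, role ARN, and instance type below are assumptions; the &lt;code&gt;boto3&lt;/code&gt; calls themselves (&lt;code&gt;create_model&lt;/code&gt;, &lt;code&gt;create_endpoint_config&lt;/code&gt;, &lt;code&gt;create_endpoint&lt;/code&gt;, &lt;code&gt;put_parameter&lt;/code&gt;) are the real SageMaker and SSM APIs:&lt;/p&gt;

```python
def parse_package_version(model_package_arn: str) -> str:
    """Extract the version suffix from a model-package ARN, e.g.
    '.../model-package/review-summarizer/1' -> '1'."""
    return model_package_arn.rstrip('/').rsplit('/', 1)[-1]

def deploy_approved_model(model_package_arn: str,
                          role_arn: str,
                          parameter_name: str,
                          instance_type: str = 'ml.m5.xlarge') -> str:
    """Create a SageMaker model/endpoint from an approved package and
    record the endpoint name in Parameter Store."""
    import boto3
    sm = boto3.client('sagemaker')
    ssm = boto3.client('ssm')

    version = parse_package_version(model_package_arn)
    name = f'review-summarizer-v{version}'

    # A registered model package can be referenced directly as the container
    sm.create_model(ModelName=name,
                    PrimaryContainer={'ModelPackageName': model_package_arn},
                    ExecutionRoleArn=role_arn)
    sm.create_endpoint_config(EndpointConfigName=name,
                              ProductionVariants=[{
                                  'VariantName': 'AllTraffic',
                                  'ModelName': name,
                                  'InstanceType': instance_type,
                                  'InitialInstanceCount': 1,
                              }])
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name)

    # Point the inference Lambda at the new endpoint
    ssm.put_parameter(Name=parameter_name, Value=name,
                      Type='String', Overwrite=True)
    return name


print(parse_package_version(
    'arn:aws:sagemaker:ap-southeast-2:123456789012:model-package/review-summarizer/1'))  # -> 1
```

&lt;p&gt;Versioning the endpoint name by package version keeps old endpoints available during a rollout, at the cost of cleaning them up later.&lt;/p&gt;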

&lt;h3&gt;
  
  
  8. Test the API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./scripts/test_api.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test 1: Health Check
====================
{
  "status": "healthy"
}


Test 2: Technology News Article
================================
{
  "request_id": "2026-02-15T10:33:26.300629",
  "timestamp": "2026-02-15T10:33:30.740083",
  "initial_summary": "Here is a concise summary of the customer review:\n\nThe new iPhone model features significant upgrades, including a 48-megapixel main camera and up to 20 hours of video playback. It also includes new AI-powered photo editing features and enhanced security measures. Pre-orders begin next Friday, with the device launching two weeks later. Industry analysts predict strong sales, particularly in the premium smartphone segment, with a starting price of $999 for the base model.",
  "final_summary": "the new iPhone features a 48-megapixel main camera and 20 hours of video playback. the company also introduced new AI-powered features for photo editing. Industry analysts predict strong sales, particularly in the premium smartphone segment.",
  "sentiment": "neutral",
  "confidence": 0.0,
  "used_rag": false,
  "model_endpoint": "endpoint-kate",
  "request_context": {
    "category": "general",
    "userTier": "standard",
    "textLength": 602,
    "userId": null,
    "requestId": "2026-02-15T10:33:26.300629"
  }
}


Test 3: Political News Article
===============================
{
  "request_id": "2026-02-15T10:33:30.933191",
  "timestamp": "2026-02-15T10:33:34.585612",
  "initial_summary": "Here is a concise, objective summary of the customer review:\n\nThe Senate passed a $1.2 trillion bipartisan infrastructure bill that includes funding for roads, bridges, public transit, and broadband. Supporters say it will create jobs and modernize infrastructure, while critics are concerned about the cost and impact on the federal deficit. The bill now goes to the House for consideration, and President Biden praised the Senate's historic vote.",
  "final_summary": "the bill includes funding for roads, bridges, public transit, and broadband internet expansion. President Biden calls the vote a historic investment in America future.",
  "sentiment": "neutral",
  "confidence": 0.0,
  "used_rag": false,
  "model_endpoint": "endpoint-kate",
  "request_context": {
    "category": "general",
    "userTier": "standard",
    "textLength": 598,
    "userId": null,
    "requestId": "2026-02-15T10:33:30.933191"
  }
}


Test 4: Business News Article
==============================
{
  "request_id": "2026-02-15T10:33:34.705612",
  "timestamp": "2026-02-15T10:33:38.517618",
  "initial_summary": "Here is a concise, objective summary of the customer review:\n\nTesla reported record quarterly earnings, beating Wall Street expectations. The company delivered 250,000 vehicles, a 40% increase from the previous year, and revenue reached $13.8 billion. CEO Elon Musk attributed the strong performance to increased production capacity and growing global demand for electric vehicles. Tesla also announced plans to build two new manufacturing facilities in Europe and Asia, and the stock price rose 8% after the earnings announcement.",
  "final_summary": "Tesla delivered 250,000 vehicles in the quarter, a 40% increase from the same period last year. Revenue reached $13.8 billion, up from $10.4 billion a year ago.",
  "sentiment": "neutral",
  "confidence": 0.0,
  "used_rag": false,
  "model_endpoint": "endpoint-kate",
  "request_context": {
    "category": "general",
    "userTier": "standard",
    "textLength": 570,
    "userId": null,
    "requestId": "2026-02-15T10:33:34.705612"
  }
}


Test 5: Sports News Article
============================
{
  "request_id": "2026-02-15T10:33:38.619204",
  "timestamp": "2026-02-15T10:33:42.190132",
  "initial_summary": "In a closely contested NBA Finals, the Los Angeles Lakers defeated the Boston Celtics 108-105 to win their 18th championship. LeBron James delivered an outstanding performance with 32 points, 11 rebounds, and 8 assists, while Anthony Davis added 28 points and played strong defense in the closing minutes. The Lakers overcame a 15-point deficit in the third quarter to secure the victory, showcasing their resilience and determination throughout the playoffs, as praised by head coach Frank Vogel.",
  "final_summary": "LeBron James led the team with 32 points, 11 rebounds, and 8 assists. This is the Lakers first championship in over a decade.",
  "sentiment": "neutral",
  "confidence": 0.0,
  "used_rag": false,
  "model_endpoint": "endpoint-kate",
  "request_context": {
    "category": "general",
    "userTier": "standard",
    "textLength": 563,
    "userId": null,
    "requestId": "2026-02-15T10:33:38.619204"
  }
}


All tests completed!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Now we have a complete fine-tuning pipeline with automatic model deployment. The application automatically uses the latest approved Amazon SageMaker endpoint for inference.&lt;br&gt;
In addition, we can integrate Retrieval-Augmented Generation (RAG) into the pipeline. This involves setting up Amazon OpenSearch as a vector database, embedding relevant documents, and updating the Lambda function to retrieve contextual information before generating summaries (refer to &lt;a href="https://medium.com/stackademic/build-a-knowledge-based-q-a-bot-using-bedrock-s3-dynamodb-opensearch-via-aws-cdk-23f805975311" rel="noopener noreferrer"&gt;this&lt;/a&gt;).&lt;br&gt;
Currently, the system immediately switches to the new model once approved. However, we can implement A/B testing to gradually roll out the model, reducing potential risks and ensuring smoother transitions.&lt;br&gt;
Link to the &lt;a href="https://github.com/KateVu/aws-cdk-bedrock-sagemaker-llm" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>awscdk</category>
      <category>sagemaker</category>
      <category>finetuning</category>
      <category>awscommunitybuilder</category>
    </item>
    <item>
      <title>Build a Knowledge-Based Q&amp;A Bot using Bedrock + S3 + DynamoDB/OpenSearch via AWS CDK</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Wed, 24 Dec 2025 11:20:44 +0000</pubDate>
      <link>https://dev.to/katevu/build-a-knowledge-based-qa-bot-using-bedrock-s3-dynamodbopensearch-via-aws-cdk-2he9</link>
      <guid>https://dev.to/katevu/build-a-knowledge-based-qa-bot-using-bedrock-s3-dynamodbopensearch-via-aws-cdk-2he9</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) are incredibly powerful at generating content, but they can have “hallucinations” (making up things with confidence) or give us outdated data, since they are “bound to the data they were trained on” (Julien, Hanza, and Antonio – LLM Engineer’s Handbook).&lt;br&gt;
In many cases, you may also want your LLMs to answer questions using your own documents or internal knowledge bases. Retraining models to achieve this is time-consuming and costly.&lt;br&gt;
That is where Retrieval-Augmented Generation (RAG) becomes a practical solution.&lt;br&gt;
In this blog, we will build a Q&amp;amp;A bot using a RAG architecture, with the knowledge base stored in Amazon DynamoDB for non-production environments and Amazon OpenSearch for production workloads.&lt;br&gt;
The app is built using Kiro 🔥&lt;/p&gt;


&lt;h2&gt;
  
  
  What this app can do
&lt;/h2&gt;

&lt;p&gt;The app allows you to build a knowledge base for the model. Users only need to upload documents (currently supporting .txt and .pdf formats). If any documents become obsolete, simply delete them—the app will automatically trigger a process to remove the corresponding embedding vectors. The same happens when re-uploading documents: old chunks are removed before adding new ones.&lt;br&gt;
The app provides a fallback to the general LLM knowledge, with a clear indication when no relevant documents are found. The threshold is configurable for easy updates.&lt;br&gt;
The app is built using AWS CDK. For cost savings, only the production environment uses Amazon OpenSearch as the vector database, with CloudFront in front of S3 hosting the static website. Other environments use Amazon DynamoDB as the vector store.&lt;br&gt;
If you prefer a Q&amp;amp;A bot without a knowledge base, simply set the ENABLE_RAG environment variable to false before deploying. If no value is set, it defaults to true, and the knowledge base will be deployed.&lt;br&gt;
Before jumping into building the app, there are some terms that will be used:&lt;/p&gt;
&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is a method created by Meta to enhance the accuracy of LLMs and reduce false information (Louis, Building LLMs for Production). RAG works by adding information from a retrieval step as context to the prompt; the LLM then generates the answer. RAG allows you to keep LLMs up to date without retraining the model.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tokens and Embeddings
&lt;/h3&gt;

&lt;p&gt;Tokens are small chunks of text. To compute language, an LLM converts tokens into numeric representations called embeddings. Embeddings are vector representations of data that attempt to capture its meaning (Jay, Maarten - Hands-On Large Language Models).&lt;/p&gt;
&lt;h3&gt;
  
  
  Vector Database
&lt;/h3&gt;

&lt;p&gt;A vector database is a specialized system that stores and queries high-dimensional vectors efficiently. These databases are fundamental for Retrieval-Augmented Generation (RAG) applications (&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-an-aws-vector-database-for-rag-use-cases/vector-databases.html" rel="noopener noreferrer"&gt;Overview of vector databases - AWS Prescriptive Guidance&lt;/a&gt;).&lt;br&gt;
AWS supports several vector database options, including Amazon OpenSearch Service, Amazon Kendra, and Amazon RDS for PostgreSQL with pgvector. &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-an-aws-vector-database-for-rag-use-cases/vector-db-options.html" rel="noopener noreferrer"&gt;Vector database options - AWS Prescriptive Guidance&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkta89e3q35jqhdvf14hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkta89e3q35jqhdvf14hy.png" alt=" " width="800" height="667"&gt;&lt;/a&gt;&lt;br&gt;
Before we dive into the step-by-step walkthrough of the diagram, note that we split the architecture into production and non-production environments for cost savings.&lt;br&gt;
For production, we use Amazon OpenSearch as the vector database and leverage its native vector search, with Amazon CloudFront as the CDN for HTTPS and caching.&lt;br&gt;
For non-production, we use Amazon DynamoDB as the vector store and an AWS Lambda function to scan items and compute cosine similarity.&lt;/p&gt;
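&lt;p&gt;The non-production similarity computation needs no special infrastructure; a minimal sketch in pure Python (the field names match the record layout described in Step 2 below):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_embedding, items, k=3):
    """Rank scanned items by similarity to the query embedding.

    items: list of dicts with 'content' and 'embedding' keys, as stored
    by the ingestion Lambda.
    """
    scored = [(cosine_similarity(query_embedding, item['embedding']), item)
              for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


items = [
    {'content': 'doc A', 'embedding': [1.0, 0.0]},
    {'content': 'doc B', 'embedding': [0.0, 1.0]},
]
best_score, best_item = top_k([0.9, 0.1], items, k=1)[0]
print(best_item['content'])  # -> doc A
```

&lt;p&gt;A full table scan is O(n) per query, which is fine for small non-production corpora but is exactly why production uses OpenSearch's native k-NN index instead.&lt;/p&gt;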
&lt;h3&gt;
  
  
  Step 1: Upload document to S3
&lt;/h3&gt;

&lt;p&gt;Users upload knowledge-base documents to Amazon S3.&lt;br&gt;
This automatically triggers an AWS Lambda function to generate embeddings.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Generate Embeddings
&lt;/h3&gt;

&lt;p&gt;An AWS Lambda function chunks the documents and invokes Amazon Bedrock with the Titan model to generate embeddings.&lt;br&gt;
Depending on the environment, the embeddings are stored in Amazon DynamoDB (non-production) or Amazon OpenSearch (production).&lt;br&gt;
Each record stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chunkId&lt;/li&gt;
&lt;li&gt;documentId&lt;/li&gt;
&lt;li&gt;chunkIndex&lt;/li&gt;
&lt;li&gt;content&lt;/li&gt;
&lt;li&gt;embedding&lt;/li&gt;
&lt;li&gt;sourceKey&lt;/li&gt;
&lt;li&gt;format&lt;/li&gt;
&lt;li&gt;createdAt&lt;/li&gt;
&lt;/ul&gt;
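&lt;p&gt;To make the schema above concrete, here is a hypothetical chunk record (field values are illustrative; a real Titan embedding has far more dimensions):&lt;/p&gt;

```python
# Illustrative example of one stored chunk record.
record = {
    "chunkId": "guide_pdf#0",             # documentId + "#" + chunk index
    "documentId": "guide_pdf",            # derived from the S3 key
    "chunkIndex": 0,
    "content": "AWS Lambda lets you run code without servers...",
    "embedding": [0.012, -0.047, 0.981],  # truncated for readability
    "sourceKey": "guide.pdf",
    "format": "pdf",
    "createdAt": "2026-02-27T04:21:02Z",
}
```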
&lt;h3&gt;
  
  
  Step 3: Users ask questions via frontend
&lt;/h3&gt;

&lt;p&gt;Users access the frontend, configure the API Gateway URL and API key, then submit their questions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Frontend sends the request to API Gateway
&lt;/h3&gt;

&lt;p&gt;The request is sent from the frontend to Amazon API Gateway.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Amazon API Gateway invokes AWS Lambda function
&lt;/h3&gt;

&lt;p&gt;API Gateway invokes a Lambda function to process the request and generate an answer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: AWS Lambda handles the request
&lt;/h3&gt;

&lt;p&gt;First, the Lambda function generates an embedding for the question by calling the Amazon Bedrock API with the Titan model.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For non-production, Lambda searches for similar chunks by scanning DynamoDB and computing the cosine similarity.&lt;/li&gt;
&lt;li&gt;For production, Lambda calls Amazon OpenSearch Service to perform a native vector similarity search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lambda then formats the prompt using the retrieved relevant text and calls the Amazon Bedrock InvokeModel API with Claude Sonnet to get the final answer.&lt;br&gt;
Finally, Lambda returns the response to the frontend.&lt;br&gt;
&lt;em&gt;Note: The models used for embedding generation and for producing the final answer are configurable.&lt;/em&gt;&lt;/p&gt;
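&lt;p&gt;The retrieve-and-format part of Step 6 can be sketched in Python as follows. This is a simplified, hypothetical version of what the query Lambda does in the non-production path; function and field names are illustrative, not the exact implementation:&lt;/p&gt;

```python
import math

def _cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(question_embedding, chunks, question, k=3):
    # Rank all stored chunks by similarity to the question embedding,
    # then splice the top-k chunk contents into the model prompt.
    ranked = sorted(
        chunks,
        key=lambda c: _cosine(question_embedding, c["embedding"]),
        reverse=True,
    )
    context = "\n\n".join(c["content"] for c in ranked[:k])
    return (
        "Use the following context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

&lt;p&gt;The resulting prompt string is what would then be sent to Bedrock's InvokeModel API; in production, the ranking step is replaced by an OpenSearch k-NN query.&lt;/p&gt;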


&lt;h2&gt;
  
  
  AWS Resources:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon CloudFront&lt;/li&gt;
&lt;li&gt;Amazon S3 buckets&lt;/li&gt;
&lt;li&gt;Amazon API Gateway&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Amazon OpenSearch Service&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;li&gt;AWS Identity and Access Management (IAM)&lt;/li&gt;
&lt;li&gt;Amazon CloudWatch&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account that has been bootstrapped for AWS CDK&lt;/li&gt;
&lt;li&gt;Environment setup:
&lt;ul&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;AWS CDK Toolkit&lt;/li&gt;
&lt;li&gt;Docker (used for bundling Lambda functions when deploying)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;AWS credentials: keep them handy so you can deploy the stacks&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Building the app
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Build the frontend
&lt;/h3&gt;

&lt;p&gt;The frontend is built using simple HTML and JavaScript.&lt;br&gt;
We will create two files, index.html and error.html, which will be uploaded to S3 later.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;index.html&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
&amp;lt;head&amp;gt;
 &amp;lt;meta charset="UTF-8"&amp;gt;
 &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&amp;gt;
 &amp;lt;title&amp;gt;Knowledge Q&amp;amp;A Bot&amp;lt;/title&amp;gt;
 &amp;lt;style&amp;gt;
   * { box-sizing: border-box; margin: 0; padding: 0; }
   body {
     font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
     background: #f5f5f5;
     min-height: 100vh;
     padding: 20px;
   }
   .container {
     max-width: 800px;
     margin: 0 auto;
   }
   h1 {
     text-align: center;
     color: #333;
     margin-bottom: 30px;
   }
   .card {
     background: white;
     border-radius: 12px;
     padding: 24px;
     margin-bottom: 20px;
     box-shadow: 0 2px 8px rgba(0,0,0,0.1);
   }
   .card h2 {
     font-size: 18px;
     color: #666;
     margin-bottom: 16px;
   }
   .input-group {
     display: flex;
     gap: 10px;
     margin-bottom: 16px;
   }
   input[type="text"], textarea {
     flex: 1;
     padding: 12px 16px;
     border: 2px solid #e0e0e0;
     border-radius: 8px;
     font-size: 16px;
     transition: border-color 0.2s;
   }
   input[type="text"]:focus, textarea:focus {
     outline: none;
     border-color: #007bff;
   }
   textarea {
     min-height: 100px;
     resize: vertical;
   }
   button {
     padding: 12px 24px;
     background: #007bff;
     color: white;
     border: none;
     border-radius: 8px;
     font-size: 16px;
     cursor: pointer;
     transition: background 0.2s;
   }
   button:hover { background: #0056b3; }
   button:disabled {
     background: #ccc;
     cursor: not-allowed;
   }
   .file-item {
     display: flex;
     justify-content: space-between;
     align-items: center;
     padding: 8px 12px;
     background: #f5f5f5;
     border-radius: 6px;
     margin-bottom: 8px;
   }
   .file-item .remove {
     color: #dc3545;
     cursor: pointer;
     padding: 4px 8px;
   }
   .answer-box {
     background: #f8f9fa;
     border-radius: 8px;
     padding: 16px;
     margin-top: 16px;
     display: none;
   }
   .answer-box.show { display: block; }
   .answer-box h3 {
     font-size: 14px;
     color: #666;
     margin-bottom: 8px;
   }
   .answer-text {
     font-size: 16px;
     line-height: 1.6;
     color: #333;
   }
   .sources {
     margin-top: 16px;
     padding-top: 16px;
     border-top: 1px solid #e0e0e0;
   }
   .sources h4 {
     font-size: 12px;
     color: #888;
     margin-bottom: 8px;
   }
   .source-item {
     font-size: 13px;
     color: #666;
     padding: 8px;
     background: white;
     border-radius: 4px;
     margin-bottom: 6px;
   }
   .source-item .score {
     color: #28a745;
     font-weight: 600;
   }
   .loading {
     display: none;
     text-align: center;
     padding: 20px;
   }
   .loading.show { display: block; }
   .spinner {
     width: 40px;
     height: 40px;
     border: 4px solid #f3f3f3;
     border-top: 4px solid #007bff;
     border-radius: 50%;
     animation: spin 1s linear infinite;
     margin: 0 auto 10px;
   }
   @keyframes spin {
     0% { transform: rotate(0deg); }
     100% { transform: rotate(360deg); }
   }
   .config-section {
     display: flex;
     gap: 10px;
     flex-wrap: wrap;
   }
   .config-section input {
     flex: 1;
     min-width: 200px;
   }
   .status {
     padding: 8px 12px;
     border-radius: 6px;
     font-size: 14px;
     margin-top: 10px;
   }
   .status.success { background: #d4edda; color: #155724; }
   .status.error { background: #f8d7da; color: #721c24; }
   .status.info { background: #cce5ff; color: #004085; }
 &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
 &amp;lt;div class="container"&amp;gt;
   &amp;lt;h1&amp;gt;📚 Knowledge Q&amp;amp;A Bot&amp;lt;/h1&amp;gt;


   &amp;lt;!-- Config --&amp;gt;
   &amp;lt;div class="card"&amp;gt;
     &amp;lt;h2&amp;gt;⚙️ Configuration&amp;lt;/h2&amp;gt;
     &amp;lt;div class="config-section"&amp;gt;
       &amp;lt;div style="flex: 1; min-width: 200px;"&amp;gt;
         &amp;lt;label for="apiUrl" style="display: block; font-size: 12px; color: #666; margin-bottom: 4px;"&amp;gt;API URL&amp;lt;/label&amp;gt;
         &amp;lt;input type="text" id="apiUrl" placeholder="https://xxx.execute-api.ap-southeast-2.amazonaws.com/kate"&amp;gt;
       &amp;lt;/div&amp;gt;
       &amp;lt;div style="flex: 1; min-width: 200px;"&amp;gt;
         &amp;lt;label for="apiKey" style="display: block; font-size: 12px; color: #666; margin-bottom: 4px;"&amp;gt;API Key&amp;lt;/label&amp;gt;
         &amp;lt;input type="text" id="apiKey" placeholder="Enter your API key"&amp;gt;
       &amp;lt;/div&amp;gt;
       &amp;lt;button onclick="saveConfig()"&amp;gt;Save&amp;lt;/button&amp;gt;
     &amp;lt;/div&amp;gt;
     &amp;lt;div id="configStatus"&amp;gt;&amp;lt;/div&amp;gt;
   &amp;lt;/div&amp;gt;


   &amp;lt;!-- Ask --&amp;gt;
   &amp;lt;div class="card"&amp;gt;
     &amp;lt;h2&amp;gt;💬 Ask a Question&amp;lt;/h2&amp;gt;
     &amp;lt;div style="background: #f0f7ff; padding: 12px; border-radius: 6px; margin-bottom: 16px; font-size: 13px; color: #555;"&amp;gt;
       &amp;lt;strong&amp;gt;💡 Tips for better answers:&amp;lt;/strong&amp;gt;
       &amp;lt;ul style="margin: 8px 0 0 20px; padding: 0;"&amp;gt;
         &amp;lt;li&amp;gt;Be specific: "How do I configure Lambda timeout?" vs "Tell me about Lambda"&amp;lt;/li&amp;gt;
         &amp;lt;li&amp;gt;Ask one thing at a time for focused responses&amp;lt;/li&amp;gt;
         &amp;lt;li&amp;gt;Include context: "In AWS CDK, how do I..." helps narrow the search&amp;lt;/li&amp;gt;
         &amp;lt;li&amp;gt;Try rephrasing if the first answer isn't helpful&amp;lt;/li&amp;gt;
       &amp;lt;/ul&amp;gt;
     &amp;lt;/div&amp;gt;
     &amp;lt;div class="input-group"&amp;gt;
       &amp;lt;textarea id="question" placeholder="Example: How do I set up DynamoDB with on-demand billing in AWS CDK?"&amp;gt;&amp;lt;/textarea&amp;gt;
     &amp;lt;/div&amp;gt;
     &amp;lt;button onclick="askQuestion()" id="askBtn"&amp;gt;Ask&amp;lt;/button&amp;gt;

     &amp;lt;div class="loading" id="loading"&amp;gt;
       &amp;lt;div class="spinner"&amp;gt;&amp;lt;/div&amp;gt;
       &amp;lt;p&amp;gt;Thinking...&amp;lt;/p&amp;gt;
     &amp;lt;/div&amp;gt;

     &amp;lt;div class="answer-box" id="answerBox"&amp;gt;
       &amp;lt;h3&amp;gt;Answer&amp;lt;/h3&amp;gt;
       &amp;lt;div class="answer-text" id="answerText"&amp;gt;&amp;lt;/div&amp;gt;
       &amp;lt;div class="sources" id="sources"&amp;gt;&amp;lt;/div&amp;gt;
     &amp;lt;/div&amp;gt;
   &amp;lt;/div&amp;gt;
 &amp;lt;/div&amp;gt;


 &amp;lt;script&amp;gt;
   // Load saved config
   document.getElementById('apiUrl').value = localStorage.getItem('apiUrl') || '';
   document.getElementById('apiKey').value = localStorage.getItem('apiKey') || '';


   function saveConfig() {
     const apiUrl = document.getElementById('apiUrl').value.trim();
     const apiKey = document.getElementById('apiKey').value.trim();
     localStorage.setItem('apiUrl', apiUrl);
     localStorage.setItem('apiKey', apiKey);
     showStatus('configStatus', 'Configuration saved!', 'success');
   }


   function showStatus(elementId, message, type) {
     const el = document.getElementById(elementId);
     el.innerHTML = `&amp;lt;div class="status ${type}"&amp;gt;${message}&amp;lt;/div&amp;gt;`;
     setTimeout(() =&amp;gt; el.innerHTML = '', 5000);
   }


   async function askQuestion() {
     const apiUrl = localStorage.getItem('apiUrl');
     const apiKey = localStorage.getItem('apiKey');
     const question = document.getElementById('question').value.trim();


     if (!apiUrl || !apiKey) {
       showStatus('configStatus', 'Please configure API URL and Key first', 'error');
       return;
     }


     if (!question) {
       alert('Please enter a question');
       return;
     }


     const loading = document.getElementById('loading');
     const answerBox = document.getElementById('answerBox');
     const askBtn = document.getElementById('askBtn');


     loading.classList.add('show');
     answerBox.classList.remove('show');
     askBtn.disabled = true;


     try {
       const response = await fetch(`${apiUrl}/ask`, {
         method: 'POST',
         headers: {
           'Content-Type': 'application/json',
           'x-api-key': apiKey
         },
         body: JSON.stringify({ question, topK: 5 })
       });


       const data = await response.json();


       if (!response.ok) {
         throw new Error(data.error || 'Request failed');
       }


       // Display fallback notice if applicable
       let answerHtml = '';
       if (data.fallback) {
         answerHtml = `
           &amp;lt;div style="background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin-bottom: 12px; border-radius: 4px;"&amp;gt;
             &amp;lt;strong&amp;gt;⚠️ General Knowledge Response&amp;lt;/strong&amp;gt;
             &amp;lt;p style="margin: 4px 0 0 0; font-size: 13px; color: #856404;"&amp;gt;
               ${data.fallbackReason || 'No relevant documents found in knowledge base'}.
               This answer is based on general knowledge, not the knowledge base.
             &amp;lt;/p&amp;gt;
           &amp;lt;/div&amp;gt;
         `;
       }
       answerHtml += `&amp;lt;div&amp;gt;${data.answer}&amp;lt;/div&amp;gt;`;
       document.getElementById('answerText').innerHTML = answerHtml;


       const sourcesEl = document.getElementById('sources');
       if (data.sources &amp;amp;&amp;amp; data.sources.length &amp;gt; 0) {
         sourcesEl.innerHTML = `
           &amp;lt;h4&amp;gt;Sources&amp;lt;/h4&amp;gt;
           ${data.sources.map(s =&amp;gt; `
             &amp;lt;div class="source-item"&amp;gt;
               &amp;lt;strong&amp;gt;${s.documentId}&amp;lt;/strong&amp;gt; (chunk ${s.chunkIndex})
               &amp;lt;span class="score"&amp;gt;${(s.score * 100).toFixed(1)}% match&amp;lt;/span&amp;gt;
               &amp;lt;p style="margin-top: 4px; font-size: 12px;"&amp;gt;${s.excerpt}&amp;lt;/p&amp;gt;
             &amp;lt;/div&amp;gt;
           `).join('')}
         `;
       } else {
         sourcesEl.innerHTML = '';
       }


       answerBox.classList.add('show');


     } catch (error) {
       alert('Error: ' + error.message);
     } finally {
       loading.classList.remove('show');
       askBtn.disabled = false;
     }
   }


   // Enter key to submit
   document.getElementById('question').addEventListener('keydown', (e) =&amp;gt; {
     if (e.key === 'Enter' &amp;amp;&amp;amp; !e.shiftKey) {
       e.preventDefault();
       askQuestion();
     }
   });
 &amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;error.html&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
&amp;lt;head&amp;gt;
 &amp;lt;meta charset="UTF-8"&amp;gt;
 &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&amp;gt;
 &amp;lt;title&amp;gt;Error - Knowledge Q&amp;amp;A Bot&amp;lt;/title&amp;gt;
 &amp;lt;style&amp;gt;
   * {
     margin: 0;
     padding: 0;
     box-sizing: border-box;
   }


   body {
     font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
     background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
     min-height: 100vh;
     display: flex;
     align-items: center;
     justify-content: center;
     padding: 20px;
   }


   .error-container {
     background: white;
     border-radius: 20px;
     box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
     max-width: 600px;
     width: 100%;
     padding: 60px 40px;
     text-align: center;
   }


   .error-icon {
     font-size: 80px;
     margin-bottom: 20px;
   }


   h1 {
     color: #333;
     font-size: 48px;
     margin-bottom: 10px;
     font-weight: 700;
   }


   h2 {
     color: #666;
     font-size: 24px;
     margin-bottom: 30px;
     font-weight: 400;
   }


   p {
     color: #777;
     font-size: 16px;
     line-height: 1.6;
     margin-bottom: 30px;
   }


   .btn {
     padding: 12px 30px;
     border: none;
     border-radius: 8px;
     font-size: 16px;
     font-weight: 600;
     cursor: pointer;
     text-decoration: none;
     display: inline-block;
     background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
     color: white;
   }


   @media (max-width: 600px) {
     .error-container {
       padding: 40px 20px;
     }


     h1 {
       font-size: 36px;
     }


     h2 {
       font-size: 20px;
     }
   }
 &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
 &amp;lt;div class="error-container"&amp;gt;
   &amp;lt;div class="error-icon"&amp;gt;⚠️&amp;lt;/div&amp;gt;
   &amp;lt;h1&amp;gt;Oops!&amp;lt;/h1&amp;gt;
   &amp;lt;h2&amp;gt;Something went wrong&amp;lt;/h2&amp;gt;
   &amp;lt;p&amp;gt;The page you're looking for doesn't exist. This might happen if the URL is incorrect or the page has been removed.&amp;lt;/p&amp;gt;
   &amp;lt;a href="/" class="btn"&amp;gt;Go to Home&amp;lt;/a&amp;gt;
 &amp;lt;/div&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Build the infrastructure using AWS CDK
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.1 S3 Buckets
&lt;/h4&gt;

&lt;p&gt;We will use two buckets for this app:&lt;br&gt;
&lt;strong&gt;Frontend S3 bucket:&lt;/strong&gt; hosts the website&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
* S3 Buckets Stack
*
* This stack creates and manages S3 buckets for the Knowledge Q&amp;amp;A Bot:
* - Documents bucket: Private storage for uploaded documents (TXT, PDF)
* - Frontend bucket: Public static website hosting for the chat UI
*
* The stack is separated from the main application stack to allow
* independent lifecycle management of storage resources.
*/


import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import { Construct } from 'constructs';
import * as path from 'path';


/**
* Props for the S3 Buckets Stack
*/
export interface S3BucketsStackProps extends cdk.StackProps {
 /** Name of the CloudFormation stack */
 stackName: string;
 /** AWS region for deployment */
 region: string;
 /** AWS account ID */
 accountId: string;
 /** Environment name (e.g., 'kate', 'dev', 'prod') */
 envName: string;
 /** Name for the frontend S3 bucket */
 frontendBucketName: string;
}


/**
* Stack that creates S3 bucket for frontend hosting
* Note: Documents bucket has been moved to KnowledgeQaBotStack to avoid cyclic dependencies
*/
export class S3BucketsStack extends cdk.Stack {
 /** S3 bucket for hosting the frontend static website */
 public readonly frontendBucket: s3.Bucket;


 constructor(scope: Construct, id: string, props: S3BucketsStackProps) {
   const { region, accountId, envName } = props;


   // Merge environment configuration with provided props
   const updatedProps = {
     env: {
       region: region,
       account: accountId,
     },
     ...props,
   };


   super(scope, id, updatedProps);


   // ========================================
   // Frontend Bucket
   // ========================================
   // Public bucket for hosting the static website (HTML/CSS/JS)
   // - Website hosting enabled with index.html as default
   // - Public read access for website visitors
   // - Always destroyed on stack deletion (frontend can be redeployed)
   this.frontendBucket = new s3.Bucket(this, 'FrontendBucket', {
     bucketName: props.frontendBucketName,
     websiteIndexDocument: 'index.html',
     websiteErrorDocument: 'error.html',
     publicReadAccess: true,
     blockPublicAccess: new s3.BlockPublicAccess({
       blockPublicAcls: false,
       blockPublicPolicy: false,
       ignorePublicAcls: false,
       restrictPublicBuckets: false,
     }),
     removalPolicy: cdk.RemovalPolicy.DESTROY,
     autoDeleteObjects: true,
   });


   // ========================================
   // Frontend Deployment
   // ========================================
   // Automatically deploy frontend files from ./frontend directory
   // This runs on every CDK deploy to update the website
   new s3deploy.BucketDeployment(this, 'DeployFrontend', {
     sources: [s3deploy.Source.asset(path.join(__dirname, '../frontend'))],
     destinationBucket: this.frontendBucket,
   });


   // ========================================
   // Stack Outputs
   // ========================================
   // Export values for use by other stacks and for reference


   new cdk.CfnOutput(this, 'FrontendBucketName', {
     value: this.frontendBucket.bucketName,
     description: 'S3 bucket for frontend',
     exportName: `FrontendBucketName-${envName}`,
   });


   new cdk.CfnOutput(this, 'FrontendUrl', {
     value: this.frontendBucket.bucketWebsiteUrl,
     description: 'Frontend website URL (S3 direct)',
     exportName: `FrontendS3Url-${envName}`,
   });
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Documents S3 bucket:&lt;/strong&gt; stores the documents uploaded for the knowledge base; currently TXT and PDF formats are accepted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ========================================
   // Documents S3 Bucket
   // ========================================
   // Create the documents bucket in this stack to avoid cyclic dependencies
   // with S3 event notifications

   // For convenience during development/testing, always destroy buckets
   // Uncomment below for production to retain data on stack deletion
   // const isProduction = envName.toLowerCase() === 'prod';
   // const documentsRemovalPolicy = isProduction
   //   ? cdk.RemovalPolicy.RETAIN
   //   : cdk.RemovalPolicy.DESTROY;
   // const autoDeleteDocuments = !isProduction;

   const documentsRemovalPolicy = cdk.RemovalPolicy.DESTROY;
   const autoDeleteDocuments = true;


   this.documentsBucket = new s3.Bucket(this, 'DocumentsBucket', {
     bucketName: resourceName(envName, 'documents'),
     encryption: s3.BucketEncryption.S3_MANAGED,
     blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
     versioned: true,
     enforceSSL: true,
     removalPolicy: documentsRemovalPolicy,
     autoDeleteObjects: autoDeleteDocuments,
   });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 API Gateway
&lt;/h4&gt;

&lt;p&gt;We will create a REST API with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An API key for authentication&lt;/li&gt;
&lt;li&gt;A usage plan to control access and prevent overspending
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   // ========================================
   // API Gateway
   // ========================================
   // REST API for Q&amp;amp;A queries with CORS support
   this.api = new apigateway.RestApi(this, 'QaApi', {
     restApiName: resourceName(envName, 'api'),
     description: 'Knowledge Q&amp;amp;A Bot API',
     deployOptions: {
       stageName: envName,
     },
     // Enable CORS for frontend access
     defaultCorsPreflightOptions: {
       allowOrigins: apigateway.Cors.ALL_ORIGINS,
       allowMethods: apigateway.Cors.ALL_METHODS,
       allowHeaders: ['Content-Type', 'x-api-key'],
     },
   });


   // ========================================
   // API Key &amp;amp; Usage Plan
   // ========================================
   // Protect API with key and enforce rate limits to control costs
   const apiKey = this.api.addApiKey('ApiKey', {
     apiKeyName: resourceName(envName, 'api-key'),
   });


   const usagePlan = this.api.addUsagePlan('UsagePlan', {
     name: resourceName(envName, 'usage-plan'),
     throttle: {
       rateLimit: 10, // 10 requests per second
       burstLimit: 20, // Allow bursts up to 20
     },
     quota: {
       limit: 1000, // 1000 requests per month
       period: apigateway.Period.MONTH,
     },
   });


   usagePlan.addApiKey(apiKey);
   usagePlan.addApiStage({ stage: this.api.deploymentStage });


   // ========================================
   // API Endpoints
   // ========================================
   // POST /ask - Submit a question and get an answer
   const askResource = this.api.root.addResource('ask');
   askResource.addMethod('POST', new apigateway.LambdaIntegration(this.queryFunction), {
     apiKeyRequired: true,
   });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
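&lt;p&gt;Once deployed, the API can also be exercised outside the browser. The sketch below builds the same POST /ask request the frontend sends via fetch(); the URL and key are placeholders, and the helper name is illustrative:&lt;/p&gt;

```python
import json
import urllib.request

def build_ask_request(api_url, api_key, question, top_k=5):
    # Build the POST /ask request with the JSON body and x-api-key
    # header that the usage plan and Lambda integration expect.
    body = json.dumps({"question": question, "topK": top_k}).encode("utf-8")
    return urllib.request.Request(
        api_url.rstrip("/") + "/ask",
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )

# Sending it is then one line against a deployed stack:
# with urllib.request.urlopen(build_ask_request(url, key, "What is RAG?")) as resp:
#     answer = json.loads(resp.read())
```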



&lt;h4&gt;
  
  
  2.3 Lambda
&lt;/h4&gt;

&lt;p&gt;Lambda will be used for both document ingestion and query processing.&lt;br&gt;
&lt;strong&gt;Document Ingestion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggered by OBJECT_CREATED/OBJECT_REMOVED events from S3&lt;/li&gt;
&lt;li&gt;Chunks the documents&lt;/li&gt;
&lt;li&gt;Calls Amazon Bedrock with the Titan model to generate embeddings&lt;/li&gt;
&lt;li&gt;Stores embeddings in Amazon DynamoDB (non-production) or Amazon OpenSearch Service (production)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Handler&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
Document Ingestion Lambda Handler


This Lambda function is triggered by S3 object creation events when
documents are uploaded to the documents bucket. It processes each
document through the following pipeline:


1. Parse: Extract text content from TXT or PDF files
2. Chunk: Split text into overlapping chunks for better retrieval
3. Embed: Generate vector embeddings using Amazon Bedrock Titan
4. Store: Save chunks and embeddings to vector store (DynamoDB or OpenSearch)


Features:
- Automatic document deletion: Removes old chunks when re-uploading
- Batch processing: Efficiently stores multiple chunks
- Error handling: Continues processing even if individual documents fail
- Flexible storage: Supports both DynamoDB (dev) and OpenSearch (prod)


Environment Variables:
   ENABLE_RAG: Enable/disable RAG mode (default: true)
   USE_OPENSEARCH: Use OpenSearch instead of DynamoDB (default: false)
   TABLE_NAME: DynamoDB table for storing chunks (if not using OpenSearch)
   OPENSEARCH_ENDPOINT: OpenSearch domain endpoint (if using OpenSearch)
   BUCKET_NAME: S3 bucket containing documents
   CHUNK_SIZE: Target size for text chunks (default: 1000)
   CHUNK_OVERLAP: Overlap between chunks (default: 200)
   EMBEDDING_MODEL: Bedrock model ID for embeddings
   LOG_LEVEL: Logging level (default: INFO)
"""


import json
import os
import logging
from typing import Any


import boto3
from services.parser import DocumentParser
from services.chunker import TextChunker
from services.embedding import EmbeddingService
from services.vector_store import VectorStore
from services.serialization import serialize_chunk


# ========================================
# Logging Configuration
# ========================================
log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(level=log_level)
logger = logging.getLogger(__name__)


# ========================================
# AWS Client Initialization
# ========================================
# Initialize AWS clients once at module load for connection reuse
s3_client = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
bedrock = boto3.client("bedrock-runtime")


# ========================================
# Environment Variables
# ========================================
ENABLE_RAG = os.environ.get("ENABLE_RAG", "true").lower() == "true"
USE_OPENSEARCH = os.environ.get("USE_OPENSEARCH", "false").lower() == "true"
TABLE_NAME = os.environ.get("TABLE_NAME", "not-used")
BUCKET_NAME = os.environ["BUCKET_NAME"]
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.environ.get("CHUNK_OVERLAP", "200"))
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "amazon.titan-embed-text-v1")


# ========================================
# Service Initialization
# ========================================
# Initialize services with configured clients and settings
parser = DocumentParser(s3_client, BUCKET_NAME)
chunker = TextChunker(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
embedding_service = EmbeddingService(bedrock, EMBEDDING_MODEL)


# Choose vector store based on environment
if ENABLE_RAG:
   if USE_OPENSEARCH:
       from services.opensearch_vector_store import OpenSearchVectorStore
       opensearch_endpoint = os.environ.get("OPENSEARCH_ENDPOINT", "")
       if not opensearch_endpoint:
           raise ValueError("OPENSEARCH_ENDPOINT required when USE_OPENSEARCH=true")
       vector_store = OpenSearchVectorStore(opensearch_endpoint)
       logger.info(f"Using OpenSearch at {opensearch_endpoint}")
   else:
       vector_store = VectorStore(dynamodb.Table(TABLE_NAME))
       logger.info(f"Using DynamoDB table {TABLE_NAME}")
else:
   vector_store = None
   logger.info("RAG disabled - no vector store initialized")




def handler(event: dict[str, Any], context: Any) -&amp;gt; dict[str, Any]:
   """
   Lambda handler for document ingestion and deletion.


   Processes S3 event notifications for uploaded/deleted documents:
   - ObjectCreated: Parse, chunk, embed, and store document
   - ObjectRemoved: Delete all chunks for the document


   Args:
       event: S3 event notification containing Records array
       context: Lambda context object (unused)


   Returns:
       Response dict with statusCode and body containing:
       - processed: Number of successfully processed documents
       - deleted: Number of successfully deleted documents
       - errors: List of error messages for failed operations
   """
   logger.info(f"Received event: {json.dumps(event)}")


   processed_count = 0
   deleted_count = 0
   errors = []


   # Process each S3 event record
   for record in event.get("Records", []):
       try:
           # ========================================
           # Extract S3 Object Information
           # ========================================
           bucket = record["s3"]["bucket"]["name"]
           key = record["s3"]["object"]["key"]
           event_name = record["eventName"]


           logger.info(f"Processing event {event_name}: s3://{bucket}/{key}")


           # Generate unique document ID from S3 key
           document_id = key.replace("/", "_").replace(".", "_")


           # ========================================
           # Handle Delete Events
           # ========================================
           if event_name.startswith("ObjectRemoved"):
               logger.info(f"Deleting chunks for document: {document_id}")
               vector_store.delete_by_document(document_id)
               deleted_count += 1
               logger.info(f"Successfully deleted document: {key}")
               continue


           # ========================================
           # Handle Upload/Reupload Events
           # ========================================
           # Step 0: Delete existing chunks (handles reuploads)
           # This ensures no orphaned chunks remain if new version
           # has fewer chunks than the old version
           logger.info(f"Checking for existing chunks: {document_id}")
           vector_store.delete_by_document(document_id)


           # ========================================
           # Step 1: Parse Document
           # ========================================
           # Extract text content from TXT or PDF file
           parsed_doc = parser.parse(key)
           if not parsed_doc:
               logger.warning(f"Could not parse document: {key}")
               continue


           # ========================================
           # Step 2: Chunk Text
           # ========================================
           # Split into overlapping chunks for better retrieval
           chunks = chunker.chunk(parsed_doc["content"])
           logger.info(f"Created {len(chunks)} chunks from document")


           # ========================================
           # Step 3 &amp;amp; 4: Embed and Store Each Chunk
           # ========================================
           for chunk in chunks:
               # Generate vector embedding via Bedrock Titan
               embedding = embedding_service.embed(chunk["content"])


               # Prepare chunk record for DynamoDB
               stored_chunk = {
                   "chunkId": f"{document_id}#{chunk['index']}",
                   "documentId": document_id,
                   "chunkIndex": chunk["index"],
                   "content": chunk["content"],
                   "embedding": embedding,
                   "sourceKey": key,
                   "format": parsed_doc["metadata"]["format"],
                   "createdAt": parsed_doc["metadata"]["extractedAt"],
               }


               # Store serialized chunk in DynamoDB
               vector_store.store(serialize_chunk(stored_chunk))


           processed_count += 1
           logger.info(f"Successfully processed document: {key}")


       except Exception as e:
           logger.error(f"Error processing record: {e}", exc_info=True)
           errors.append(str(e))


   return {
       "statusCode": 200,
       "body": json.dumps(
           {
               "processed": processed_count,
               "deleted": deleted_count,
               "errors": errors,
           }
       ),
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
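&lt;p&gt;The &lt;code&gt;chunker.chunk&lt;/code&gt; call above splits the document text into overlapping chunks, but the helper itself is not shown. A minimal sketch of what it might look like (the 500-character chunk size and 50-character overlap are illustrative assumptions, not values from the project):&lt;br&gt;
&lt;/p&gt;

```python
def chunk(content, chunk_size=500, overlap=50):
    """Split text into chunk_size-character pieces that overlap by `overlap` chars."""
    chunks = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, max(len(content), 1), step)):
        piece = content[start:start + chunk_size]
        if piece:  # skip the empty tail when the text ends exactly on a step
            chunks.append({"index": index, "content": piece})
    return chunks
```

&lt;p&gt;The overlap means a sentence that straddles a chunk boundary is still fully contained in at least one chunk, which improves retrieval.&lt;/p&gt;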


&lt;p&gt;&lt;strong&gt;&lt;em&gt;Deploy the Lambda function with its log group in Amazon CloudWatch&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   // ========================================
   // Ingestion Lambda
   // ========================================
   // Triggered by S3 uploads, processes documents:
   // 1. Parse document (TXT/PDF)
   // 2. Split into chunks
   // 3. Generate embeddings via Bedrock Titan
   // 4. Store chunks + embeddings in DynamoDB

   // Create log group explicitly so it's managed by CloudFormation
   // and deleted when the stack is destroyed
   const logRetentionDays = config.logRetentionDays || 1;
   const ingestLogGroup = new logs.LogGroup(this, 'IngestLogGroup', {
     logGroupName: `/aws/lambda/${resourceName(envName, 'ingest')}`,
     retention: logRetentionDays as logs.RetentionDays,
     removalPolicy: cdk.RemovalPolicy.DESTROY,
   });

   this.ingestFunction = new PythonFunction(this, 'IngestFunction', {
     functionName: resourceName(envName, 'ingest'),
     entry: 'src/lambdas/ingest',
     runtime: lambda.Runtime.PYTHON_3_12,
     index: 'handler.py',
     handler: 'handler',
     description: `Document ingestion function (config: ${configHash})`,
     memorySize: config.lambdaMemorySize || 1024,
     timeout: Duration.seconds(config.lambdaTimeout || 300),
     environment: commonEnv,
     logGroup: ingestLogGroup,
   });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create S3 event notifications with this Lambda function as the destination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  // ========================================
   // S3 Event Triggers (Only if RAG enabled)
   // ========================================
   // Automatically process documents when uploaded or deleted from S3
   // - OBJECT_CREATED: Parse, chunk, embed, and store
   // - OBJECT_REMOVED: Delete all chunks for the document
   // Skip if RAG is disabled (no document processing needed)
   //
   // Note: Event notifications are added here in the same stack where
   // the Lambda is created to avoid cyclic dependencies
   if (enableRag) {
     // Supported document formats
     // Add new formats here to automatically enable processing
     const supportedFormats = ['.txt', '.pdf'];

     // Register event notifications for each supported format
     supportedFormats.forEach(format =&amp;gt; {
       // Handle uploads and reuploads
       this.documentsBucket.addEventNotification(
         s3.EventType.OBJECT_CREATED,
         new s3n.LambdaDestination(this.ingestFunction),
         { suffix: format }
       );

       // Handle deletions
       this.documentsBucket.addEventNotification(
         s3.EventType.OBJECT_REMOVED,
         new s3n.LambdaDestination(this.ingestFunction),
         { suffix: format }
       );
     });
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
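&lt;p&gt;On the Python side, the ingestion handler derives &lt;code&gt;bucket&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, and &lt;code&gt;document_id&lt;/code&gt; from each record of the S3 event. That parsing is not shown above; a sketch of one way to do it (using the file name without its extension as the document ID is an assumption of this sketch). Note that S3 URL-encodes object keys in event notifications, so the key must be decoded first:&lt;br&gt;
&lt;/p&gt;

```python
import os
from urllib.parse import unquote_plus


def parse_s3_record(record):
    """Extract bucket, decoded key, and a document ID from one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    # S3 URL-encodes keys in event notifications (spaces become '+')
    key = unquote_plus(record["s3"]["object"]["key"])
    # Assumption: file name without extension identifies the document
    document_id = os.path.splitext(os.path.basename(key))[0]
    is_delete = record["eventName"].startswith("ObjectRemoved")
    return {"bucket": bucket, "key": key, "documentId": document_id, "isDelete": is_delete}
```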



&lt;p&gt;&lt;strong&gt;Query Processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives user questions via API Gateway&lt;/li&gt;
&lt;li&gt;Generates embeddings for the question using Amazon Bedrock&lt;/li&gt;
&lt;li&gt;Searches for relevant chunks:
&lt;ul&gt;
&lt;li&gt;Non-prod: scans DynamoDB and computes cosine similarity in the Lambda function&lt;/li&gt;
&lt;li&gt;Prod: queries OpenSearch using native vector search&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Formats a prompt with the retrieved chunks&lt;/li&gt;
&lt;li&gt;Calls Amazon Bedrock InvokeModel API to get the answer&lt;/li&gt;
&lt;li&gt;Returns the response to the frontend&lt;/li&gt;
&lt;/ul&gt;
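&lt;p&gt;In the non-prod path the similarity computation happens inside the Lambda function itself. A minimal sketch of cosine-similarity scoring over stored chunks (pure Python, since the embeddings are stored as plain lists; the helper names are illustrative):&lt;br&gt;
&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, chunks, k=5):
    """Score every chunk against the query and return the k best, highest first."""
    scored = [
        {**chunk, "score": cosine_similarity(query_embedding, chunk["embedding"])}
        for chunk in chunks
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:k]
```

&lt;p&gt;This full scan is O(n) in the number of chunks, which is exactly why the prod path swaps it for native k-NN search in OpenSearch.&lt;/p&gt;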

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Handler&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
Query Lambda Handler


This Lambda function handles Q&amp;amp;A requests from API Gateway.
It processes questions through the following pipeline:


1. Embed: Convert question to vector embedding (Bedrock Titan)
2. Retrieve: Find similar document chunks via cosine similarity (DynamoDB/OpenSearch)
3. Check: Verify similarity threshold (fallback to general knowledge if too low)
4. Generate: Create grounded answer (Bedrock Claude)
5. Respond: Return answer with source citations


Features:
- RAG mode: Retrieves relevant documents and generates grounded answers
- Direct LLM mode: Generates answers without retrieval (when RAG disabled)
- Smart fallback: Uses general knowledge when no relevant documents found
- Similarity threshold: Ensures retrieved documents are actually relevant


Environment Variables:
   ENABLE_RAG: Enable/disable RAG mode (default: true)
   TABLE_NAME: DynamoDB table containing document chunks
   TOP_K: Number of similar chunks to retrieve (default: 5)
   SIMILARITY_THRESHOLD: Minimum similarity score for RAG (default: 0.5)
   EMBEDDING_MODEL: Bedrock model ID for embeddings
   LLM_MODEL: Bedrock model ID for answer generation
   LOG_LEVEL: Logging level (default: INFO)
"""


import json
import os
import logging
from typing import Any


import boto3
from services.embedding import EmbeddingService
from services.retrieval import RetrievalService
from services.answer_generation import AnswerGenerationService
from services.vector_store import VectorStore


# ========================================
# Logging Configuration
# ========================================
log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(level=log_level)
logger = logging.getLogger(__name__)


# ========================================
# AWS Client Initialization
# ========================================
# Initialize AWS clients once at module load for connection reuse
dynamodb = boto3.resource("dynamodb")
bedrock = boto3.client("bedrock-runtime")


# ========================================
# Environment Variables
# ========================================
ENABLE_RAG = os.environ.get("ENABLE_RAG", "true").lower() == "true"
TABLE_NAME = os.environ.get("TABLE_NAME", "not-used")
TOP_K = int(os.environ.get("TOP_K", "5"))
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "amazon.titan-embed-text-v1")
LLM_MODEL = os.environ.get("LLM_MODEL", "anthropic.claude-3-sonnet-20240229-v1:0")
SIMILARITY_THRESHOLD = float(os.environ.get("SIMILARITY_THRESHOLD", "0.5"))


# ========================================
# Service Initialization
# ========================================
# Initialize services with configured clients and settings
answer_service = AnswerGenerationService(bedrock, LLM_MODEL)


# Only initialize RAG services if enabled
if ENABLE_RAG:
   embedding_service = EmbeddingService(bedrock, EMBEDDING_MODEL)

   # Choose vector store based on environment
   USE_OPENSEARCH = os.environ.get("USE_OPENSEARCH", "false").lower() == "true"

   if USE_OPENSEARCH:
       from services.opensearch_vector_store import OpenSearchVectorStore
       opensearch_endpoint = os.environ.get("OPENSEARCH_ENDPOINT", "")
       if not opensearch_endpoint:
           raise ValueError("OPENSEARCH_ENDPOINT environment variable is required when USE_OPENSEARCH=true")
       vector_store = OpenSearchVectorStore(opensearch_endpoint)
       logger.info(f"RAG mode enabled - using OpenSearch at {opensearch_endpoint}")
   else:
       vector_store = VectorStore(dynamodb.Table(TABLE_NAME))
       logger.info(f"RAG mode enabled - using DynamoDB table {TABLE_NAME}")

   retrieval_service = RetrievalService(embedding_service, vector_store)
else:
   embedding_service = None
   vector_store = None
   retrieval_service = None
   logger.info("RAG mode disabled - using direct LLM responses")




def handler(event: dict[str, Any], context: Any) -&amp;gt; dict[str, Any]:
   """
   Lambda handler for Q&amp;amp;A queries.


   Processes POST requests from API Gateway with a question,
   retrieves relevant document chunks, and generates a grounded answer.


   Args:
       event: API Gateway event containing body with question
       context: Lambda context object (unused)


   Returns:
       API Gateway response with:
       - 200: Answer and sources on success
       - 400: Error message for invalid requests
       - 500: Error message for server errors
   """
   logger.info(f"Received event: {json.dumps(event)}")


   try:
       # ========================================
       # Parse and Validate Request
       # ========================================
       body = json.loads(event.get("body", "{}"))
       question = body.get("question", "").strip()
       top_k = body.get("topK", TOP_K)


       # Validate required fields
       if not question:
           return {
               "statusCode": 400,
               "headers": {
                   "Content-Type": "application/json",
                   "Access-Control-Allow-Origin": "*",
                   "Access-Control-Allow-Headers": "Content-Type,x-api-key",
               },
               "body": json.dumps({"error": "Question is required"}),
           }


       logger.info(f"Processing question: {question} (RAG: {ENABLE_RAG})")


       # ========================================
       # RAG Mode: Retrieve + Generate
       # ========================================
       if ENABLE_RAG and retrieval_service:
           # Step 1 &amp;amp; 2: Embed Query and Retrieve Chunks
           # Convert question to embedding and find similar chunks
           retrieval_result = retrieval_service.retrieve(question, top_k)


           # Handle Empty Results or Low Similarity - Fallback to General Knowledge
           # If no relevant documents found or all scores below threshold, use direct LLM
           chunks = retrieval_result["chunks"]
           if not chunks:
               logger.info("No documents found, falling back to general knowledge")
               answer_result = answer_service.generate(question, [])
               return {
                   "statusCode": 200,
                   "headers": {
                       "Content-Type": "application/json",
                       "Access-Control-Allow-Origin": "*",
                       "Access-Control-Allow-Headers": "Content-Type,x-api-key",
                   },
                   "body": json.dumps(
                       {
                           "answer": answer_result["answer"],
                           "sources": [],
                           "fallback": True,
                           "fallbackReason": "No relevant documents found in knowledge base",
                       }
                   ),
               }

           # Check if best match is below similarity threshold
           best_score = chunks[0].get("score", 0)
           if best_score &amp;lt; SIMILARITY_THRESHOLD:
               logger.info(f"Best similarity score {best_score:.3f} below threshold {SIMILARITY_THRESHOLD}, falling back to general knowledge")
               answer_result = answer_service.generate(question, [])
               return {
                   "statusCode": 200,
                   "headers": {
                       "Content-Type": "application/json",
                       "Access-Control-Allow-Origin": "*",
                       "Access-Control-Allow-Headers": "Content-Type,x-api-key",
                   },
                   "body": json.dumps(
                       {
                           "answer": answer_result["answer"],
                           "sources": [],
                           "fallback": True,
                           "fallbackReason": f"No sufficiently relevant documents found (best match: {best_score:.1%})",
                       }
                   ),
               }


           # Step 3: Generate Answer with Context
           # Use Bedrock Claude to generate grounded answer from context
           answer_result = answer_service.generate(
               question, retrieval_result["chunks"]
           )


           # Format response with sources
           response = {
               "answer": answer_result["answer"],
               "sources": [
                   {
                       "documentId": chunk["documentId"],
                       "chunkIndex": chunk["chunkIndex"],
                       "excerpt": (
                           chunk["content"][:200] + "..."
                           if len(chunk["content"]) &amp;gt; 200
                           else chunk["content"]
                       ),
                       "score": chunk["score"],
                   }
                   for chunk in retrieval_result["chunks"]
               ],
           }


       # ========================================
       # Direct LLM Mode: No Retrieval
       # ========================================
       else:
           # Generate answer directly without context
           # This demonstrates "before RAG" behavior
           answer_result = answer_service.generate(question, [])


           # Format response without sources
           response = {
               "answer": answer_result["answer"],
               "sources": [],
               "mode": "direct-llm",  # Indicate this is non-RAG mode
           }


       # ========================================
       # Return Response
       # ========================================
       return {
           "statusCode": 200,
           "headers": {
               "Content-Type": "application/json",
               "Access-Control-Allow-Origin": "*",
               "Access-Control-Allow-Headers": "Content-Type,x-api-key",
           },
           "body": json.dumps(response),
       }


   except Exception as e:
       logger.error(f"Error processing request: {e}", exc_info=True)


       # Include more details in development/debug mode
       error_detail = (
           str(e) if log_level == "DEBUG" else "Service temporarily unavailable"
       )


       return {
           "statusCode": 500,
           "headers": {
               "Content-Type": "application/json",
               "Access-Control-Allow-Origin": "*",
               "Access-Control-Allow-Headers": "Content-Type,x-api-key",
           },
           "body": json.dumps({"error": error_detail, "type": type(e).__name__}),
       }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Deploy the query Lambda function with its log group in Amazon CloudWatch&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   // ========================================
   // Query Lambda
   // ========================================
   // Handles Q&amp;amp;A requests from API Gateway:
   // 1. Generate query embedding via Bedrock Titan
   // 2. Find similar chunks via cosine similarity
   // 3. Generate answer via Bedrock Claude
   // 4. Return answer with source citations

   // Create log group explicitly so it's managed by CloudFormation
   // and deleted when the stack is destroyed
   const queryLogGroup = new logs.LogGroup(this, 'QueryLogGroup', {
     logGroupName: `/aws/lambda/${resourceName(envName, 'query')}`,
     retention: logRetentionDays as logs.RetentionDays,
     removalPolicy: cdk.RemovalPolicy.DESTROY,
   });

   this.queryFunction = new PythonFunction(this, 'QueryFunction', {
     functionName: resourceName(envName, 'query'),
     entry: 'src/lambdas/query',
     runtime: lambda.Runtime.PYTHON_3_12,
     index: 'handler.py',
     handler: 'handler',
     description: `Query processing function (config: ${configHash})`,
     memorySize: config.lambdaMemorySize || 512,
     timeout: Duration.seconds(config.lambdaTimeout || 30),
     environment: commonEnv,
     logGroup: queryLogGroup,
   });


   // ========================================
   // IAM Permissions
   // ========================================
   // Grant least-privilege access to AWS resources


   // Ingestion Lambda: read documents, read/write chunks (only if RAG enabled)
   // Note: Needs read access to query DocumentIndex GSI when deleting old chunks
   this.documentsBucket.grantRead(this.ingestFunction);
   if (enableRag &amp;amp;&amp;amp; chunksTable) {
     chunksTable.grantReadWriteData(this.ingestFunction);
   }


   // Query Lambda: read chunks for similarity search (only if RAG enabled)
   if (enableRag &amp;amp;&amp;amp; chunksTable) {
     chunksTable.grantReadData(this.queryFunction);
   }


   // Both Lambdas need Bedrock access for embeddings and LLM
   const bedrockPolicy = new iam.PolicyStatement({
     effect: iam.Effect.ALLOW,
     actions: ['bedrock:InvokeModel'],
      resources: ['*'], // Broad for simplicity; can be scoped to specific foundation model ARNs
   });
   this.ingestFunction.addToRolePolicy(bedrockPolicy);
   this.queryFunction.addToRolePolicy(bedrockPolicy);


   // If using OpenSearch, grant Lambda access to the domain
   if (useOpenSearch &amp;amp;&amp;amp; openSearchDomain) {
     const openSearchPolicy = new iam.PolicyStatement({
       effect: iam.Effect.ALLOW,
       actions: [
         'es:ESHttpGet',
         'es:ESHttpPost',
         'es:ESHttpPut',
         'es:ESHttpDelete',
       ],
       resources: [`${openSearchDomain.domainArn}/*`],
     });
     this.ingestFunction.addToRolePolicy(openSearchPolicy);
     this.queryFunction.addToRolePolicy(openSearchPolicy);
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.4 DynamoDB
&lt;/h4&gt;

&lt;p&gt;DynamoDB serves as the vector storage for the non-production environment. We store embeddings as JSON, and similarity search is computed in the Lambda function. Although DynamoDB is not a true vector database, it is simple and cost-effective for small datasets and development environments.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Note:&lt;br&gt;
At the time of writing, DynamoDB does not support native vector similarity search by itself. Amazon provides “&lt;a href="https://aws.amazon.com/blogs/database/vector-search-for-amazon-dynamodb-with-zero-etl-for-amazon-opensearch-service/" rel="noopener noreferrer"&gt;Vector search for Amazon DynamoDB with zero ETL for Amazon OpenSearch Service&lt;/a&gt;”, but using OpenSearch incurs additional costs.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
* DynamoDB Stack (Development/Testing Only)
*
* This stack creates and manages DynamoDB tables for the Knowledge Q&amp;amp;A Bot:
* - Chunks table: Stores document chunks with vector embeddings
* - Uses in-memory cosine similarity for vector search
*
* Note: This stack is only deployed for non-production environments.
* Production uses OpenSearch for high-performance vector search.
*
* The stack is separated from the main application stack to allow
* independent lifecycle management of data resources.
*/


import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';


/**
* Props for the DynamoDB Stack
*/
export interface DynamoDBStackProps extends cdk.StackProps {
 /** Name of the CloudFormation stack */
 stackName: string;
 /** Environment name (e.g., 'kate', 'dev', 'prod') */
 envName: string;
 /** Name for the chunks DynamoDB table */
 tableName: string;
}


/**
* Stack that creates DynamoDB tables for document storage
*/
export class DynamoDBStack extends cdk.Stack {
 /** DynamoDB table for storing document chunks and embeddings */
 public readonly chunksTable: dynamodb.Table;


 constructor(scope: Construct, id: string, props: DynamoDBStackProps) {
   super(scope, id, props);


   const { envName } = props;


   // Production environments retain tables on stack deletion
   // Non-production environments auto-delete for easy cleanup
   const isProduction = envName.toLowerCase() === 'prod';
   const tableRemovalPolicy = isProduction
     ? cdk.RemovalPolicy.RETAIN
     : cdk.RemovalPolicy.DESTROY;


   // ========================================
   // Chunks Table
   // ========================================
   // Stores document chunks with their vector embeddings
   // - Partition key: chunkId (format: {documentId}#{chunkIndex})
   // - On-demand billing for cost efficiency (pay per request)
   // - GSI on documentId for efficient document deletion
   this.chunksTable = new dynamodb.Table(this, 'ChunksTable', {
     tableName: props.tableName,
     partitionKey: { name: 'chunkId', type: dynamodb.AttributeType.STRING },
     billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
     removalPolicy: tableRemovalPolicy,
     // Enable point-in-time recovery for production
     ...(isProduction &amp;amp;&amp;amp; {
       pointInTimeRecoverySpecification: {
         pointInTimeRecoveryEnabled: true,
       },
     }),
   });


   // ========================================
   // Global Secondary Index
   // ========================================
   // Index for querying all chunks of a document
   // Used when deleting a document to remove all its chunks
   this.chunksTable.addGlobalSecondaryIndex({
     indexName: 'DocumentIndex',
     partitionKey: { name: 'documentId', type: dynamodb.AttributeType.STRING },
   });


   // ========================================
   // Stack Outputs
   // ========================================
   new cdk.CfnOutput(this, 'ChunksTableName', {
     value: this.chunksTable.tableName,
     description: 'DynamoDB table for document chunks',
     exportName: `ChunksTableName-${envName}`,
   });


   new cdk.CfnOutput(this, 'ChunksTableArn', {
     value: this.chunksTable.tableArn,
     description: 'DynamoDB table ARN for document chunks',
     exportName: `ChunksTableArn-${envName}`,
   });
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
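&lt;p&gt;The &lt;code&gt;serialize_chunk&lt;/code&gt; helper the ingestion Lambda calls before &lt;code&gt;vector_store.store&lt;/code&gt; is not shown above. DynamoDB does not accept Python &lt;code&gt;float&lt;/code&gt; values, so the embedding values must be converted to &lt;code&gt;Decimal&lt;/code&gt; before writing (and back to floats after reading). A sketch of that conversion, assuming the chunk layout shown earlier:&lt;br&gt;
&lt;/p&gt;

```python
from decimal import Decimal


def serialize_chunk(chunk):
    """Convert float embedding values to Decimal so DynamoDB accepts the item."""
    serialized = dict(chunk)
    # str() first avoids binary-float artifacts like Decimal(0.1000000000000000055...)
    serialized["embedding"] = [Decimal(str(v)) for v in chunk["embedding"]]
    return serialized


def deserialize_chunk(item):
    """Convert Decimal values back to floats after reading from DynamoDB."""
    restored = dict(item)
    restored["embedding"] = [float(v) for v in item["embedding"]]
    return restored
```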



&lt;h4&gt;
  
  
  2.5 CloudFront
&lt;/h4&gt;

&lt;p&gt;CloudFront distributes the frontend with caching and HTTPS support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
* CloudFront Stack (Production Only)
*
* This stack creates a CloudFront distribution for the frontend.
*
* Note: This stack is only deployed for production environments.
* Development/testing environments serve frontend directly from S3.
*
* CloudFront provides:
* - HTTPS support with AWS-managed certificate
* - Global CDN for fast access worldwide
* - Custom domain support (optional)
* - Better caching and performance
*/


import * as cdk from 'aws-cdk-lib';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';


/**
* Props for the CloudFront Stack
*/
export interface CloudFrontStackProps extends cdk.StackProps {
 /** Name of the CloudFormation stack */
 stackName: string;
 /** Environment name (e.g., 'kate', 'dev', 'prod') */
 envName: string;
 /** S3 bucket for frontend (from S3BucketsStack) */
 frontendBucket: s3.IBucket;
}


/**
* Stack that creates CloudFront distribution for frontend
*/
export class CloudFrontStack extends cdk.Stack {
 /** CloudFront distribution for frontend */
 public readonly distribution: cloudfront.Distribution;
 /** CloudFront domain name */
 public readonly distributionDomainName: string;


 constructor(scope: Construct, id: string, props: CloudFrontStackProps) {
   super(scope, id, props);


   const { envName, frontendBucket } = props;


   // ========================================
   // CloudFront Distribution
   // ========================================
   // CDN for frontend with HTTPS and global edge locations
   this.distribution = new cloudfront.Distribution(this, 'FrontendDistribution', {
     comment: `Knowledge Q&amp;amp;A Bot Frontend - ${envName}`,

     // Origin: S3 bucket with website hosting
     defaultBehavior: {
       origin: new origins.S3Origin(frontendBucket),
       viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
       allowedMethods: cloudfront.AllowedMethods.ALLOW_GET_HEAD,
       cachedMethods: cloudfront.CachedMethods.CACHE_GET_HEAD,
       compress: true,

       // Cache policy for static content
       cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
     },


     // Default root object
     defaultRootObject: 'index.html',


     // Error responses
     errorResponses: [
       {
         httpStatus: 404,
         responseHttpStatus: 200,
         responsePagePath: '/index.html',
         ttl: cdk.Duration.minutes(5),
       },
     ],


     // Price class - use all edge locations for prod, cheaper for dev
     priceClass: envName.toLowerCase() === 'prod'
       ? cloudfront.PriceClass.PRICE_CLASS_ALL
       : cloudfront.PriceClass.PRICE_CLASS_100,


     // Enable IPv6
     enableIpv6: true,
   });


   this.distributionDomainName = this.distribution.distributionDomainName;


   // ========================================
   // Stack Outputs
   // ========================================
   new cdk.CfnOutput(this, 'DistributionId', {
     value: this.distribution.distributionId,
     description: 'CloudFront distribution ID',
     exportName: `CloudFrontDistributionId-${envName}`,
   });


   new cdk.CfnOutput(this, 'DistributionDomainName', {
     value: this.distribution.distributionDomainName,
     description: 'CloudFront distribution domain name',
     exportName: `CloudFrontDomainName-${envName}`,
   });


   new cdk.CfnOutput(this, 'FrontendUrl', {
     value: `https://${this.distribution.distributionDomainName}`,
     description: 'Frontend URL (HTTPS)',
     exportName: `FrontendUrl-${envName}`,
   });
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.6 OpenSearch
&lt;/h4&gt;

&lt;p&gt;OpenSearch is a true vector database with native k-NN support, providing faster vector search at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
* OpenSearch Stack (Production Only)
*
* This stack creates an OpenSearch domain for high-performance vector similarity search.
* OpenSearch provides native k-NN (k-nearest neighbors) support for
* efficient vector search at scale.
*
* Note: This stack is only deployed for production environments.
* Development/testing environments use DynamoDB with in-memory cosine similarity.
*
* Features:
* - k-NN plugin enabled for vector search
* - Fine-grained access control
* - Encryption at rest and in transit
* - Single-node configuration for cost efficiency (scale up as needed)
*/


import * as cdk from 'aws-cdk-lib';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';


/**
* Props for the OpenSearch Stack
*/
export interface OpenSearchStackProps extends cdk.StackProps {
 /** Name of the CloudFormation stack */
 stackName: string;
 /** Environment name (e.g., 'kate', 'dev', 'prod') */
 envName: string;
 /** Domain name for OpenSearch */
 domainName: string;
}


/**
* Stack that creates OpenSearch domain for vector search
*/
export class OpenSearchStack extends cdk.Stack {
 /** OpenSearch domain for vector similarity search */
 public readonly domain: opensearch.Domain;
 /** Domain endpoint URL */
 public readonly domainEndpoint: string;


 constructor(scope: Construct, id: string, props: OpenSearchStackProps) {
   super(scope, id, props);


   const { envName } = props;
   const region = props.env?.region || 'us-east-1';
   const accountId = props.env?.account || cdk.Aws.ACCOUNT_ID;


   // Production environments use larger instances and retain on deletion
   const isProduction = envName.toLowerCase() === 'prod';


   // ========================================
   // OpenSearch Domain
   // ========================================
   // Domain for storing and searching vector embeddings
   // - k-NN plugin enabled for vector similarity search
   // - t3.small.search for cost-effective prototype/dev
   // - Single node for dev, multi-node for prod
   this.domain = new opensearch.Domain(this, 'VectorSearchDomain', {
     domainName: props.domainName,
     version: opensearch.EngineVersion.OPENSEARCH_2_11,

     // Capacity configuration
     capacity: {
       dataNodes: 1,
       dataNodeInstanceType: 't3.small.search',
       // No dedicated master nodes for cost-effective single-node setup
       // For production scale, use 3+ master nodes and 2+ data nodes
       multiAzWithStandbyEnabled: false, // T3 instances don't support Multi-AZ with standby
     },


     // Storage configuration
     ebs: {
       volumeSize: isProduction ? 100 : 20, // GB
       volumeType: ec2.EbsDeviceVolumeType.GP3,
     },


     // Security configuration
     enforceHttps: true,
     nodeToNodeEncryption: true,
     encryptionAtRest: {
       enabled: true,
     },


     // Access policy - allow IAM authenticated access from this account
     // This allows Lambda functions with proper IAM permissions to access
     // Using explicit actions instead of es:* to force CDK update
     accessPolicies: [
       new iam.PolicyStatement({
         effect: iam.Effect.ALLOW,
         principals: [
           new iam.ArnPrincipal(`arn:aws:iam::${accountId}:root`)
         ],
         actions: [
           'es:ESHttpDelete',
           'es:ESHttpGet',
           'es:ESHttpHead',
           'es:ESHttpPost',
           'es:ESHttpPut',
           'es:ESHttpPatch'
         ],
         resources: [`arn:aws:es:${region}:${accountId}:domain/${props.domainName}/*`],
       }),
     ],


     // Fine-grained access control disabled for simplicity
     // Enable in production with proper user/role configuration
     // fineGrainedAccessControl: {
     //   masterUserArn: `arn:aws:iam::${accountId}:root`,
     // },


     // Removal policy
      removalPolicy: isProduction ? cdk.RemovalPolicy.RETAIN : cdk.RemovalPolicy.DESTROY,
   });


   this.domainEndpoint = this.domain.domainEndpoint;


   // ========================================
   // Stack Outputs
   // ========================================
   // Export domain endpoint for cross-stack reference
   const endpointExport = new cdk.CfnOutput(this, 'DomainEndpoint', {
     value: this.domain.domainEndpoint,
     description: 'OpenSearch domain endpoint',
     exportName: `OpenSearchEndpoint-${envName}`,
   });


   // Export domain ARN for IAM policies
   const arnExport = new cdk.CfnOutput(this, 'DomainArn', {
     value: this.domain.domainArn,
     description: 'OpenSearch domain ARN',
     exportName: `OpenSearchArn-${envName}`,
   });


   new cdk.CfnOutput(this, 'DomainName', {
     value: this.domain.domainName,
     description: 'OpenSearch domain name',
     exportName: `OpenSearchDomainName-${envName}`,
   });
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
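&lt;p&gt;Before chunks can be stored, the domain above still needs an index with a &lt;code&gt;knn_vector&lt;/code&gt; field. A sketch of the index body and search request the Lambdas might use (the field names mirror the DynamoDB chunk layout; the 1536 dimension matches &lt;code&gt;amazon.titan-embed-text-v1&lt;/code&gt;, while the exact index body used by the project is an assumption):&lt;br&gt;
&lt;/p&gt;

```python
# Index body for an OpenSearch k-NN index storing chunk embeddings.
# "knn": True enables the k-NN plugin for this index.
KNN_INDEX_BODY = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "documentId": {"type": "keyword"},
            "chunkIndex": {"type": "integer"},
            "content": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,  # amazon.titan-embed-text-v1 output size
            },
        }
    },
}


def knn_query(query_embedding, k=5):
    """Build a k-NN search body over the 'embedding' field for the query Lambda."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_embedding, "k": k}}},
    }
```

&lt;p&gt;The ingestion side would create the index once with &lt;code&gt;KNN_INDEX_BODY&lt;/code&gt;, and the query side would send &lt;code&gt;knn_query(...)&lt;/code&gt; instead of scanning and scoring chunks itself.&lt;/p&gt;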






&lt;h2&gt;
  
  
  Demo Results (Before vs. After Uploading Documents to the Knowledge Base)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM Without the Knowledge Base
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iscdfwnijus0c5cni7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iscdfwnijus0c5cni7h.png" alt=" " width="800" height="883"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM With Successful Retrieval
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnzncoevi9abeipjf0v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnzncoevi9abeipjf0v1.png" alt=" " width="800" height="918"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Now we have a Q&amp;amp;A bot backed by our own knowledge base. We can easily update the bot with the latest documents without retraining the model, and even use our private data.&lt;br&gt;
AWS also provides an option to simply connect an Amazon S3 bucket to Amazon Bedrock, letting AWS handle the heavy lifting for you. But be aware of the potential costs at the end of the month. For more detail, refer to &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/s3-data-source-connector.html" rel="noopener noreferrer"&gt;Connect to Amazon S3 for your knowledge base&lt;/a&gt;&lt;/p&gt;
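
&lt;p&gt;If you take the managed route, attaching the bucket comes down to a single bedrock-agent API call. The sketch below is illustrative, not code from this project: the bucket ARN and prefix are placeholders, and the payload shape is my reading of the CreateDataSource API.&lt;/p&gt;

```python
import json

def s3_data_source_config(bucket_arn, prefixes=None):
    """Build the dataSourceConfiguration payload for an S3-backed knowledge base."""
    s3_config = {"bucketArn": bucket_arn}
    if prefixes:
        # Only ingest objects under these key prefixes instead of the whole bucket
        s3_config["inclusionPrefixes"] = prefixes
    return {"type": "S3", "s3Configuration": s3_config}

# Hypothetical bucket ARN and prefix, for illustration only
config = s3_data_source_config("arn:aws:s3:::my-kb-documents", ["docs/"])
print(json.dumps(config))

# With AWS credentials configured, this payload would be passed to
# boto3.client("bedrock-agent").create_data_source(
#     knowledgeBaseId="...", name="s3-docs", dataSourceConfiguration=config)
```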




&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-an-aws-vector-database-for-rag-use-cases/introduction.html" rel="noopener noreferrer"&gt;Choosing an AWS vector database for RAG use cases - AWS Prescriptive Guidance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;What is RAG? - Retrieval-Augmented Generation AI Explained - AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/vector-search-for-amazon-dynamodb-with-zero-etl-for-amazon-opensearch-service/" rel="noopener noreferrer"&gt;Vector search for Amazon DynamoDB with zero ETL for Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-amazon-opensearch-service/" rel="noopener noreferrer"&gt;AWS announces Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service - AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-an-aws-vector-database-for-rag-use-cases/vector-db-options.html" rel="noopener noreferrer"&gt;Vector database options - AWS Prescriptive Guidance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/s3-data-source-connector.html" rel="noopener noreferrer"&gt;Connect to Amazon S3 for your knowledge base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Louis. Building LLMs for Production&lt;/li&gt;
&lt;li&gt;Jay, Maarten. Hands-On Large Language Models&lt;/li&gt;
&lt;li&gt;Julien, Hanza, &amp;amp; Antonio. LLM Engineer’s Handbook&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>awscdk</category>
      <category>awsbedrock</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Build Your Own Private AI Image Generator on AWS with AWS Bedrock with Multiple Models and AWS CDK</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Fri, 21 Nov 2025 13:49:11 +0000</pubDate>
      <link>https://dev.to/katevu/build-your-own-private-ai-image-generator-on-aws-with-aws-bedrock-with-multiple-models-and-aws-cdk-4b3k</link>
      <guid>https://dev.to/katevu/build-your-own-private-ai-image-generator-on-aws-with-aws-bedrock-with-multiple-models-and-aws-cdk-4b3k</guid>
      <description>&lt;p&gt;There are many excellent image generation tools available today. Many of them are free and easy to use. However, your prompts and images are processed outside your environment on platforms that you do not control. In this project, we will build a fully private image generator app running entirely inside AWS. The images remain in your S3 bucket, and you can switch between Bedrock foundation models such as Titan and Stable Diffusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this app can do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Text-to-Image
Enter a prompt such as "an orange cat sitting next to the door looking at the rain outside" and the app will generate an image&lt;/li&gt;
&lt;li&gt;Image-to-Image
Upload an image and a new prompt, for example:

&lt;ul&gt;
&lt;li&gt;Image of Leo the orange cat&lt;/li&gt;
&lt;li&gt;With prompt: convert to Leo the knight cat
The model will transform the image based on the prompt&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Multiple Bedrock Image Models
Simply switch between two models on the website hosted in an S3 bucket:

&lt;ul&gt;
&lt;li&gt;Amazon Titan Image Generator v2 (default)&lt;/li&gt;
&lt;li&gt;Stable Diffusion 3.5 Large
Future models can be added easily. The AWS Lambda backend detects which model the user selected and sends the request with the corresponding model ID to Amazon Bedrock&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9j2nbbhvei8kqyy5uuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9j2nbbhvei8kqyy5uuh.png" alt=" " width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user sends a request via the website hosted in an Amazon S3 bucket with the following inputs:

&lt;ul&gt;
&lt;li&gt;Text-to-Image or Image-to-Image&lt;/li&gt;
&lt;li&gt;Model to use&lt;/li&gt;
&lt;li&gt;Prompt, plus an uploaded image for Image-to-Image&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The request goes to API Gateway&lt;/li&gt;
&lt;li&gt;API Gateway invokes Lambda function&lt;/li&gt;
&lt;li&gt;The Lambda function:

&lt;ul&gt;
&lt;li&gt;Validates the input&lt;/li&gt;
&lt;li&gt;Retrieves the selected model&lt;/li&gt;
&lt;li&gt;Builds the Bedrock request&lt;/li&gt;
&lt;li&gt;Sends the request to Bedrock via the InvokeModel API&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Bedrock generates the image and returns it as a base64-encoded string&lt;/li&gt;
&lt;li&gt;The Lambda function:

&lt;ul&gt;
&lt;li&gt;Receives the generated image as a base64-encoded string and saves it to S3&lt;/li&gt;
&lt;li&gt;Generates a pre-signed URL for downloading&lt;/li&gt;
&lt;li&gt;Publishes performance metrics to CloudWatch&lt;/li&gt;
&lt;li&gt;Returns the image and URL to the frontend&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The frontend displays the image along with the download link.&lt;/li&gt;
&lt;/ol&gt;
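
&lt;p&gt;To make step 4 concrete, here is a minimal sketch of how the Lambda could build the model-specific request body. This is my own illustration, not the article's code: the function name is hypothetical, and the payload shapes follow the Bedrock InvokeModel documentation for Titan Image Generator v2 and Stable Diffusion 3.5.&lt;/p&gt;

```python
import json

def build_bedrock_request(model_id, prompt, width=1024, height=1024, image_b64=None):
    """Return the JSON body the Lambda would pass to bedrock-runtime InvokeModel."""
    if model_id.startswith("amazon.titan-image-generator"):
        body = {
            "imageGenerationConfig": {"numberOfImages": 1, "width": width, "height": height},
        }
        if image_b64:
            # Image-to-Image: Titan models this as an image variation task
            body["taskType"] = "IMAGE_VARIATION"
            body["imageVariationParams"] = {"text": prompt, "images": [image_b64]}
        else:
            body["taskType"] = "TEXT_IMAGE"
            body["textToImageParams"] = {"text": prompt}
        return json.dumps(body)
    if model_id.startswith("stability."):
        body = {"prompt": prompt, "output_format": "png"}
        if image_b64:
            # SD3 image-to-image takes the source image plus a strength value
            body.update({"mode": "image-to-image", "image": image_b64, "strength": 0.7})
        else:
            body["mode"] = "text-to-image"
        return json.dumps(body)
    raise ValueError(f"Unsupported model: {model_id}")
```

&lt;p&gt;The handler would then pass this body to InvokeModel, base64-decode the returned image, put it in the image bucket, and generate a pre-signed URL for the download link.&lt;/p&gt;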

&lt;h2&gt;
  
  
  AWS Resources:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3 Buckets

&lt;ul&gt;
&lt;li&gt;Frontend bucket&lt;/li&gt;
&lt;li&gt;Image bucket&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;API Gateway&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;li&gt;IAM&lt;/li&gt;
&lt;li&gt;CloudWatch&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account: you will need it to deploy S3, Lambda, API Gateway, Bedrock, CloudWatch, and IAM resources&lt;/li&gt;
&lt;li&gt;Environment setup: make sure these are installed and working

&lt;ul&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;AWS CDK Toolkit&lt;/li&gt;
&lt;li&gt;Docker: up and running; we will use it to bundle our Lambda function&lt;/li&gt;
&lt;li&gt;AWS Credentials: keep them handy so you can deploy the stacks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deploy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Get the model IDs
&lt;/h3&gt;

&lt;p&gt;Go to the Amazon Bedrock console and get the exact model IDs. For this app, we are using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;amazon.titan-image-generator-v2:0&lt;/li&gt;
&lt;li&gt;stability.sd3-5-large-v1:0 (in us-west-2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure these models are available in your selected Region. For information about model availability by Region, refer to &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-lifecycle.html" rel="noopener noreferrer"&gt;Model lifecycle - Amazon Bedrock&lt;/a&gt;&lt;/p&gt;
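
&lt;p&gt;You can also check availability programmatically. In the sketch below, the helper function is my own illustration: with credentials configured, the list passed to it would come from the modelSummaries field of the ListFoundationModels response; the sample entries here are hard-coded for demonstration.&lt;/p&gt;

```python
def available_model_ids(summaries, provider=None):
    """Extract model IDs from ListFoundationModels summaries, optionally by provider."""
    return [
        s["modelId"]
        for s in summaries
        if provider is None or s.get("providerName") == provider
    ]

# With AWS credentials configured, this would instead be:
# summaries = boto3.client("bedrock").list_foundation_models()["modelSummaries"]
sample = [
    {"modelId": "amazon.titan-image-generator-v2:0", "providerName": "Amazon"},
    {"modelId": "stability.sd3-5-large-v1:0", "providerName": "Stability AI"},
]
print(available_model_ids(sample, provider="Amazon"))
```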

&lt;h3&gt;
  
  
  2. Create the resources
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.1 Setup frontend
&lt;/h4&gt;

&lt;p&gt;Create two HTML files:&lt;br&gt;
index.html: the main interface, which allows the user to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch between Text-to-Image and Image-to-Image&lt;/li&gt;
&lt;li&gt;Switch between models: Amazon Titan Image Generator (default) or Stable Diffusion&lt;/li&gt;
&lt;li&gt;Submit prompts and images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;error.html: a simple page displayed if something goes wrong&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;index.html&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;meta charset="UTF-8"&amp;gt;
    &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&amp;gt;
    &amp;lt;title&amp;gt;AI Image Generator&amp;lt;/title&amp;gt;
    &amp;lt;style&amp;gt;
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
            background: linear-gradient(135deg, #10b981 0%, #059669 100%);
            min-height: 100vh;
            padding: 20px;
        }

        .container {
            max-width: 1200px;
            margin: 0 auto;
        }

        h1 {
            text-align: center;
            color: white;
            margin-bottom: 30px;
            font-size: 2.5rem;
        }

        .card {
            background: white;
            border-radius: 12px;
            padding: 30px;
            box-shadow: 0 10px 40px rgba(0,0,0,0.1);
            margin-bottom: 20px;
        }

        .tabs {
            display: flex;
            gap: 10px;
            margin-bottom: 30px;
        }

        .tab {
            flex: 1;
            padding: 15px;
            background: #f5f5f5;
            border: none;
            border-radius: 8px;
            cursor: pointer;
            font-size: 16px;
            font-weight: 600;
            transition: all 0.3s;
        }

        .tab.active {
            background: #10b981;
            color: white;
        }

        .tab-content {
            display: none;
        }

        .tab-content.active {
            display: block;
        }

        .form-group {
            margin-bottom: 20px;
        }

        label {
            display: block;
            margin-bottom: 8px;
            font-weight: 600;
            color: #333;
        }

        input[type="text"],
        textarea {
            width: 100%;
            padding: 12px;
            border: 2px solid #e0e0e0;
            border-radius: 8px;
            font-size: 16px;
            transition: border-color 0.3s;
        }

        input[type="text"]:focus,
        textarea:focus {
            outline: none;
            border-color: #10b981;
        }

        textarea {
            resize: vertical;
            min-height: 100px;
        }

        .size-inputs {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 15px;
        }

        input[type="number"] {
            width: 100%;
            padding: 12px;
            border: 2px solid #e0e0e0;
            border-radius: 8px;
            font-size: 16px;
        }

        .file-upload {
            border: 2px dashed #e0e0e0;
            border-radius: 8px;
            padding: 30px;
            text-align: center;
            cursor: pointer;
            transition: all 0.3s;
        }

        .file-upload:hover {
            border-color: #10b981;
            background: #f0fdf4;
        }

        .file-upload.dragover {
            border-color: #10b981;
            background: #d1fae5;
            border-style: solid;
        }

        .file-upload input {
            display: none;
        }

        .preview-image {
            max-width: 100%;
            max-height: 300px;
            margin-top: 15px;
            border-radius: 8px;
        }

        button {
            width: 100%;
            padding: 15px;
            background: #10b981;
            color: white;
            border: none;
            border-radius: 8px;
            font-size: 18px;
            font-weight: 600;
            cursor: pointer;
            transition: all 0.3s;
        }

        button:hover {
            background: #059669;
            transform: translateY(-2px);
            box-shadow: 0 5px 15px rgba(16, 185, 129, 0.4);
        }

        button:disabled {
            background: #ccc;
            cursor: not-allowed;
            transform: none;
        }

        .result {
            margin-top: 30px;
        }

        .result-image {
            width: 100%;
            border-radius: 8px;
            margin-top: 15px;
        }

        .status {
            padding: 15px;
            border-radius: 8px;
            margin-top: 15px;
            font-weight: 600;
        }

        .status.success {
            background: #d4edda;
            color: #155724;
        }

        .status.error {
            background: #f8d7da;
            color: #721c24;
        }

        .status.loading {
            background: #d1ecf1;
            color: #0c5460;
        }

        .loader {
            border: 4px solid #f3f3f3;
            border-top: 4px solid #10b981;
            border-radius: 50%;
            width: 40px;
            height: 40px;
            animation: spin 1s linear infinite;
            margin: 20px auto;
        }

        @keyframes spin {
            0% { transform: rotate(0deg); }
            100% { transform: rotate(360deg); }
        }

        @keyframes slideIn {
            from {
                opacity: 0;
                transform: translateY(-10px);
            }
            to {
                opacity: 1;
                transform: translateY(0);
            }
        }

        .privacy-note {
            background: #e8f4f8;
            padding: 15px;
            border-radius: 8px;
            margin-top: 20px;
            font-size: 14px;
            color: #0c5460;
        }
    &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;div class="container"&amp;gt;
        &amp;lt;h1&amp;gt;🎨 AI Image Generator&amp;lt;/h1&amp;gt;

        &amp;lt;div class="card"&amp;gt;
            &amp;lt;div class="tabs"&amp;gt;
                &amp;lt;button class="tab active" onclick="switchTab('text-to-image')"&amp;gt;Text to Image&amp;lt;/button&amp;gt;
                &amp;lt;button class="tab" onclick="switchTab('image-to-image')"&amp;gt;Image to Image&amp;lt;/button&amp;gt;
            &amp;lt;/div&amp;gt;

            &amp;lt;!-- Text to Image Tab --&amp;gt;
            &amp;lt;div id="text-to-image" class="tab-content active"&amp;gt;
                &amp;lt;form onsubmit="generateTextToImage(event)"&amp;gt;
                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;API Endpoint&amp;lt;/label&amp;gt;
                        &amp;lt;input type="text" id="api-endpoint-text" placeholder="https://your-api-gateway-url/generate" required&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;API Key (Required)&amp;lt;/label&amp;gt;
                        &amp;lt;input type="password" id="api-key-text" placeholder="Your API key" required&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Get your API key from CDK output or AWS Console&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Model&amp;lt;/label&amp;gt;
                        &amp;lt;select id="model-text" style="width: 100%; padding: 12px; border: 2px solid #e0e0e0; border-radius: 8px; font-size: 16px;"&amp;gt;
                            &amp;lt;option value="amazon.titan-image-generator-v2:0"&amp;gt;Amazon Titan Image Generator v2 (Recommended)&amp;lt;/option&amp;gt;
                            &amp;lt;option value="stability.sd3-5-large-v1:0"&amp;gt;Stable Diffusion 3.5 Large (Higher Quality)&amp;lt;/option&amp;gt;
                        &amp;lt;/select&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px;"&amp;gt;Titan: $0.008-$0.010/image | SD 3.5: $0.065/image&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Prompt&amp;lt;/label&amp;gt;
                        &amp;lt;textarea id="prompt-text" placeholder="Describe the image you want to generate..." required&amp;gt;&amp;lt;/textarea&amp;gt;
                        &amp;lt;div style="margin-top: 8px; display: flex; gap: 8px;"&amp;gt;
                            &amp;lt;button type="button" onclick="showPromptTemplates()" 
                                    style="width: auto; padding: 8px 15px; font-size: 14px; background: #6c757d;"&amp;gt;
                                💡 Example Prompts
                            &amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="showStylePresets('text')" 
                                    style="width: auto; padding: 8px 15px; font-size: 14px; background: #8b5cf6;"&amp;gt;
                                🎨 Style Presets
                            &amp;lt;/button&amp;gt;
                        &amp;lt;/div&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Negative Prompt (Optional - what to avoid)&amp;lt;/label&amp;gt;
                        &amp;lt;textarea id="negative-prompt-text" placeholder="e.g., blurry, low quality, distorted, ugly..." rows="2"&amp;gt;&amp;lt;/textarea&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Specify what you DON'T want in the image&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Image Size&amp;lt;/label&amp;gt;
                        &amp;lt;div class="size-inputs"&amp;gt;
                            &amp;lt;div&amp;gt;
                                &amp;lt;label&amp;gt;Width&amp;lt;/label&amp;gt;
                                &amp;lt;input type="number" id="width-text" value="1024" min="512" max="2048" step="64"&amp;gt;
                            &amp;lt;/div&amp;gt;
                            &amp;lt;div&amp;gt;
                                &amp;lt;label&amp;gt;Height&amp;lt;/label&amp;gt;
                                &amp;lt;input type="number" id="height-text" value="1024" min="512" max="2048" step="64"&amp;gt;
                            &amp;lt;/div&amp;gt;
                        &amp;lt;/div&amp;gt;
                        &amp;lt;div style="margin-top: 10px; display: flex; gap: 5px; flex-wrap: wrap;"&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(1024, 1024, 'text')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;1:1 (1024x1024)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(1344, 768, 'text')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;16:9 (1344x768)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(768, 1344, 'text')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;9:16 (768x1344)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(1216, 832, 'text')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;3:2 (1216x832)&amp;lt;/button&amp;gt;
                        &amp;lt;/div&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Quick presets for common sizes (SD 3.5 compatible)&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;button type="submit" id="btn-text"&amp;gt;Generate Image&amp;lt;/button&amp;gt;
                &amp;lt;/form&amp;gt;

                &amp;lt;div id="result-text" class="result"&amp;gt;&amp;lt;/div&amp;gt;
            &amp;lt;/div&amp;gt;

            &amp;lt;!-- Image to Image Tab --&amp;gt;
            &amp;lt;div id="image-to-image" class="tab-content"&amp;gt;
                &amp;lt;form onsubmit="generateImageToImage(event)"&amp;gt;
                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;API Endpoint&amp;lt;/label&amp;gt;
                        &amp;lt;input type="text" id="api-endpoint-image" placeholder="https://your-api-gateway-url/generate" required&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;API Key (Required)&amp;lt;/label&amp;gt;
                        &amp;lt;input type="password" id="api-key-image" placeholder="Your API key" required&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Get your API key from CDK output or AWS Console&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Model&amp;lt;/label&amp;gt;
                        &amp;lt;select id="model-image" style="width: 100%; padding: 12px; border: 2px solid #e0e0e0; border-radius: 8px; font-size: 16px;"&amp;gt;
                            &amp;lt;option value="amazon.titan-image-generator-v2:0"&amp;gt;Amazon Titan Image Generator v2 (Recommended)&amp;lt;/option&amp;gt;
                            &amp;lt;option value="stability.sd3-5-large-v1:0"&amp;gt;Stable Diffusion 3.5 Large (Higher Quality)&amp;lt;/option&amp;gt;
                        &amp;lt;/select&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px;"&amp;gt;Both models support image-to-image transformation&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Upload Image&amp;lt;/label&amp;gt;
                        &amp;lt;div class="file-upload" id="drop-zone" onclick="document.getElementById('image-input').click()"&amp;gt;
                            &amp;lt;input type="file" id="image-input" accept="image/*" onchange="previewImage(event)"&amp;gt;
                            &amp;lt;p id="upload-text"&amp;gt;Click to upload or drag and drop&amp;lt;/p&amp;gt;
                            &amp;lt;p style="font-size: 14px; color: #666; margin-top: 5px;"&amp;gt;PNG, JPEG, WebP (max 5 MB)&amp;lt;/p&amp;gt;
                            &amp;lt;img id="preview" class="preview-image" style="display: none;"&amp;gt;
                        &amp;lt;/div&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Prompt&amp;lt;/label&amp;gt;
                        &amp;lt;textarea id="prompt-image" placeholder="Describe how you want to transform the image..." required&amp;gt;&amp;lt;/textarea&amp;gt;
                        &amp;lt;div style="margin-top: 8px; display: flex; gap: 8px;"&amp;gt;
                            &amp;lt;button type="button" onclick="showPromptTemplates()" 
                                    style="width: auto; padding: 8px 15px; font-size: 14px; background: #6c757d;"&amp;gt;
                                💡 Example Prompts
                            &amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="showStylePresets('image')" 
                                    style="width: auto; padding: 8px 15px; font-size: 14px; background: #8b5cf6;"&amp;gt;
                                🎨 Style Presets
                            &amp;lt;/button&amp;gt;
                        &amp;lt;/div&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Negative Prompt (Optional - what to avoid)&amp;lt;/label&amp;gt;
                        &amp;lt;textarea id="negative-prompt-image" placeholder="e.g., blurry, low quality, distorted, ugly..." rows="2"&amp;gt;&amp;lt;/textarea&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Specify what you DON'T want in the image&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;div class="form-group"&amp;gt;
                        &amp;lt;label&amp;gt;Image Size&amp;lt;/label&amp;gt;
                        &amp;lt;div class="size-inputs"&amp;gt;
                            &amp;lt;div&amp;gt;
                                &amp;lt;label&amp;gt;Width&amp;lt;/label&amp;gt;
                                &amp;lt;input type="number" id="width-image" value="1024" min="320" max="2048" step="64"&amp;gt;
                            &amp;lt;/div&amp;gt;
                            &amp;lt;div&amp;gt;
                                &amp;lt;label&amp;gt;Height&amp;lt;/label&amp;gt;
                                &amp;lt;input type="number" id="height-image" value="1024" min="320" max="2048" step="64"&amp;gt;
                            &amp;lt;/div&amp;gt;
                        &amp;lt;/div&amp;gt;
                        &amp;lt;div style="margin-top: 10px; display: flex; gap: 5px; flex-wrap: wrap;"&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(1024, 1024, 'image')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;1:1 (1024x1024)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(1344, 768, 'image')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;16:9 (1344x768)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(768, 1344, 'image')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;9:16 (768x1344)&amp;lt;/button&amp;gt;
                            &amp;lt;button type="button" onclick="setDimensions(512, 512, 'image')" style="width: auto; padding: 5px 10px; font-size: 12px;"&amp;gt;Small (512x512)&amp;lt;/button&amp;gt;
                        &amp;lt;/div&amp;gt;
                        &amp;lt;small style="color: #666; font-size: 12px; margin-top: 5px; display: block;"&amp;gt;Quick presets (Titan v2 only)&amp;lt;/small&amp;gt;
                    &amp;lt;/div&amp;gt;

                    &amp;lt;button type="submit" id="btn-image"&amp;gt;Transform Image&amp;lt;/button&amp;gt;
                &amp;lt;/form&amp;gt;

                &amp;lt;div id="result-image" class="result"&amp;gt;&amp;lt;/div&amp;gt;
            &amp;lt;/div&amp;gt;

            &amp;lt;div class="privacy-note"&amp;gt;
                🔒 &amp;lt;strong&amp;gt;Privacy:&amp;lt;/strong&amp;gt; All images are generated using AWS Bedrock in your AWS account. 
                No data is sent to third parties. Images are stored privately in your S3 bucket with encryption.
            &amp;lt;/div&amp;gt;
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;

    &amp;lt;script&amp;gt;
        let uploadedImageBase64 = null;

        // Load saved API endpoint and show recent images on page load
        document.addEventListener('DOMContentLoaded', function() {
            // Load saved API endpoint
            const savedEndpoint = localStorage.getItem('apiEndpoint');
            if (savedEndpoint) {
                document.getElementById('api-endpoint-text').value = savedEndpoint;
                document.getElementById('api-endpoint-image').value = savedEndpoint;
            }

            // Load saved API key
            const savedApiKey = localStorage.getItem('apiKey');
            if (savedApiKey) {
                document.getElementById('api-key-text').value = savedApiKey;
                document.getElementById('api-key-image').value = savedApiKey;
            }

            // Show recent images
            showRecentImages();

            // Load example prompts
            loadPromptTemplates();
        });

        function switchTab(tabName) {
            document.querySelectorAll('.tab').forEach(tab =&amp;gt; tab.classList.remove('active'));
            document.querySelectorAll('.tab-content').forEach(content =&amp;gt; content.classList.remove('active'));

            event.target.classList.add('active');
            document.getElementById(tabName).classList.add('active');
        }

        function saveApiEndpoint(endpoint) {
            localStorage.setItem('apiEndpoint', endpoint);
        }

        function saveApiKey(apiKey) {
            if (apiKey) {
                localStorage.setItem('apiKey', apiKey);
            }
        }

        function saveToHistory(imageUrl, filename, prompt, modelId) {
            const history = JSON.parse(localStorage.getItem('imageHistory') || '[]');
            history.unshift({
                url: imageUrl,
                filename: filename,
                prompt: prompt,
                modelId: modelId,
                timestamp: new Date().toISOString()
            });
            // Keep only last 10 images
            if (history.length &amp;gt; 10) history.pop();
            localStorage.setItem('imageHistory', JSON.stringify(history));
        }

        function showRecentImages() {
            const history = JSON.parse(localStorage.getItem('imageHistory') || '[]');
            if (history.length === 0) return;

            const historyHtml = `
                &amp;lt;div style="margin-top: 30px; padding: 20px; background: #f9f9f9; border-radius: 8px;"&amp;gt;
                    &amp;lt;h3 style="margin-bottom: 15px; color: #333;"&amp;gt;Recent Generations&amp;lt;/h3&amp;gt;
                    &amp;lt;div style="display: grid; grid-template-columns: repeat(auto-fill, minmax(150px, 1fr)); gap: 10px;"&amp;gt;
                        ${history.slice(0, 5).map(item =&amp;gt; `
                            &amp;lt;div style="text-align: center;"&amp;gt;
                                &amp;lt;img src="${item.url}" style="width: 100%; border-radius: 4px; cursor: pointer;" 
                                     onclick="window.open('${item.url}', '_blank')" 
                                     title="${item.prompt.substring(0, 50)}..."&amp;gt;
                                &amp;lt;small style="display: block; margin-top: 5px; color: #666; font-size: 11px;"&amp;gt;
                                    ${new Date(item.timestamp).toLocaleDateString()}
                                &amp;lt;/small&amp;gt;
                            &amp;lt;/div&amp;gt;
                        `).join('')}
                    &amp;lt;/div&amp;gt;
                    &amp;lt;button type="button" onclick="clearHistory()" 
                            style="margin-top: 15px; width: auto; padding: 8px 15px; font-size: 14px; background: #dc3545;"&amp;gt;
                        Clear History
                    &amp;lt;/button&amp;gt;
                &amp;lt;/div&amp;gt;
            `;

            // Add to both result divs if they're empty
            const resultText = document.getElementById('result-text');
            const resultImage = document.getElementById('result-image');
            if (!resultText.innerHTML) resultText.innerHTML = historyHtml;
            if (!resultImage.innerHTML) resultImage.innerHTML = historyHtml;
        }

        function clearHistory() {
            if (confirm('Clear all image history?')) {
                localStorage.removeItem('imageHistory');
                document.getElementById('result-text').innerHTML = '';
                document.getElementById('result-image').innerHTML = '';
            }
        }

        function loadPromptTemplates() {
            const templates = [
                "A serene mountain landscape at sunset with vibrant colors",
                "A futuristic city with flying cars and neon lights",
                "A cozy coffee shop interior with warm lighting",
                "An astronaut floating in space with Earth in background",
                "A magical forest with glowing mushrooms and fireflies",
                "A steampunk airship flying through clouds",
                "A minimalist modern living room with large windows",
                "A cyberpunk street market at night with rain",
                "A peaceful zen garden with cherry blossoms",
                "An underwater scene with colorful coral and fish"
            ];

            window.promptTemplates = templates;

            // Style presets with prompt modifiers
            window.stylePresets = {
                "Photorealistic": {
                    suffix: ", photorealistic, highly detailed, 8k, professional photography",
                    negative: "cartoon, anime, painting, drawing, illustration, low quality"
                },
                "Digital Art": {
                    suffix: ", digital art, artstation, concept art, smooth, sharp focus",
                    negative: "photo, photograph, realistic, low quality, blurry"
                },
                "Oil Painting": {
                    suffix: ", oil painting, canvas, brushstrokes, artistic, masterpiece",
                    negative: "photo, digital, 3d render, low quality"
                },
                "Anime": {
                    suffix: ", anime style, manga, cel shaded, vibrant colors",
                    negative: "realistic, photo, 3d, western cartoon, low quality"
                },
                "Watercolor": {
                    suffix: ", watercolor painting, soft colors, artistic, flowing",
                    negative: "photo, digital, harsh lines, low quality"
                },
                "3D Render": {
                    suffix: ", 3d render, octane render, unreal engine, highly detailed",
                    negative: "2d, flat, painting, sketch, low quality"
                },
                "Sketch": {
                    suffix: ", pencil sketch, hand drawn, artistic, detailed linework",
                    negative: "photo, color, painted, low quality"
                },
                "Cyberpunk": {
                    suffix: ", cyberpunk style, neon lights, futuristic, dystopian, high tech",
                    negative: "medieval, natural, rustic, low quality"
                },
                "Fantasy": {
                    suffix: ", fantasy art, magical, ethereal, epic, detailed",
                    negative: "modern, realistic, mundane, low quality"
                },
                "Minimalist": {
                    suffix: ", minimalist, clean, simple, modern, elegant",
                    negative: "cluttered, busy, complex, ornate, low quality"
                }
            };
        }

        function applyStylePreset(style, tabType) {
            const preset = window.stylePresets[style];
            if (!preset) return;

            const promptField = tabType === 'text' 
                ? document.getElementById('prompt-text')
                : document.getElementById('prompt-image');

            const negativeField = tabType === 'text'
                ? document.getElementById('negative-prompt-text')
                : document.getElementById('negative-prompt-image');

            // Add style suffix to prompt if not already there
            let currentPrompt = promptField.value.trim();
            if (currentPrompt &amp;amp;&amp;amp; !currentPrompt.includes(preset.suffix)) {
                promptField.value = currentPrompt + preset.suffix;
            }

            // Set negative prompt
            if (negativeField &amp;amp;&amp;amp; !negativeField.value.trim()) {
                negativeField.value = preset.negative;
            }

            // Show selected style indicator
            showStyleIndicator(style, tabType);
        }

        function showStyleIndicator(style, tabType) {
            const indicatorId = `style-indicator-${tabType}`;

            // Remove existing indicator
            const existing = document.getElementById(indicatorId);
            if (existing) existing.remove();

            // Create new indicator
            const indicator = document.createElement('div');
            indicator.id = indicatorId;
            indicator.style.cssText = `
                margin-top: 10px;
                padding: 10px 15px;
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                color: white;
                border-radius: 8px;
                display: flex;
                align-items: center;
                justify-content: space-between;
                font-size: 14px;
                animation: slideIn 0.3s ease-out;
            `;

            indicator.innerHTML = `
                &amp;lt;span&amp;gt;🎨 &amp;lt;strong&amp;gt;Style:&amp;lt;/strong&amp;gt; ${style}&amp;lt;/span&amp;gt;
                &amp;lt;button onclick="resetStylePreset('${tabType}')" 
                        style="
                            background: rgba(255,255,255,0.2);
                            color: white;
                            border: none;
                            padding: 5px 12px;
                            border-radius: 5px;
                            cursor: pointer;
                            font-size: 12px;
                            transition: background 0.2s;
                        "
                        onmouseover="this.style.background='rgba(255,255,255,0.3)'"
                        onmouseout="this.style.background='rgba(255,255,255,0.2)'"&amp;gt;
                    ✕ Reset
                &amp;lt;/button&amp;gt;
            `;

            // Insert after the style presets button
            const buttonContainer = tabType === 'text'
                ? document.getElementById('prompt-text').nextElementSibling
                : document.getElementById('prompt-image').nextElementSibling;

            buttonContainer.parentNode.insertBefore(indicator, buttonContainer.nextSibling);
        }

        function resetStylePreset(tabType) {
            const promptField = tabType === 'text' 
                ? document.getElementById('prompt-text')
                : document.getElementById('prompt-image');

            const negativeField = tabType === 'text'
                ? document.getElementById('negative-prompt-text')
                : document.getElementById('negative-prompt-image');

            // Remove style suffixes from prompt
            let currentPrompt = promptField.value;
            Object.values(window.stylePresets || {}).forEach(preset =&amp;gt; {
                currentPrompt = currentPrompt.replace(preset.suffix, '').trim();
            });
            promptField.value = currentPrompt;

            // Clear negative prompt if it matches a preset
            const currentNegative = negativeField.value;
            const isPresetNegative = Object.values(window.stylePresets || {})
                .some(preset =&amp;gt; preset.negative === currentNegative);
            if (isPresetNegative) {
                negativeField.value = '';
            }

            // Remove indicator
            const indicator = document.getElementById(`style-indicator-${tabType}`);
            if (indicator) indicator.remove();
        }

        function showStylePresets(tabType) {
            const styles = Object.keys(window.stylePresets || {});

            // Create modal HTML
            const modalHtml = `
                &amp;lt;div id="style-modal" style="
                    position: fixed;
                    top: 0;
                    left: 0;
                    width: 100%;
                    height: 100%;
                    background: rgba(0,0,0,0.7);
                    display: flex;
                    align-items: center;
                    justify-content: center;
                    z-index: 1000;
                " onclick="if(event.target.id === 'style-modal') this.remove()"&amp;gt;
                    &amp;lt;div style="
                        background: white;
                        border-radius: 12px;
                        padding: 30px;
                        max-width: 600px;
                        max-height: 80vh;
                        overflow-y: auto;
                        box-shadow: 0 20px 60px rgba(0,0,0,0.3);
                    " onclick="event.stopPropagation()"&amp;gt;
                        &amp;lt;h2 style="margin: 0 0 20px 0; color: #333;"&amp;gt;🎨 Choose a Style Preset&amp;lt;/h2&amp;gt;
                        &amp;lt;div style="display: grid; gap: 10px;"&amp;gt;
                            ${styles.map((style, i) =&amp;gt; {
                                const preset = window.stylePresets[style];
                                return `
                                    &amp;lt;button onclick="applyStylePreset('${style}', '${tabType}'); document.getElementById('style-modal').remove();" 
                                            style="
                                                padding: 15px;
                                                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                                                color: white;
                                                border: none;
                                                border-radius: 8px;
                                                cursor: pointer;
                                                text-align: left;
                                                transition: transform 0.2s;
                                                font-size: 14px;
                                            "
                                            onmouseover="this.style.transform='translateY(-2px)'"
                                            onmouseout="this.style.transform='translateY(0)'"&amp;gt;
                                        &amp;lt;strong style="font-size: 16px; display: block; margin-bottom: 5px;"&amp;gt;${style}&amp;lt;/strong&amp;gt;
                                        &amp;lt;small style="opacity: 0.9; display: block;"&amp;gt;Adds: ${preset.suffix.substring(0, 60)}...&amp;lt;/small&amp;gt;
                                    &amp;lt;/button&amp;gt;
                                `;
                            }).join('')}
                        &amp;lt;/div&amp;gt;
                        &amp;lt;button onclick="document.getElementById('style-modal').remove()" 
                                style="
                                    margin-top: 20px;
                                    width: 100%;
                                    padding: 12px;
                                    background: #6c757d;
                                    color: white;
                                    border: none;
                                    border-radius: 8px;
                                    cursor: pointer;
                                    font-size: 14px;
                                "&amp;gt;
                            Cancel
                        &amp;lt;/button&amp;gt;
                    &amp;lt;/div&amp;gt;
                &amp;lt;/div&amp;gt;
            `;

            // Add modal to page
            document.body.insertAdjacentHTML('beforeend', modalHtml);
        }

        function showPromptTemplates() {
            const templates = window.promptTemplates || [];
            const templateList = templates.map((t, i) =&amp;gt; `${i + 1}. ${t}`).join('\n');
            const selected = prompt(`Choose a prompt template (enter number 1-${templates.length}):\n\n${templateList}`);

            if (selected &amp;amp;&amp;amp; !isNaN(selected)) {
                const index = parseInt(selected) - 1;
                if (index &amp;gt;= 0 &amp;amp;&amp;amp; index &amp;lt; templates.length) {
                    const activeTab = document.querySelector('.tab-content.active');
                    const promptField = activeTab.querySelector('textarea');
                    if (promptField) {
                        promptField.value = templates[index];
                    }
                }
            }
        }

        // Initialize drag and drop after DOM is loaded
        document.addEventListener('DOMContentLoaded', function() {
            const dropZone = document.getElementById('drop-zone');

            if (dropZone) {
                ['dragenter', 'dragover', 'dragleave', 'drop'].forEach(eventName =&amp;gt; {
                    dropZone.addEventListener(eventName, preventDefaults, false);
                });

                ['dragenter', 'dragover'].forEach(eventName =&amp;gt; {
                    dropZone.addEventListener(eventName, highlight, false);
                });

                ['dragleave', 'drop'].forEach(eventName =&amp;gt; {
                    dropZone.addEventListener(eventName, unhighlight, false);
                });

                dropZone.addEventListener('drop', handleDrop, false);
            }
        });

        function preventDefaults(e) {
            e.preventDefault();
            e.stopPropagation();
        }

        function highlight(e) {
            e.currentTarget.classList.add('dragover');
        }

        function unhighlight(e) {
            e.currentTarget.classList.remove('dragover');
        }

        function handleDrop(e) {
            const dt = e.dataTransfer;
            const files = dt.files;

            if (files.length &amp;gt; 0) {
                const file = files[0];

                // Validate file size (5 MB limit)
                const maxSize = 5 * 1024 * 1024;
                if (file.size &amp;gt; maxSize) {
                    alert('Image size must be less than 5 MB. Please resize or compress your image.');
                    return;
                }

                // Validate file type
                const validTypes = ['image/png', 'image/jpeg', 'image/jpg', 'image/webp'];
                if (!validTypes.includes(file.type)) {
                    alert('Only PNG, JPEG, and WebP formats are supported.');
                    return;
                }

                const reader = new FileReader();
                reader.onload = function(readerEvent) {
                    const img = new Image();
                    img.onload = function() {
                        // Validate dimensions
                        const minDim = 320;
                        const maxDim = 2048;

                        if (img.width &amp;lt; minDim || img.height &amp;lt; minDim) {
                            alert(`Image dimensions must be at least ${minDim}x${minDim} pixels. Current: ${img.width}x${img.height}`);
                            return;
                        }

                        if (img.width &amp;gt; maxDim || img.height &amp;gt; maxDim) {
                            alert(`Image dimensions must not exceed ${maxDim}x${maxDim} pixels. Current: ${img.width}x${img.height}`);
                            return;
                        }

                        // All validations passed
                        const preview = document.getElementById('preview');
                        preview.src = readerEvent.target.result;
                        preview.style.display = 'block';
                        uploadedImageBase64 = readerEvent.target.result.split(',')[1];
                    };
                    img.src = readerEvent.target.result;
                };
                reader.readAsDataURL(file);
            }
        }

        function previewImage(event) {
            const file = event.target.files[0];
            if (file) {
                // Validate file size (5 MB limit)
                const maxSize = 5 * 1024 * 1024; // 5 MB
                if (file.size &amp;gt; maxSize) {
                    alert('Image size must be less than 5 MB. Please resize or compress your image.');
                    event.target.value = '';
                    return;
                }

                // Validate file type
                const validTypes = ['image/png', 'image/jpeg', 'image/jpg', 'image/webp'];
                if (!validTypes.includes(file.type)) {
                    alert('Only PNG, JPEG, and WebP formats are supported.');
                    event.target.value = '';
                    return;
                }

                const reader = new FileReader();
                reader.onload = function(e) {
                    const img = new Image();
                    img.onload = function() {
                        // Validate dimensions
                        const minDim = 320;
                        const maxDim = 2048;

                        if (img.width &amp;lt; minDim || img.height &amp;lt; minDim) {
                            alert(`Image dimensions must be at least ${minDim}x${minDim} pixels. Current: ${img.width}x${img.height}`);
                            event.target.value = '';
                            return;
                        }

                        if (img.width &amp;gt; maxDim || img.height &amp;gt; maxDim) {
                            alert(`Image dimensions must not exceed ${maxDim}x${maxDim} pixels. Current: ${img.width}x${img.height}`);
                            event.target.value = '';
                            return;
                        }

                        // All validations passed
                        const preview = document.getElementById('preview');
                        preview.src = e.target.result;
                        preview.style.display = 'block';
                        uploadedImageBase64 = e.target.result.split(',')[1];
                    };
                    img.src = e.target.result;
                };
                reader.readAsDataURL(file);
            }
        }

        function validateDimensions(width, height, modelId) {
            // Basic validation
            if (width &amp;lt; 512 || height &amp;lt; 512) {
                return `Image dimensions too small: ${width}x${height}. Minimum size is 512x512 pixels.`;
            }
            if (width &amp;gt; 2048 || height &amp;gt; 2048) {
                return `Image dimensions too large: ${width}x${height}. Maximum size is 2048x2048 pixels.`;
            }

            // Aspect-ratio validation is intentionally disabled: SDXL accepts
            // arbitrary dimensions, so the Stable Diffusion ratio check below
            // is short-circuited with `false &amp;amp;&amp;amp;` and kept only for reference.
            if (false &amp;amp;&amp;amp; modelId.includes('stable-diffusion')) {
                const ratio = (width / height).toFixed(2);
                const validRatios = {
                    '1.00': '1:1 (e.g., 1024x1024)',
                    '1.78': '16:9 (e.g., 1344x768)',
                    '2.40': '21:9 (e.g., 1536x640)',
                    '0.67': '2:3 (e.g., 832x1216)',
                    '1.46': '3:2 (e.g., 1216x832)',
                    '0.80': '4:5 (e.g., 896x1120)',
                    '1.25': '5:4 (e.g., 1120x896)',
                    '0.56': '9:16 (e.g., 768x1344)',
                    '0.42': '9:21 (e.g., 640x1536)'
                };

                const isValid = Object.keys(validRatios).some(validRatio =&amp;gt; 
                    Math.abs(parseFloat(ratio) - parseFloat(validRatio)) &amp;lt; 0.05
                );

                if (!isValid) {
                    const supported = Object.values(validRatios).join(', ');
                    return `Stable Diffusion 3.5 requires standard aspect ratios.\n\nYour dimensions: ${width}x${height} (ratio ${ratio})\n\nSupported ratios:\n${supported}`;
                }
            }

            return null; // Valid
        }

        async function generateTextToImage(event) {
            event.preventDefault();

            const apiEndpoint = document.getElementById('api-endpoint-text').value;
            const apiKey = document.getElementById('api-key-text').value;
            const modelId = document.getElementById('model-text').value;
            const prompt = document.getElementById('prompt-text').value;
            const negativePrompt = document.getElementById('negative-prompt-text').value;
            const width = parseInt(document.getElementById('width-text').value);
            const height = parseInt(document.getElementById('height-text').value);
            const resultDiv = document.getElementById('result-text');
            const button = document.getElementById('btn-text');

            // Client-side validation
            const validationError = validateDimensions(width, height, modelId);
            if (validationError) {
                alert(validationError);
                return;
            }

            // Save API endpoint and key for future use
            saveApiEndpoint(apiEndpoint);
            saveApiKey(apiKey);

            button.disabled = true;
            resultDiv.innerHTML = '&amp;lt;div class="status loading"&amp;gt;Generating image... This may take up to 60 seconds.&amp;lt;/div&amp;gt;&amp;lt;div class="loader"&amp;gt;&amp;lt;/div&amp;gt;';

            try {
                // Create abort controller for timeout
                const controller = new AbortController();
                const timeoutId = setTimeout(() =&amp;gt; controller.abort(), 120000); // 2 minute timeout

                // Build headers
                const headers = {
                    'Content-Type': 'application/json',
                };
                if (apiKey) {
                    headers['X-API-Key'] = apiKey;
                }

                const response = await fetch(apiEndpoint, {
                    method: 'POST',
                    headers: headers,
                    body: JSON.stringify({
                        model_id: modelId,
                        prompt: prompt,
                        negative_prompt: negativePrompt,
                        width: width,
                        height: height
                    }),
                    signal: controller.signal
                });

                clearTimeout(timeoutId);

                if (!response.ok) {
                    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
                }

                const data = await response.json();

                if (data.success) {
                    // Save to history
                    saveToHistory(data.image_url, data.filename, prompt, modelId);

                    resultDiv.innerHTML = `
                        &amp;lt;div class="status success"&amp;gt;${data.message}&amp;lt;/div&amp;gt;
                        ${data.filename ? `&amp;lt;p style="font-size: 14px; color: #666; margin-top: 10px;"&amp;gt;&amp;lt;strong&amp;gt;Filename:&amp;lt;/strong&amp;gt; ${data.filename}&amp;lt;/p&amp;gt;` : ''}
                        &amp;lt;img src="${data.image_url}" class="result-image" alt="Generated image"&amp;gt;
                        &amp;lt;p style="margin-top: 10px; font-size: 14px; color: #666;"&amp;gt;
                            &amp;lt;a href="${data.image_url}" download="${data.filename || 'generated-image.png'}" style="color: #10b981;"&amp;gt;Download Image&amp;lt;/a&amp;gt;
                        &amp;lt;/p&amp;gt;
                    `;
                } else {
                    resultDiv.innerHTML = `&amp;lt;div class="status error"&amp;gt;Error: ${data.error || 'Unknown error'}&amp;lt;/div&amp;gt;`;
                }
            } catch (error) {
                if (error.name === 'AbortError') {
                    resultDiv.innerHTML = `&amp;lt;div class="status error"&amp;gt;⏱️ Request timeout. Image generation took too long. Please try again.&amp;lt;/div&amp;gt;`;
                } else {
                    resultDiv.innerHTML = `
                        &amp;lt;div class="status error"&amp;gt;
                            ❌ &amp;lt;strong&amp;gt;Error:&amp;lt;/strong&amp;gt; ${error.message}
                            &amp;lt;br&amp;gt;&amp;lt;small style="display: block; margin-top: 10px;"&amp;gt;
                                💡 &amp;lt;strong&amp;gt;Troubleshooting:&amp;lt;/strong&amp;gt;&amp;lt;br&amp;gt;
                                • Check that your API endpoint is correct&amp;lt;br&amp;gt;
                                • Ensure CORS is enabled in API Gateway&amp;lt;br&amp;gt;
                                • Verify Lambda has Bedrock permissions&amp;lt;br&amp;gt;
                                • Check CloudWatch logs for details
                            &amp;lt;/small&amp;gt;
                        &amp;lt;/div&amp;gt;
                    `;
                }
            } finally {
                button.disabled = false;
            }
        }

        async function generateImageToImage(event) {
            event.preventDefault();

            if (!uploadedImageBase64) {
                alert('Please upload an image first');
                return;
            }

            const apiEndpoint = document.getElementById('api-endpoint-image').value;
            const apiKey = document.getElementById('api-key-image').value;
            const modelId = document.getElementById('model-image').value;
            const prompt = document.getElementById('prompt-image').value;
            const negativePrompt = document.getElementById('negative-prompt-image').value;
            const width = parseInt(document.getElementById('width-image').value);
            const height = parseInt(document.getElementById('height-image').value);
            const resultDiv = document.getElementById('result-image');
            const button = document.getElementById('btn-image');

            // Client-side validation (image-to-image has min 320x320)
            if (width &amp;lt; 320 || height &amp;lt; 320) {
                alert(`Image-to-image dimensions too small: ${width}x${height}. Minimum size is 320x320 pixels.`);
                return;
            }
            if (width &amp;gt; 2048 || height &amp;gt; 2048) {
                alert(`Image dimensions too large: ${width}x${height}. Maximum size is 2048x2048 pixels.`);
                return;
            }

            // Save API endpoint and key for future use
            saveApiEndpoint(apiEndpoint);
            saveApiKey(apiKey);

            button.disabled = true;
            resultDiv.innerHTML = '&amp;lt;div class="status loading"&amp;gt;Transforming image... This may take up to 60 seconds.&amp;lt;/div&amp;gt;&amp;lt;div class="loader"&amp;gt;&amp;lt;/div&amp;gt;';

            try {
                // Create abort controller for timeout
                const controller = new AbortController();
                const timeoutId = setTimeout(() =&amp;gt; controller.abort(), 120000); // 2 minute timeout

                // Build headers
                const headers = {
                    'Content-Type': 'application/json',
                };
                if (apiKey) {
                    headers['X-API-Key'] = apiKey;
                }

                const response = await fetch(apiEndpoint, {
                    method: 'POST',
                    headers: headers,
                    body: JSON.stringify({
                        model_id: modelId,
                        prompt: prompt,
                        negative_prompt: negativePrompt,
                        input_image: uploadedImageBase64,
                        width: width,
                        height: height
                    }),
                    signal: controller.signal
                });

                clearTimeout(timeoutId);

                if (!response.ok) {
                    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
                }

                const data = await response.json();

                if (data.success) {
                    resultDiv.innerHTML = `
                        &amp;lt;div class="status success"&amp;gt;${data.message}&amp;lt;/div&amp;gt;
                        ${data.filename ? `&amp;lt;p style="font-size: 14px; color: #666; margin-top: 10px;"&amp;gt;&amp;lt;strong&amp;gt;Filename:&amp;lt;/strong&amp;gt; ${data.filename}&amp;lt;/p&amp;gt;` : ''}
                        &amp;lt;img src="${data.image_url}" class="result-image" alt="Generated image"&amp;gt;
                        &amp;lt;p style="margin-top: 10px; font-size: 14px; color: #666;"&amp;gt;
                            &amp;lt;a href="${data.image_url}" download="${data.filename || 'generated-image.png'}" style="color: #10b981;"&amp;gt;Download Image&amp;lt;/a&amp;gt;
                        &amp;lt;/p&amp;gt;
                    `;
                } else {
                    resultDiv.innerHTML = `&amp;lt;div class="status error"&amp;gt;Error: ${data.error || 'Unknown error'}&amp;lt;/div&amp;gt;`;
                }
            } catch (error) {
                if (error.name === 'AbortError') {
                    resultDiv.innerHTML = `&amp;lt;div class="status error"&amp;gt;Request timeout. Image generation took too long. Please try again.&amp;lt;/div&amp;gt;`;
                } else {
                    resultDiv.innerHTML = `&amp;lt;div class="status error"&amp;gt;Error: ${error.message}&amp;lt;br&amp;gt;&amp;lt;small&amp;gt;Check that your API endpoint is correct and CORS is enabled.&amp;lt;/small&amp;gt;&amp;lt;/div&amp;gt;`;
                }
            } finally {
                button.disabled = false;
            }
        }
    &amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
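The JSON body the form submits to the API Gateway endpoint can also be assembled outside the browser, for example from a Node script, re-using the same 512–2048 bounds check the UI applies before calling `fetch`. A minimal sketch (the `model_id` value is a placeholder, not a value taken from this deployment):

```javascript
// Build the request payload expected by the generation endpoint,
// applying the same client-side dimension bounds as the web UI.
function buildGeneratePayload({ modelId, prompt, negativePrompt = '', width, height }) {
    if (width < 512 || height < 512 || width > 2048 || height > 2048) {
        throw new RangeError(`Dimensions ${width}x${height} outside the 512-2048 range`);
    }
    return {
        model_id: modelId,          // e.g. whichever Bedrock model your stack enables
        prompt: prompt,
        negative_prompt: negativePrompt,
        width: width,
        height: height
    };
}

// Example: a valid payload serializes cleanly for a POST body.
const payload = buildGeneratePayload({
    modelId: 'placeholder-model-id',
    prompt: 'A serene mountain landscape at sunset',
    width: 1024,
    height: 1024
});
console.log(JSON.stringify(payload));
```

This mirrors the validation in `validateDimensions` and the body built in `generateTextToImage`, so a script and the browser UI reject the same out-of-range requests before they reach Lambda.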


&lt;p&gt;&lt;strong&gt;&lt;em&gt;error.html&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;meta charset="UTF-8"&amp;gt;
    &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&amp;gt;
    &amp;lt;title&amp;gt;Error - AI Image Generator&amp;lt;/title&amp;gt;
    &amp;lt;style&amp;gt;
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
            background: linear-gradient(135deg, #10b981 0%, #059669 100%);
            min-height: 100vh;
            display: flex;
            align-items: center;
            justify-content: center;
            padding: 20px;
        }

        .error-container {
            background: white;
            border-radius: 12px;
            padding: 60px 40px;
            box-shadow: 0 10px 40px rgba(0,0,0,0.1);
            text-align: center;
            max-width: 600px;
            width: 100%;
        }

        .error-icon {
            font-size: 80px;
            margin-bottom: 20px;
        }

        h1 {
            color: #333;
            font-size: 2.5rem;
            margin-bottom: 15px;
        }

        .error-code {
            color: #10b981;
            font-size: 1.2rem;
            font-weight: 600;
            margin-bottom: 20px;
        }

        p {
            color: #666;
            font-size: 1.1rem;
            line-height: 1.6;
            margin-bottom: 30px;
        }

        .button-group {
            display: flex;
            gap: 15px;
            justify-content: center;
            flex-wrap: wrap;
        }

        .btn {
            padding: 15px 30px;
            border: none;
            border-radius: 8px;
            font-size: 16px;
            font-weight: 600;
            cursor: pointer;
            text-decoration: none;
            transition: all 0.3s;
            display: inline-block;
        }

        .btn-primary {
            background: #10b981;
            color: white;
        }

        .btn-primary:hover {
            background: #059669;
            transform: translateY(-2px);
            box-shadow: 0 5px 15px rgba(16, 185, 129, 0.4);
        }

        .btn-secondary {
            background: #f5f5f5;
            color: #333;
        }

        .btn-secondary:hover {
            background: #e0e0e0;
        }

        .error-details {
            margin-top: 30px;
            padding: 20px;
            background: #f9f9f9;
            border-radius: 8px;
            text-align: left;
        }

        .error-details h3 {
            color: #333;
            font-size: 1rem;
            margin-bottom: 10px;
        }

        .error-details ul {
            list-style: none;
            padding: 0;
        }

        .error-details li {
            color: #666;
            font-size: 0.9rem;
            padding: 5px 0;
            padding-left: 20px;
            position: relative;
        }

        .error-details li:before {
            content: "•";
            position: absolute;
            left: 0;
            color: #10b981;
            font-weight: bold;
        }

        @media (max-width: 600px) {
            .error-container {
                padding: 40px 20px;
            }

            h1 {
                font-size: 2rem;
            }

            .error-icon {
                font-size: 60px;
            }

            .button-group {
                flex-direction: column;
            }

            .btn {
                width: 100%;
            }
        }
    &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;div class="error-container"&amp;gt;
        &amp;lt;div class="error-icon"&amp;gt;⚠️&amp;lt;/div&amp;gt;
        &amp;lt;h1&amp;gt;Oops! Something went wrong&amp;lt;/h1&amp;gt;
        &amp;lt;p class="error-code"&amp;gt;Error 404 - Page Not Found&amp;lt;/p&amp;gt;
        &amp;lt;p&amp;gt;The page you're looking for doesn't exist or has been moved.&amp;lt;/p&amp;gt;

        &amp;lt;div class="button-group"&amp;gt;
            &amp;lt;a href="/" class="btn btn-primary"&amp;gt;Go to Home&amp;lt;/a&amp;gt;
            &amp;lt;button onclick="history.back()" class="btn btn-secondary"&amp;gt;Go Back&amp;lt;/button&amp;gt;
        &amp;lt;/div&amp;gt;

        &amp;lt;div class="error-details"&amp;gt;
            &amp;lt;h3&amp;gt;Common Issues:&amp;lt;/h3&amp;gt;
            &amp;lt;ul&amp;gt;
                &amp;lt;li&amp;gt;The URL might be mistyped&amp;lt;/li&amp;gt;
                &amp;lt;li&amp;gt;The page may have been removed or renamed&amp;lt;/li&amp;gt;
                &amp;lt;li&amp;gt;Your session might have expired&amp;lt;/li&amp;gt;
                &amp;lt;li&amp;gt;The resource you're looking for is not available&amp;lt;/li&amp;gt;
            &amp;lt;/ul&amp;gt;
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;

    &amp;lt;script&amp;gt;
        // Get error code from URL if available
        const urlParams = new URLSearchParams(window.location.search);
        const errorCode = urlParams.get('code');

        if (errorCode) {
            const errorCodeElement = document.querySelector('.error-code');
            const errorMessages = {
                '403': 'Error 403 - Access Forbidden',
                '404': 'Error 404 - Page Not Found',
                '500': 'Error 500 - Internal Server Error',
                '503': 'Error 503 - Service Unavailable'
            };

            if (errorMessages[errorCode]) {
                errorCodeElement.textContent = errorMessages[errorCode];
            }
        }
    &amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 S3 buckets
&lt;/h4&gt;

&lt;p&gt;Create two buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend bucket: hosts the website&lt;/li&gt;
&lt;li&gt;Image Storage bucket: stores generated images

&lt;ul&gt;
&lt;li&gt;Private bucket&lt;/li&gt;
&lt;li&gt;Create a lifecycle policy that deletes images after 7 days to save costs
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import { Construct } from 'constructs';
import * as path from 'path';

export interface S3BucketsStackProps extends cdk.StackProps {
  stackName: string;
  region: string;
  accountId: string;
  envName: string;
  imagesBucketName: string;
  frontendBucketName: string;
  imageExpiration: number;
}

export class S3BucketsStack extends cdk.Stack {
  public readonly imagesBucket: s3.Bucket;
  public readonly frontendBucket: s3.Bucket;

  constructor(scope: Construct, id: string, props: S3BucketsStackProps) {
    const { region, accountId, envName } = props;
    const updatedProps = {
      env: {
        region: region,
        account: accountId,
      },
      ...props,
    };
    super(scope, id, updatedProps);

    // Determine removal policy based on environment
    const isProduction = envName.toLowerCase() === 'prod';
    const imagesRemovalPolicy = isProduction
      ? cdk.RemovalPolicy.RETAIN
      : cdk.RemovalPolicy.DESTROY;
    const autoDeleteImages = !isProduction;

    // Private S3 bucket for generated images
    this.imagesBucket = new s3.Bucket(this, 'ImagesBucket', {
      bucketName: props.imagesBucketName,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      versioned: true,
      enforceSSL: true,
      removalPolicy: imagesRemovalPolicy,
      autoDeleteObjects: autoDeleteImages,
      lifecycleRules: [
        {
          // Expire current versions
          expiration: cdk.Duration.days(props.imageExpiration),
          noncurrentVersionExpiration: cdk.Duration.days(props.imageExpiration),
          enabled: true,
        },
        {
          // Remove expired object delete markers
          expiredObjectDeleteMarker: true,
          enabled: true,
        },
      ],
    });

    // Frontend S3 bucket with website hosting
    this.frontendBucket = new s3.Bucket(this, 'FrontendBucket', {
      bucketName: props.frontendBucketName,
      websiteIndexDocument: 'index.html',
      websiteErrorDocument: 'error.html',
      publicReadAccess: true,
      blockPublicAccess: new s3.BlockPublicAccess({
        blockPublicAcls: false,
        blockPublicPolicy: false,
        ignorePublicAcls: false,
        restrictPublicBuckets: false,
      }),
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // Deploy frontend files
    new s3deploy.BucketDeployment(this, 'DeployFrontend', {
      sources: [s3deploy.Source.asset(path.join(__dirname, '../frontend'))],
      destinationBucket: this.frontendBucket,
    });

    // Outputs
    new cdk.CfnOutput(this, 'ImagesBucketName', {
      value: this.imagesBucket.bucketName,
      description: 'S3 bucket for generated images',
      exportName: `ImagesBucketName-${envName}`,
    });

    new cdk.CfnOutput(this, 'ImagesBucketArn', {
      value: this.imagesBucket.bucketArn,
      description: 'S3 bucket ARN for generated images',
      exportName: `ImagesBucketArn-${envName}`,
    });

    new cdk.CfnOutput(this, 'FrontendBucketName', {
      value: this.frontendBucket.bucketName,
      description: 'S3 bucket for frontend',
      exportName: `FrontendBucketName-${envName}`,
    });

    new cdk.CfnOutput(this, 'FrontendUrl', {
      value: this.frontendBucket.bucketWebsiteUrl,
      description: 'Frontend website URL',
      exportName: `FrontendUrl-${envName}`,
    });
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.3 Lambda function
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Create the Lambda handler:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
AWS Lambda function for AI image generation using Amazon Bedrock.

This function handles both text-to-image and image-to-image generation
using Amazon Titan Image Generator v2 and Stable Diffusion 3.5 Large models.

Features:
- Text-to-image generation from prompts
- Image-to-image transformation
- Negative prompts support
- Style presets
- Automatic dimension validation
- CloudWatch metrics and structured logging
- S3 storage with presigned URLs

Environment Variables:
- BUCKET_NAME: S3 bucket for storing generated images
- MODEL_ID: Default Bedrock model ID (fallback)
- ENVIRONMENT: Deployment environment (dev/prod)

Author: KateVu
Repository: https://github.com/KateVu/aws-cdk-genai-image
"""

import json
import boto3
import os
import base64
import uuid
import logging
from datetime import datetime
from time import time

# Configure structured logging for CloudWatch
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize AWS service clients
bedrock_runtime = boto3.client('bedrock-runtime')  # For AI image generation
s3_client = boto3.client('s3')  # For image storage
cloudwatch = boto3.client('cloudwatch')  # For custom metrics

# Load environment variables
BUCKET_NAME = os.environ['BUCKET_NAME']
MODEL_ID = os.environ['MODEL_ID']
ENVIRONMENT = os.environ.get('ENVIRONMENT', 'dev')

def handler(event, context):
    """
    Main Lambda handler for image generation requests.

    Processes API Gateway requests to generate images using AWS Bedrock models.
    Supports both text-to-image and image-to-image generation with optional
    negative prompts and style presets.

    Args:
        event: API Gateway event containing request body with:
            - prompt (str, required): Text description of desired image
            - negative_prompt (str, optional): What to avoid in the image
            - model_id (str, optional): Bedrock model ID to use
            - width (int, optional): Image width (512-2048, default 1024)
            - height (int, optional): Image height (512-2048, default 1024)
            - input_image (str, optional): Base64 encoded image for transformation
        context: Lambda context object with request metadata

    Returns:
        dict: API Gateway response with:
            - statusCode: HTTP status code (200, 400, or 500)
            - headers: CORS headers
            - body: JSON with success status, image_url, filename, or error

    Raises:
        ValueError: For validation errors (returns 400)
        Exception: For server errors (returns 500)
    """
    start_time = time()
    request_id = context.aws_request_id

    body = {}  # Initialised up front so the except blocks below can safely call body.get()

    try:
        # Parse request body
        body = json.loads(event.get('body', '{}'))

        prompt = body.get('prompt', '')
        negative_prompt = body.get('negative_prompt', '')
        input_image = body.get('input_image')
        width = body.get('width', 1024)
        height = body.get('height', 1024)
        model_id = body.get('model_id', MODEL_ID)
        generation_type = 'image-to-image' if input_image else 'text-to-image'

        logger.info(
            "Processing image generation request",
            extra={
                'request_id': request_id,
                'model_id': model_id,
                'generation_type': generation_type,
                'dimensions': f"{width}x{height}",
                'prompt_length': len(prompt)
            }
        )

        if not prompt:
            logger.warning("Request rejected: missing prompt",
                           extra={'request_id': request_id})
            publish_metric('ValidationError', 1, model_id, generation_type)
            return response(400, {'error': 'Prompt is required'})

        # Validate dimensions
        try:
            validate_dimensions(width, height, model_id, generation_type)
        except ValueError as e:
            logger.warning(
                f"Dimension validation failed: {str(e)}",
                extra={'request_id': request_id}
            )
            publish_metric('ValidationError', 1, model_id, generation_type)
            return response(400, {'error': str(e)})

        # Generate image
        if input_image:
            image_data = generate_image_to_image(
                input_image, prompt, negative_prompt, width, height, model_id
            )
        else:
            image_data = generate_text_to_image(
                prompt, negative_prompt, width, height, model_id
            )

        # Save to S3
        image_url, filename = save_to_s3(image_data, prompt)

        # Calculate duration and publish metrics
        duration = time() - start_time
        logger.info(
            "Image generated successfully",
            extra={
                'request_id': request_id,
                'model_id': model_id,
                'generation_type': generation_type,
                'image_filename': filename,
                'duration_seconds': round(duration, 2)
            }
        )

        publish_metric('GenerationSuccess', 1, model_id, generation_type)
        publish_metric('GenerationDuration', duration, model_id,
                       generation_type, unit='Seconds')

        return response(200, {
            'success': True,
            'image_url': image_url,
            'filename': filename,
            'message': 'Image generated successfully'
        })

    except ValueError as e:
        # Validation errors - return 400
        duration = time() - start_time
        logger.warning(
            f"Validation error: {str(e)}",
            extra={
                'request_id': request_id,
                'duration_seconds': round(duration, 2)
            }
        )
        publish_metric('ValidationError', 1,
                       body.get('model_id', MODEL_ID),
                       'image-to-image' if body.get('input_image')
                       else 'text-to-image')
        return response(400, {'error': str(e)})

    except Exception as e:
        # Server errors - return 500
        duration = time() - start_time
        error_type = type(e).__name__

        logger.error(
            "Image generation failed",
            extra={
                'request_id': request_id,
                'error_type': error_type,
                'error_message': str(e),
                'duration_seconds': round(duration, 2)
            },
            exc_info=True
        )

        publish_metric('GenerationError', 1,
                       body.get('model_id', MODEL_ID),
                       'image-to-image' if body.get('input_image')
                       else 'text-to-image')

        return response(500, {'error': str(e)})

def validate_dimensions(width, height, model_id, generation_type):
    """
    Validate image dimensions before calling Bedrock API.

    Ensures dimensions meet model requirements and prevents API errors.

    Args:
        width (int): Desired image width in pixels
        height (int): Desired image height in pixels
        model_id (str): Bedrock model identifier
        generation_type (str): 'text-to-image' or 'image-to-image'

    Raises:
        ValueError: If dimensions are invalid with descriptive error message

    Validation Rules:
        - Both dimensions must be integers
        - Minimum: 512x512 pixels (320x320 for Titan image-to-image)
        - Maximum: 2048x2048 pixels
        - SD3 requires specific aspect ratios (handled by get_aspect_ratio)
    """
    # Basic dimension validation
    if not isinstance(width, int) or not isinstance(height, int):
        raise ValueError(
            "Width and height must be integers. "
            f"Received: width={width}, height={height}"
        )

    if width &amp;lt; 512 or height &amp;lt; 512:
        raise ValueError(
            f"Image dimensions too small: {width}x{height}. "
            "Minimum size is 512x512 pixels."
        )

    if width &amp;gt; 2048 or height &amp;gt; 2048:
        raise ValueError(
            f"Image dimensions too large: {width}x{height}. "
            "Maximum size is 2048x2048 pixels."
        )

    # Stable Diffusion XL supports flexible dimensions (no aspect ratio restrictions)

    # Image-to-image specific validation for Titan
    if generation_type == 'image-to-image' and 'titan' in model_id.lower():
        if width &amp;lt; 320 or height &amp;lt; 320:
            raise ValueError(
                f"Image-to-image dimensions too small: {width}x{height}. "
                "Minimum size for image-to-image is 320x320 pixels."
            )

    logger.info(f"Dimension validation passed: {width}x{height}")

def publish_metric(metric_name, value, model_id, generation_type,
                   unit='Count'):
    """
    Publish custom CloudWatch metric for monitoring.

    Tracks generation success/failure rates, duration, and validation errors
    with dimensions for filtering by environment, model, and generation type.

    Args:
        metric_name (str): Name of the metric (e.g., 'GenerationSuccess')
        value (float): Metric value to publish
        model_id (str): Bedrock model identifier
        generation_type (str): 'text-to-image' or 'image-to-image'
        unit (str): CloudWatch unit (default: 'Count', also 'Seconds')

    Metrics Published:
        - GenerationSuccess: Count of successful generations
        - GenerationError: Count of failed generations
        - GenerationDuration: Time taken in seconds
        - ValidationError: Count of validation failures
    """
    try:
        cloudwatch.put_metric_data(
            Namespace='ImageGenerator',
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': unit,
                    'Dimensions': [
                        {'Name': 'Environment', 'Value': ENVIRONMENT},
                        {'Name': 'ModelId', 'Value': model_id},
                        {'Name': 'GenerationType', 'Value': generation_type}
                    ]
                }
            ]
        )
    except Exception as e:
        logger.warning(f"Failed to publish metric: {str(e)}")

def generate_text_to_image(prompt, negative_prompt, width, height, model_id):
    """
    Generate image from text prompt using AWS Bedrock.

    Supports multiple models with automatic API format detection:
    - Amazon Titan Image Generator v2: Uses taskType and imageGenerationConfig
    - Stable Diffusion 3.5 Large: Uses prompt and aspect_ratio
    - Legacy SDXL: Uses text_prompts array

    Args:
        prompt (str): Text description of desired image
        negative_prompt (str): What to avoid in the image (optional)
        width (int): Image width in pixels
        height (int): Image height in pixels
        model_id (str): Bedrock model identifier

    Returns:
        bytes: Generated image data in PNG format

    Raises:
        Exception: If Bedrock API call fails or image generation fails
    """
    logger.info(f"Generating text-to-image with model: {model_id}")

    # Determine model type and format request accordingly
    if 'titan' in model_id.lower():
        # Amazon Titan Image Generator format
        text_params = {"text": prompt}
        if negative_prompt:
            text_params["negativeText"] = negative_prompt

        request_body = {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": text_params,
            "imageGenerationConfig": {
                "numberOfImages": 1,
                "width": width,
                "height": height,
                "cfgScale": 8.0
            }
        }
    elif 'sd3' in model_id.lower():
        # Stable Diffusion 3.5 Large format
        aspect_ratio = get_aspect_ratio(width, height)
        request_body = {
            "prompt": prompt,
            "aspect_ratio": aspect_ratio,
            "seed": 0,
            "output_format": "png"
        }
        if negative_prompt:
            request_body["negative_prompt"] = negative_prompt
    else:
        # Legacy Stable Diffusion format (SDXL)
        request_body = {
            "text_prompts": [{"text": prompt}],
            "cfg_scale": 10,
            "seed": 0,
            "steps": 50,
            "width": width,
            "height": height
        }

    try:
        bedrock_response = bedrock_runtime.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body)
        )
        response_body = json.loads(bedrock_response['body'].read())
        logger.info(f"Bedrock text-to-image API call successful: {model_id}")
    except Exception as e:
        logger.error(
            f"Bedrock text-to-image failed for {model_id}: {str(e)}",
            exc_info=True
        )
        raise

    # Extract image based on model type
    if 'titan' in model_id.lower() or 'sd3' in model_id.lower():
        # Titan and SD3 both return a base64 list under 'images'
        image_b64 = response_body['images'][0]
    else:
        # Legacy SDXL format
        if response_body.get('result') == 'success':
            image_b64 = response_body['artifacts'][0]['base64']
        else:
            raise Exception("Image generation failed")

    return base64.b64decode(image_b64)

def get_aspect_ratio(width, height):
    """
    Convert width/height to closest supported aspect ratio for SD3.

    SD3 models require specific aspect ratios. This function finds the
    closest supported ratio to the requested dimensions.

    Args:
        width (int): Desired image width
        height (int): Desired image height

    Returns:
        str: Aspect ratio string (e.g., "1:1", "16:9", "9:16")

    Supported Ratios:
        1:1 (1024x1024), 16:9 (1344x768), 21:9 (1536x640),
        2:3 (832x1216), 3:2 (1216x832), 4:5 (896x1120),
        5:4 (1120x896), 9:16 (768x1344), 9:21 (640x1536)
    """
    ratio = width / height

    # Map to closest supported aspect ratio
    aspect_ratios = {
        1.0: "1:1",      # 1024x1024
        1.75: "16:9",    # 1344x768
        2.4: "21:9",     # 1536x640
        0.67: "2:3",     # 832x1216
        1.46: "3:2",     # 1216x832
        0.8: "4:5",      # 896x1120
        1.25: "5:4",     # 1120x896
        0.57: "9:16",    # 768x1344
        0.42: "9:21"     # 640x1536
    }

    # Find closest ratio
    closest_ratio = min(aspect_ratios.keys(), key=lambda x: abs(x - ratio))
    return aspect_ratios[closest_ratio]

def generate_image_to_image(input_image_b64, prompt, negative_prompt,
                            width, height, model_id):
    """
    Transform an existing image using AWS Bedrock (image-to-image).

    Takes an input image and transforms it according to the prompt while
    preserving some of the original image structure.

    Supported Models:
    - Amazon Titan v2: Uses IMAGE_VARIATION task with strength control
    - Stable Diffusion 3.5 Large: Uses image and strength parameters
    - Legacy SDXL: Uses init_image with image_strength

    Args:
        input_image_b64 (str): Base64 encoded input image
        prompt (str): Text description of desired transformation
        negative_prompt (str): What to avoid in the transformation (optional)
        width (int): Output image width in pixels
        height (int): Output image height in pixels
        model_id (str): Bedrock model identifier

    Returns:
        bytes: Transformed image data in PNG format

    Raises:
        Exception: If Bedrock API call fails or transformation fails

    Note:
        Strength parameter (0.0-1.0) controls how much to transform:
        - 0.0: Keep original image
        - 0.7: Balanced transformation (default for SD3)
        - 1.0: Maximum transformation
    """
    logger.info(f"Generating image-to-image with model: {model_id}")

    # Determine model type and format request accordingly
    if 'sd3' in model_id.lower():
        # Stable Diffusion 3.5 Large format for image-to-image
        request_body = {
            "prompt": prompt,
            "image": input_image_b64,
            "strength": 0.7,  # How much to transform (0.0-1.0)
            "output_format": "png"
        }
        if negative_prompt:
            request_body["negative_prompt"] = negative_prompt
    elif 'titan' in model_id.lower():
        # Amazon Titan Image Generator format
        variation_params = {
            "text": prompt,
            "images": [input_image_b64]
        }
        if negative_prompt:
            variation_params["negativeText"] = negative_prompt

        request_body = {
            "taskType": "IMAGE_VARIATION",
            "imageVariationParams": variation_params,
            "imageGenerationConfig": {
                "numberOfImages": 1,
                "width": width,
                "height": height,
                "cfgScale": 8.0
            }
        }
    else:
        # Legacy Stable Diffusion format (SDXL)
        request_body = {
            "text_prompts": [{"text": prompt}],
            "init_image": input_image_b64,
            "cfg_scale": 10,
            "image_strength": 0.5,
            "seed": 0,
            "steps": 50,
            "width": width,
            "height": height
        }

    try:
        bedrock_response = bedrock_runtime.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body)
        )
        response_body = json.loads(bedrock_response['body'].read())
        logger.info(
            f"Bedrock image-to-image API call successful: {model_id}"
        )
    except Exception as e:
        logger.error(
            f"Bedrock image-to-image failed for {model_id}: {str(e)}",
            exc_info=True
        )
        raise

    # Extract image based on model type
    if 'titan' in model_id.lower() or 'sd3' in model_id.lower():
        # Titan and SD3 both return a base64 list under 'images'
        image_b64 = response_body['images'][0]
    else:
        # Legacy SDXL format
        if response_body.get('result') == 'success':
            image_b64 = response_body['artifacts'][0]['base64']
        else:
            raise Exception("Image generation failed")

    return base64.b64decode(image_b64)

def save_to_s3(image_data, prompt):
    """
    Save generated image to private S3 bucket with encryption.

    Creates a descriptive filename from the prompt and timestamp,
    stores the image with server-side encryption, and generates
    a presigned URL for temporary access.

    Args:
        image_data (bytes): PNG image data to save
        prompt (str): Original prompt used for generation

    Returns:
        tuple: (presigned_url, filename)
            - presigned_url (str): Temporary URL valid for 1 hour
            - filename (str): Generated filename with timestamp and prompt

    Raises:
        Exception: If S3 upload or presigned URL generation fails

    Filename Format:
        YYYYMMDD-HHMMSS_sanitized-prompt_uuid.png
        Example: 20241119-143022_mountain-landscape_a1b2c3d4.png

    S3 Configuration:
        - Server-side encryption: AES256
        - Metadata: prompt (first 100 chars) and generation timestamp
        - Presigned URL expiration: 1 hour (3600 seconds)
    """
    logger.info(f"Saving image to S3 bucket: {BUCKET_NAME}")

    # Generate filename with date and UUID
    now = datetime.utcnow()
    date_str = now.strftime('%Y%m%d-%H%M%S')
    unique_id = str(uuid.uuid4())[:8]

    # Sanitize prompt for filename (first 30 chars, alphanumeric only)
    safe_prompt = ''.join(
        c for c in prompt[:30] if c.isalnum() or c in (' ', '-', '_'))
    safe_prompt = safe_prompt.replace(' ', '-').lower()

    # Create filename: YYYYMMDD-HHMMSS_prompt_uuid.png
    if safe_prompt:
        filename = f"{date_str}_{safe_prompt}_{unique_id}.png"
    else:
        filename = f"{date_str}_{unique_id}.png"

    key = f"generated-images/{filename}"

    try:
        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=key,
            Body=image_data,
            ContentType='image/png',
            ServerSideEncryption='AES256',
            Metadata={
                'prompt': prompt[:100],
                'generated_at': now.isoformat()
            }
        )
        logger.info(f"Image saved successfully: {key}")
    except Exception as e:
        logger.error(f"Failed to save image to S3: {str(e)}", exc_info=True)
        raise

    # Generate presigned URL (expires in 1 hour)
    try:
        presigned_url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': BUCKET_NAME, 'Key': key},
            ExpiresIn=3600
        )
        logger.info("Presigned URL generated successfully")
    except Exception as e:
        logger.error(
            f"Failed to generate presigned URL: {str(e)}",
            exc_info=True
        )
        raise

    return presigned_url, filename

def response(status_code, body):
    """
    Format API Gateway response with CORS headers.

    Args:
        status_code (int): HTTP status code (200, 400, 500)
        body (dict): Response body to be JSON serialized

    Returns:
        dict: API Gateway response format with:
            - statusCode: HTTP status code
            - headers: CORS headers for cross-origin requests
            - body: JSON stringified response body

    CORS Configuration:
        - Access-Control-Allow-Origin: * (all origins)
        - Access-Control-Allow-Methods: POST, OPTIONS
        - Access-Control-Allow-Headers: Content-Type
    """
    return {
        'statusCode': status_code,
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Headers': 'Content-Type',
            'Access-Control-Allow-Methods': 'POST, OPTIONS'
        },
        'body': json.dumps(body)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
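&lt;p&gt;The closest-ratio selection in &lt;code&gt;get_aspect_ratio&lt;/code&gt; is easy to sanity-check in isolation. The sketch below reproduces the mapping table from the handler above as standalone Python:&lt;/p&gt;

```python
# Standalone reproduction of the get_aspect_ratio mapping used by the Lambda.
# SD3 accepts only a fixed set of aspect ratios, so arbitrary width/height
# requests are snapped to the nearest supported ratio.
SD3_ASPECT_RATIOS = {
    1.0: "1:1",      # 1024x1024
    1.75: "16:9",    # 1344x768
    2.4: "21:9",     # 1536x640
    0.67: "2:3",     # 832x1216
    1.46: "3:2",     # 1216x832
    0.8: "4:5",      # 896x1120
    1.25: "5:4",     # 1120x896
    0.57: "9:16",    # 768x1344
    0.42: "9:21",    # 640x1536
}

def get_aspect_ratio(width: int, height: int) -> str:
    """Snap a requested width/height to the closest SD3-supported ratio."""
    ratio = width / height
    closest = min(SD3_ASPECT_RATIOS, key=lambda r: abs(r - ratio))
    return SD3_ASPECT_RATIOS[closest]

print(get_aspect_ratio(1344, 768))   # 16:9
print(get_aspect_ratio(768, 1344))   # 9:16
```

&lt;p&gt;Note that the snapping is silent: a request for, say, 2048x512 still succeeds but maps to 21:9, so the returned image may not match the requested dimensions exactly.&lt;/p&gt;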


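&lt;p&gt;The filename logic in &lt;code&gt;save_to_s3&lt;/code&gt; can be exercised without touching AWS. The helper below mirrors that logic, extracted here purely for illustration:&lt;/p&gt;

```python
from datetime import datetime

def build_image_filename(prompt: str, now: datetime, unique_id: str) -> str:
    """Mirror of the filename logic in save_to_s3: timestamp_prompt_uuid.png."""
    date_str = now.strftime('%Y%m%d-%H%M%S')
    # Keep only alphanumerics, spaces, hyphens and underscores
    # from the first 30 characters of the prompt
    safe_prompt = ''.join(
        c for c in prompt[:30] if c.isalnum() or c in (' ', '-', '_'))
    safe_prompt = safe_prompt.replace(' ', '-').lower()
    if safe_prompt:
        return f"{date_str}_{safe_prompt}_{unique_id}.png"
    return f"{date_str}_{unique_id}.png"

print(build_image_filename(
    "A mountain landscape!", datetime(2024, 11, 19, 14, 30, 22), "a1b2c3d4"))
# 20241119-143022_a-mountain-landscape_a1b2c3d4.png
```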

&lt;ul&gt;
&lt;li&gt;Provision the Lambda function in AWS:

&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create IAM role for Lambda function with Bedrock permissions
    const lambdaRole = new iam.Role(this, 'ImageGeneratorLambdaRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      description:
        'Role for image generator Lambda function to access Bedrock and S3',
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AWSLambdaBasicExecutionRole'
        ),
      ],
    });

    // Add Bedrock invoke permissions for all models
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: ['bedrock:InvokeModel'],
        resources: [
          `arn:aws:bedrock:${region}::foundation-model/amazon.titan-image-generator-v2:0`,
          `arn:aws:bedrock:${region}::foundation-model/stability.sd3-5-large-v1:0`,
        ],
      })
    );

    // Add AWS Marketplace permissions for first-time model enablement
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'aws-marketplace:ViewSubscriptions',
          'aws-marketplace:Subscribe',
        ],
        resources: ['*'],
      })
    );

    // Grant S3 permissions
    props.imagesBucket.grantReadWrite(lambdaRole);

    // Grant CloudWatch metrics permissions
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: ['cloudwatch:PutMetricData'],
        resources: ['*'],
      })
    );

    // Create CloudWatch Log Group for Lambda
    const logGroup = new logs.LogGroup(this, 'ImageGeneratorLogGroup', {
      logGroupName: `/aws/lambda/${envName}-image-generator`,
      retention: props.logRetentionDays as logs.RetentionDays,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // Lambda function for image generation with automatic dependency bundling
    this.lambdaFunction = new lambdaPython.PythonFunction(
      this,
      'ImageGeneratorFunction',
      {
        functionName: `${envName}-image-generator`,
        runtime: lambda.Runtime.PYTHON_3_11,
        entry: path.join(__dirname, '../lambda'),
        index: 'index.py',
        handler: 'handler',
        role: lambdaRole,
        timeout: cdk.Duration.seconds(props.lambdaTimeout),
        memorySize: props.lambdaMemorySize,
        environment: {
          BUCKET_NAME: props.imagesBucket.bucketName,
          MODEL_ID: 'amazon.titan-image-generator-v2:0', // Default fallback
          ENVIRONMENT: envName,
        },
        logGroup: logGroup,
        description:
          'Lambda function to generate images using Amazon Bedrock',
      }
    );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.4 API Gateway
&lt;/h4&gt;

&lt;p&gt;Provision an API Gateway with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A POST endpoint&lt;/li&gt;
&lt;li&gt;CORS enabled&lt;/li&gt;
&lt;li&gt;An API key for security&lt;/li&gt;
&lt;li&gt;Throttling to prevent overspending
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create API Gateway REST API
    this.api = new apigateway.RestApi(this, 'ImageGeneratorApi', {
      restApiName: `${envName}-image-generator-API`,
      description: 'API for AI image generation using AWS Bedrock',
      defaultCorsPreflightOptions: {
        allowOrigins: props.corsAllowOrigins,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: [
          'Content-Type',
          'X-Amz-Date',
          'Authorization',
          'X-Api-Key',
        ],
      },
    });

    // Create Lambda integration
    const imageGeneratorIntegration = new apigateway.LambdaIntegration(
      this.lambdaFunction
    );

    // Add POST method to the API with API key requirement
    const generateResource = this.api.root.addResource('generate');
    generateResource.addMethod('POST', imageGeneratorIntegration, {
      apiKeyRequired: true,
    });

    // Create API Key
    const apiKey = this.api.addApiKey('ImageGeneratorApiKey', {
      apiKeyName: `${envName}-image-generator-key`,
      description: `API key for ${envName} image generator`,
    });

    // Create Usage Plan with rate limiting and quotas
    const usagePlan = this.api.addUsagePlan('ImageGeneratorUsagePlan', {
      name: `${envName}-image-generator-usage-plan`,
      description: 'Usage plan with rate limiting for image generation',
      throttle: {
        rateLimit: 10, // 10 requests per second
        burstLimit: 20, // Allow bursts up to 20 requests
      },
      quota: {
        limit: 10000, // 10,000 requests per month
        period: apigateway.Period.MONTH,
      },
    });

    // Associate API key with usage plan
    usagePlan.addApiKey(apiKey);
    usagePlan.addApiStage({
      stage: this.api.deploymentStage,
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
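&lt;p&gt;On the client side, exceeding the usage plan's rate limit surfaces as an HTTP 429 response. As an illustrative sketch (not part of the stack above), a caller can absorb occasional throttling with exponential backoff:&lt;/p&gt;

```python
import time

def call_with_backoff(send, max_retries=4, base_delay=0.5):
    """Retry send() while it reports throttling (HTTP 429), backing off exponentially.

    `send` is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429 or attempt == max_retries:
            return status, body
        # Wait 0.5s, 1s, 2s, ... before the next attempt
        time.sleep(base_delay * (2 ** attempt))
```

&lt;p&gt;With the usage plan above (10 requests/second, bursts of 20), a short backoff like this is usually enough for an interactive frontend.&lt;/p&gt;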



&lt;h4&gt;
  
  
  2.5 Connect everything in app.ts
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import { getAccountId, loadConfig } from '../lib/utils';
import { ImageGeneratorStack } from '../lib/image-generator-stack';
import { S3BucketsStack } from '../lib/s3-buckets-stack';

const configFolder = '../config/';
const accountFileName = 'aws_account.yaml';

// Define common tags
const commonTags = {
  createdby: 'KateVu',
  createdvia: 'AWS-CDK',
};

// Function to apply tags to a stack
function applyTags(stack: cdk.Stack, tags: Record&amp;lt;string, string&amp;gt;): void {
  Object.entries(tags).forEach(([key, value]) =&amp;gt; {
    cdk.Tags.of(stack).add(key, value);
  });
}

// Set up default value
const envName = process.env.ENVIRONMENT_NAME || 'kate';
const accountName = process.env.ACCOUNT_NAME || 'sandpit2';
const region = process.env.REGION || 'us-west-2'; // us-west-2 has both Titan v2 and SD 3.5 Large available
const aws_account_id = process.env.AWS_ACCOUNT_ID || 'none';

// Get aws account id
let accountId = aws_account_id;
if (aws_account_id == 'none') {
  accountId = getAccountId(accountName, configFolder, accountFileName);
}

// Load configuration
const config = loadConfig(envName);

const app = new cdk.App();

// Define bucket names with region
const imagesBucketName = `${envName}-image-generator-images-${region}`;
const frontendBucketName = `${envName}-image-generator-frontend-${region}`;

const s3BucketsStack = new S3BucketsStack(app, 'S3BucketsStack', {
  stackName: `aws-cdk-image-generator-s3-${envName}`,
  region: region,
  accountId: accountId,
  envName: envName,
  imagesBucketName: imagesBucketName,
  frontendBucketName: frontendBucketName,
  imageExpiration: config.imageExpiration,
});

const imageGeneratorStack = new ImageGeneratorStack(
  app,
  'ImageGeneratorStack',
  {
    stackName: `aws-cdk-image-generator-${envName}`,
    region: region,
    accountId: accountId,
    accountName: accountName,
    envName: envName,
    imagesBucket: s3BucketsStack.imagesBucket,
    lambdaMemorySize: config.lambdaMemorySize,
    lambdaTimeout: config.lambdaTimeout,
    logRetentionDays: config.logRetentionDays,
    corsAllowOrigins: config.corsAllowOrigins,

    /* For more information, see https://docs.aws.amazon.com/cdk/latest/guide/environments.html */
  }
);

imageGeneratorStack.addDependency(s3BucketsStack);

// Apply tags to both stacks
applyTags(s3BucketsStack, {
  ...commonTags,
  environment: envName,
});

applyTags(imageGeneratorStack, {
  ...commonTags,
  environment: envName,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.6 Deploy the app
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ensure valid credentials for the target AWS account&lt;/li&gt;
&lt;li&gt;Export the environment variables, or the app will use the defaults set in app.ts&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk deploy --all&lt;/code&gt; to deploy both stacks&lt;/li&gt;
&lt;/ul&gt;
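&lt;p&gt;The fallback logic in app.ts means any variable you do not export uses the hard-coded default (kate, sandpit2, us-west-2). A quick Python sketch of the same resolution and bucket-naming scheme, for illustration:&lt;/p&gt;

```python
def resolve_bucket_names(env: dict) -> tuple:
    """Mirror app.ts: unset variables fall back to defaults, then build bucket names."""
    env_name = env.get("ENVIRONMENT_NAME", "kate")
    region = env.get("REGION", "us-west-2")
    return (
        f"{env_name}-image-generator-images-{region}",
        f"{env_name}-image-generator-frontend-{region}",
    )

# With nothing exported, the defaults from app.ts apply:
print(resolve_bucket_names({}))
# Exported variables override them:
print(resolve_bucket_names({"ENVIRONMENT_NAME": "prod", "REGION": "ap-southeast-2"}))
```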

&lt;p&gt;From the output we can get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend link&lt;/li&gt;
&lt;li&gt;API Key ID and API Endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternatively, go to the CloudFormation console and get them from the Outputs tab.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test your app
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Get the frontend link from the deployment output or the CloudFormation console, or open the S3 bucket and check the Properties tab for the bucket's static website endpoint&lt;/li&gt;
&lt;li&gt;Update the API Endpoint and API Key. To get the API Key value, grab the API Key ID and run:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws apigateway get-api-key --api-key &amp;lt;API Key ID&amp;gt; --include-value --query 'value' --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
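&lt;p&gt;With the endpoint and key in hand, you can also exercise the API outside the frontend. Below is a hedged sketch using only the Python standard library; the &lt;code&gt;prompt&lt;/code&gt; payload field is an assumption about the Lambda's expected input, so adjust it to match your handler:&lt;/p&gt;

```python
import json
import urllib.request

def build_generate_request(api_endpoint: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build an authenticated POST to the /generate resource defined in the stack."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=api_endpoint.rstrip("/") + "/generate",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # API Gateway expects the key in this header
        },
    )

# To actually send it:
# with urllib.request.urlopen(build_generate_request(endpoint, key, "a red bicycle")) as resp:
#     print(json.load(resp))
```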



&lt;ul&gt;
&lt;li&gt;Open the link:

&lt;ul&gt;
&lt;li&gt;Update the API Endpoint and API Key&lt;/li&gt;
&lt;li&gt;Submit prompts and images, and verify the outputs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxicqbaa3q15wdn09c81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxicqbaa3q15wdn09c81.png" alt=" " width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvhbnf4a2640wx8vmvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvhbnf4a2640wx8vmvr.png" alt=" " width="800" height="1409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k56dnfh8texs2oa8dej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k56dnfh8texs2oa8dej.png" alt=" " width="800" height="1491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;So that's it: you now have your very own image generator app with total control over its infrastructure. Other than S3 storage costs, you only pay for the resources you use.&lt;br&gt;
With some simple tweaks, you can experiment with other models in the future. While the website already supports sending negative prompts to Bedrock, for enhanced security and content moderation you can consider integrating Bedrock Guardrails. For details, refer to &lt;a href="https://medium.com/aws-in-plain-english/building-a-summarizer-app-using-amazon-bedrock-and-bedrock-guardrails-using-aws-cdk-33600cbf5958" rel="noopener noreferrer"&gt;Building a Summarizer app using Amazon Bedrock and Bedrock Guardrails using AWS CDK&lt;/a&gt;.&lt;br&gt;
When you are done, clean up your resources with &lt;code&gt;cdk destroy --all&lt;/code&gt;.&lt;br&gt;
&lt;em&gt;If you want to retain any S3 bucket, update these parameters when creating the bucket:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     removalPolicy: imagesRemovalPolicy,
     autoDeleteObjects: autoDeleteImages,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Fbedrock%2F" rel="noopener noreferrer"&gt;Amazon Bedrock Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Fcdk%2Fapi%2Fv2%2Fdocs%2Faws-lambda-python-alpha-readme.html" rel="noopener noreferrer"&gt;aws-cdk/aws-lambda-python-alpha module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Fbedrock%2Flatest%2FAPIReference%2FAPI_runtime_InvokeModel.html" rel="noopener noreferrer"&gt;InvokeModel - Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Fbedrock%2Flatest%2Fuserguide%2Fmodel-lifecycle.html" rel="noopener noreferrer"&gt;Model lifecycle - Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Fbedrock%2Flatest%2Fuserguide%2Fmodel-parameters-diffusion-3-5-large.html" rel="noopener noreferrer"&gt;Stability.ai Stable Diffusion 3.5 Large - Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>awsbedrock</category>
      <category>awscdk</category>
      <category>awslambda</category>
      <category>awscommunity</category>
    </item>
    <item>
      <title>Building a Summarizer app using Amazon Bedrock and Bedrock Guardrails using AWS CDK</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Sat, 15 Nov 2025 10:39:55 +0000</pubDate>
      <link>https://dev.to/katevu/building-a-summarizer-app-using-amazon-bedrock-and-bedrock-guardrails-using-aws-cdk-56a7</link>
      <guid>https://dev.to/katevu/building-a-summarizer-app-using-amazon-bedrock-and-bedrock-guardrails-using-aws-cdk-56a7</guid>
      <description>&lt;p&gt;&lt;em&gt;Building your own text summarization app using Amazon Bedrock, and pairing it with Bedrock Guardrails so it doesn’t go rogue!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgplodie4ei59qmi7evn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgplodie4ei59qmi7evn.png" alt=" " width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Resources:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3: Hosts the frontend of the app (HTML)&lt;/li&gt;
&lt;li&gt;API Gateway: Provides a REST API endpoint to receive content and return the summary&lt;/li&gt;
&lt;li&gt;AWS Lambda: Processes the content, invokes Amazon Bedrock with the chosen model, routes the request through Bedrock Guardrails, and returns the output&lt;/li&gt;
&lt;li&gt;Amazon Bedrock: Provides access to AI foundation models. For this experiment we are using Claude 3 Haiku for its speed and cost-effectiveness&lt;/li&gt;
&lt;li&gt;Amazon Bedrock Guardrails: Implements safeguards customized to your application requirements and responsible AI policies&lt;/li&gt;
&lt;li&gt;IAM: Manages permissions for Lambda and Amazon Bedrock access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account: you will need it to deploy S3, Lambda, API Gateway, Bedrock, and Guardrails&lt;/li&gt;
&lt;li&gt;Environment setup: Make sure these are installed and working:

&lt;ul&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;AWS CDK Toolkit&lt;/li&gt;
&lt;li&gt;Docker: up and running; we will use it to bundle our Lambda function&lt;/li&gt;
&lt;li&gt;AWS Credentials: keep them handy so you can deploy the stacks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deploy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Get the model ID
&lt;/h3&gt;

&lt;p&gt;Good news! You do not need to manually grant access to serverless foundation models anymore. They are now automatically enabled for your AWS account.&lt;br&gt;
To get the model ID:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to Amazon Bedrock console&lt;/li&gt;
&lt;li&gt;Find the model you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this one, we are using Claude 3 Haiku because it's fast and cost-effective. The Model ID looks like this: &lt;code&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/code&gt;. You will need this later when updating your IAM policy, so your Lambda function can invoke the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk62j3eh2gm8ki8ajfxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk62j3eh2gm8ki8ajfxe.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Always check the pricing for the model you chose so you don't get any surprise charges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sozh97astsjaxotbrhx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sozh97astsjaxotbrhx.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;
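&lt;p&gt;Once you have the model ID, the InvokeModel request body for Claude 3 Haiku follows the Anthropic Messages format. A minimal sketch of building that body (the &lt;code&gt;max_tokens&lt;/code&gt; value and the summarization prompt wording are arbitrary choices for illustration):&lt;/p&gt;

```python
import json

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def build_invoke_body(content: str, max_tokens: int = 512) -> str:
    """Serialize an Anthropic Messages request body for Bedrock's InvokeModel API."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": f"Summarize the following text:\n\n{content}"}
        ],
    })

# The Lambda would pass this as the `body` argument of
# bedrock_runtime.invoke_model(modelId=MODEL_ID, body=build_invoke_body(text))
```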
&lt;h3&gt;
  
  
  2. Create the resource
&lt;/h3&gt;
&lt;h4&gt;
  
  
  2.1 Set up the frontend for our summarizer app
&lt;/h4&gt;

&lt;p&gt;We will create two html files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index.html: the main interface, where users can paste text and get the summary&lt;/li&gt;
&lt;li&gt;error.html: a simple page if something goes wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;index.html&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;meta charset="UTF-8" /&amp;gt;
    &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0" /&amp;gt;
    &amp;lt;title&amp;gt;Content Summarizer - AI-Powered Text Summary Tool&amp;lt;/title&amp;gt;
    &amp;lt;style&amp;gt;
      * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
      }

      body {
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        background: linear-gradient(135deg, #10b981 0%, #059669 100%);
        min-height: 100vh;
        display: flex;
        justify-content: center;
        align-items: center;
        padding: 20px;
      }

      .container {
        background: white;
        border-radius: 20px;
        box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
        width: 100%;
        max-width: 800px;
        height: 90vh;
        display: flex;
        flex-direction: column;
        overflow: hidden;
      }

      .header {
        background: linear-gradient(135deg, #10b981 0%, #059669 100%);
        color: white;
        padding: 25px;
        text-align: center;
      }

      .header h1 {
        font-size: 28px;
        margin-bottom: 10px;
      }

      .header p {
        opacity: 0.9;
        font-size: 14px;
      }

      .stats {
        display: flex;
        gap: 15px;
        justify-content: center;
        margin-top: 15px;
        flex-wrap: wrap;
      }

      .stat-item {
        background: rgba(255, 255, 255, 0.2);
        padding: 8px 15px;
        border-radius: 20px;
        font-size: 12px;
        font-weight: 500;
      }

      .chat-container {
        flex: 1;
        overflow-y: auto;
        padding: 20px;
        background: #f8f9fa;
      }

      .message {
        margin-bottom: 15px;
        display: flex;
        align-items: flex-start;
        animation: fadeIn 0.3s ease-in;
      }

      @keyframes fadeIn {
        from {
          opacity: 0;
          transform: translateY(10px);
        }
        to {
          opacity: 1;
          transform: translateY(0);
        }
      }

      .message.user {
        justify-content: flex-end;
      }

      .message-content {
        max-width: 70%;
        padding: 12px 16px;
        border-radius: 18px;
        word-wrap: break-word;
        white-space: pre-wrap;
      }

      .message.user .message-content {
        background: linear-gradient(135deg, #10b981 0%, #059669 100%);
        color: white;
        border-bottom-right-radius: 4px;
      }

      .message.bot .message-content {
        background: white;
        color: #333;
        border: 1px solid #e0e0e0;
        border-bottom-left-radius: 4px;
      }

      .message-subject {
        font-size: 11px;
        opacity: 0.7;
        margin-top: 5px;
        font-style: italic;
      }

      .input-container {
        padding: 20px;
        background: white;
        border-top: 1px solid #e0e0e0;
      }

      .input-wrapper {
        display: flex;
        gap: 10px;
        flex-direction: column;
      }

      #contentInput {
        width: 100%;
        padding: 12px 16px;
        border: 2px solid #e0e0e0;
        border-radius: 12px;
        font-size: 14px;
        outline: none;
        transition: border-color 0.3s;
        resize: vertical;
        min-height: 100px;
        font-family: inherit;
        margin-bottom: 10px;
      }

      #contentInput:focus {
        border-color: #10b981;
      }

      #summarizeButton {
        width: 100%;
        padding: 12px 30px;
        background: linear-gradient(135deg, #10b981 0%, #059669 100%);
        color: white;
        border: none;
        border-radius: 25px;
        font-size: 14px;
        font-weight: 600;
        cursor: pointer;
        transition: transform 0.2s, box-shadow 0.2s;
      }

      #summarizeButton:hover:not(:disabled) {
        transform: translateY(-2px);
        box-shadow: 0 5px 15px rgba(16, 185, 129, 0.4);
      }

      #summarizeButton:disabled {
        opacity: 0.6;
        cursor: not-allowed;
      }

      .loading {
        display: inline-block;
        width: 20px;
        height: 20px;
        border: 3px solid rgba(255, 255, 255, 0.3);
        border-radius: 50%;
        border-top-color: white;
        animation: spin 1s ease-in-out infinite;
      }

      @keyframes spin {
        to {
          transform: rotate(360deg);
        }
      }

      .error {
        background: #fee;
        color: #c33;
        padding: 12px 16px;
        border-radius: 8px;
        margin-bottom: 15px;
      }

      .info-box {
        background: #e3f2fd;
        border-left: 4px solid #2196f3;
        padding: 15px;
        margin-bottom: 20px;
        border-radius: 4px;
      }

      .info-box strong {
        color: #1976d2;
      }
    &amp;lt;/style&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;div class="container"&amp;gt;
      &amp;lt;div class="header"&amp;gt;
        &amp;lt;h1&amp;gt;📝 Content Summarizer&amp;lt;/h1&amp;gt;
        &amp;lt;p&amp;gt;AI-powered text summarization tool&amp;lt;/p&amp;gt;
        &amp;lt;div class="stats"&amp;gt;
          &amp;lt;span class="stat-item"&amp;gt;✨ Instant Summaries&amp;lt;/span&amp;gt;
          &amp;lt;span class="stat-item"&amp;gt;🎯 Key Points&amp;lt;/span&amp;gt;
          &amp;lt;span class="stat-item"&amp;gt;⚡ Fast &amp;amp; Accurate&amp;lt;/span&amp;gt;
        &amp;lt;/div&amp;gt;
      &amp;lt;/div&amp;gt;

      &amp;lt;div class="chat-container" id="chatContainer"&amp;gt;
        &amp;lt;div class="info-box"&amp;gt;
          &amp;lt;strong&amp;gt;Welcome!&amp;lt;/strong&amp;gt; Paste any text content below and I'll
          create a clear, concise summary for you. Perfect for articles,
          documents, reports, and more.
        &amp;lt;/div&amp;gt;
      &amp;lt;/div&amp;gt;

      &amp;lt;div class="input-container"&amp;gt;
        &amp;lt;textarea
          id="contentInput"
          placeholder="Paste your content here to summarize..."
        &amp;gt;&amp;lt;/textarea&amp;gt;
        &amp;lt;button id="summarizeButton" onclick="summarizeContent()"&amp;gt;
          Summarize
        &amp;lt;/button&amp;gt;
      &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;

    &amp;lt;script&amp;gt;
      // TODO: Replace with your actual API Gateway endpoint after deployment
      const API_ENDPOINT = 'YOUR_API_ENDPOINT_HERE';

      const chatContainer = document.getElementById('chatContainer');
      const contentInput = document.getElementById('contentInput');
      const summarizeButton = document.getElementById('summarizeButton');

      // Allow Ctrl+Enter to summarize
      contentInput.addEventListener('keydown', (e) =&amp;gt; {
        if (
          (e.ctrlKey || e.metaKey) &amp;amp;&amp;amp;
          e.key === 'Enter' &amp;amp;&amp;amp;
          !summarizeButton.disabled
        ) {
          summarizeContent();
        }
      });

      function addMessage(text, isUser, metadata = null) {
        const messageDiv = document.createElement('div');
        messageDiv.className = `message ${isUser ? 'user' : 'bot'}`;

        const contentDiv = document.createElement('div');
        contentDiv.className = 'message-content';
        contentDiv.textContent = text;

        if (metadata &amp;amp;&amp;amp; !isUser) {
          const metaDiv = document.createElement('div');
          metaDiv.className = 'message-subject';
          metaDiv.textContent = `${metadata.contentType} • ${metadata.wordCount} words`;
          contentDiv.appendChild(metaDiv);
        }

        messageDiv.appendChild(contentDiv);
        chatContainer.appendChild(messageDiv);
        chatContainer.scrollTop = chatContainer.scrollHeight;
      }

      function showError(message) {
        const errorDiv = document.createElement('div');
        errorDiv.className = 'error';
        errorDiv.textContent = `Error: ${message}`;
        chatContainer.appendChild(errorDiv);
        chatContainer.scrollTop = chatContainer.scrollHeight;
      }

      async function summarizeContent() {
        const content = contentInput.value.trim();

        if (!content) {
          return;
        }

        if (API_ENDPOINT === 'YOUR_API_ENDPOINT_HERE') {
          showError(
            'Please update the API_ENDPOINT in the JavaScript code with your deployed API Gateway URL'
          );
          return;
        }

        // Add user message (show full content)
        addMessage(
          `Content to summarize (${content.split(/\s+/).length} words):\n\n${content}`,
          true
        );
        contentInput.value = '';

        // Disable input while processing
        summarizeButton.disabled = true;
        summarizeButton.innerHTML = '&amp;lt;span class="loading"&amp;gt;&amp;lt;/span&amp;gt;';
        contentInput.disabled = true;

        try {
          const response = await fetch(API_ENDPOINT, {
            method: 'POST',
            headers: {
              'Content-Type': 'application/json',
            },
            body: JSON.stringify({ content }),
          });

          const data = await response.json();

          if (!response.ok) {
            throw new Error(
              data.error || data.details || 'Failed to get summary'
            );
          }

          // Add bot response
          addMessage(data.summary, false, {
            contentType: data.contentType,
            wordCount: data.wordCount,
          });
        } catch (error) {
          console.error('Error:', error);
          showError(
            error.message ||
              'Failed to get summary. Please check your API endpoint and ensure Bedrock is enabled.'
          );
        } finally {
          // Re-enable input
          summarizeButton.disabled = false;
          summarizeButton.textContent = 'Summarize';
          contentInput.disabled = false;
          contentInput.focus();
        }
      }
    &amp;lt;/script&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;em&gt;error.html&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html lang="en"&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;meta charset="UTF-8"&amp;gt;
    &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&amp;gt;
    &amp;lt;title&amp;gt;Error - Page Not Found&amp;lt;/title&amp;gt;
    &amp;lt;style&amp;gt;
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
            background: linear-gradient(135deg, #10b981 0%, #059669 100%);
            min-height: 100vh;
            display: flex;
            align-items: center;
            justify-content: center;
            padding: 20px;
        }
        .error-container {
            background: white;
            border-radius: 20px;
            padding: 60px 40px;
            text-align: center;
            max-width: 600px;
            box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
        }
        .error-code {
            font-size: 120px;
            font-weight: bold;
            color: #10b981;
            line-height: 1;
            margin-bottom: 20px;
        }
        h1 {
            font-size: 32px;
            color: #333;
            margin-bottom: 15px;
        }
        p {
            font-size: 18px;
            color: #666;
            margin-bottom: 30px;
            line-height: 1.6;
        }

    &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;div class="error-container"&amp;gt;
        &amp;lt;div class="error-code"&amp;gt;404&amp;lt;/div&amp;gt;
        &amp;lt;h1&amp;gt;Page Not Found&amp;lt;/h1&amp;gt;
        &amp;lt;p&amp;gt;Sorry, the page you're looking for doesn't exist or has been moved.&amp;lt;/p&amp;gt;
    &amp;lt;/div&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 Create S3 bucket stack
&lt;/h4&gt;

&lt;p&gt;We create an S3 bucket to host our frontend files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import { Construct } from 'constructs';

export interface S3BucketsStackProps extends cdk.StackProps {
  bucketNames: string[];
  enableWebsiteHosting?: boolean;
  websiteIndexDocument?: string;
  websiteErrorDocument?: string;
}

export class S3BucketsStack extends cdk.Stack {
  public readonly buckets: Map&amp;lt;string, s3.Bucket&amp;gt; = new Map();
  public websiteBucket?: s3.Bucket;

  constructor(scope: Construct, id: string, props: S3BucketsStackProps) {
    super(scope, id, props);

    const enableWebsite = props.enableWebsiteHosting ?? false;
    const indexDoc = props.websiteIndexDocument ?? 'index.html';
    const errorDoc = props.websiteErrorDocument ?? 'error.html';

    // Create S3 buckets
    props.bucketNames.forEach((bucketName) =&amp;gt; {
      const bucketConfig: s3.BucketProps = {
        bucketName: bucketName,
        versioned: false,
        removalPolicy: cdk.RemovalPolicy.DESTROY,
        autoDeleteObjects: true,
        encryption: s3.BucketEncryption.S3_MANAGED,
      };

      // Configure for website hosting if enabled
      if (enableWebsite) {
        Object.assign(bucketConfig, {
          publicReadAccess: true,
          blockPublicAccess: s3.BlockPublicAccess.BLOCK_ACLS,
          websiteIndexDocument: indexDoc,
          websiteErrorDocument: errorDoc,
        });
      } else {
        Object.assign(bucketConfig, {
          publicReadAccess: false,
          blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
        });
      }

      const bucket = new s3.Bucket(this, `${bucketName}-bucket`, bucketConfig);

      this.buckets.set(bucketName, bucket);

      // Store reference to website bucket
      if (enableWebsite &amp;amp;&amp;amp; !this.websiteBucket) {
        this.websiteBucket = bucket;
      }

      // Output bucket name
      new cdk.CfnOutput(this, `${bucketName}-BucketName`, {
        value: bucket.bucketName,
        description: `S3 Bucket: ${bucketName}`,
        exportName: `${bucketName}-BucketName`,
      });

      // Output website URL if hosting is enabled
      if (enableWebsite) {
        new cdk.CfnOutput(this, `${bucketName}-WebsiteURL`, {
          value: bucket.bucketWebsiteUrl,
          description: `Website URL for ${bucketName}`,
          exportName: `${bucketName}-WebsiteURL`,
        });
      }
    });
  }

  // Helper method to deploy website content
  public deployWebsite(sourcePath: string, destinationBucket?: s3.Bucket) {
    const targetBucket = destinationBucket ?? this.websiteBucket;

    if (!targetBucket) {
      throw new Error('No website bucket available for deployment');
    }

    return new s3deploy.BucketDeployment(this, 'DeployWebsite', {
      sources: [s3deploy.Source.asset(sourcePath)],
      destinationBucket: targetBucket,
    });
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.3 Create Amazon Bedrock Guardrails
&lt;/h4&gt;

&lt;p&gt;Let’s make our AI app behave by using Bedrock Guardrails to filter inappropriate content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create Bedrock Guardrail for content filtering
    const guardrail = new bedrock.CfnGuardrail(this, 'SummarizerGuardrail', {
      name: `${envName}-summarizer-guardrail`,
      description:
        'Guardrail for content summarizer to filter inappropriate content',
      blockedInputMessaging:
        'Sorry, I cannot process this content as it contains inappropriate material.',
      blockedOutputsMessaging: 'I apologize, but I cannot provide that summary.',

      // Content policy filters
      contentPolicyConfig: {
        filtersConfig: [
          {
            type: 'SEXUAL',
            inputStrength: 'HIGH',
            outputStrength: 'HIGH',
          },
          {
            type: 'VIOLENCE',
            inputStrength: 'HIGH',
            outputStrength: 'HIGH',
          },
          {
            type: 'HATE',
            inputStrength: 'HIGH',
            outputStrength: 'HIGH',
          },
          {
            type: 'INSULTS',
            inputStrength: 'MEDIUM',
            outputStrength: 'MEDIUM',
          },
          {
            type: 'MISCONDUCT',
            inputStrength: 'MEDIUM',
            outputStrength: 'MEDIUM',
          },
          {
            type: 'PROMPT_ATTACK',
            inputStrength: 'HIGH',
            outputStrength: 'NONE',
          },
        ],
      },

      // Topic policy to filter harmful content
      topicPolicyConfig: {
        topicsConfig: [
          {
            name: 'HarmfulContent',
            definition:
              'Content promoting illegal activities, violence, or harmful behavior',
            examples: [
              'How to make weapons',
              'Instructions for illegal activities',
              'Content promoting self-harm',
            ],
            type: 'DENY',
          },
        ],
      },
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
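&lt;p&gt;You can also exercise a guardrail on its own, without invoking a model, via the bedrock-runtime ApplyGuardrail API. A minimal sketch: the guardrail ID is a placeholder for your stack output, and the boto3 call is commented out so the snippet runs without AWS credentials.&lt;/p&gt;

```python
import json

# Hypothetical guardrail ID - use the value exported by the CDK stack
guardrail_id = 'YOUR_GUARDRAIL_ID'

# Request shape for the bedrock-runtime ApplyGuardrail API
apply_params = {
    'guardrailIdentifier': guardrail_id,
    'guardrailVersion': 'DRAFT',
    'source': 'INPUT',  # screen user input before it reaches the model
    'content': [{'text': {'text': 'Some user input to screen'}}],
}

# Uncomment to run against a deployed guardrail:
# import boto3
# bedrock_runtime = boto3.client('bedrock-runtime')
# response = bedrock_runtime.apply_guardrail(**apply_params)
# print(response['action'])

print(json.dumps(apply_params, indent=2))
```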



&lt;h4&gt;
  
  
  2.4 Create Lambda function
&lt;/h4&gt;

&lt;p&gt;Create an IAM role for the Lambda function, making sure it can invoke the model you picked earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Create IAM role for Lambda function with Bedrock permissions
    const lambdaRole = new iam.Role(this, 'SummarizerLambdaRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      description:
        'Role for content summarizer Lambda function to access Bedrock',
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AWSLambdaBasicExecutionRole'
        ),
      ],
    });

    // Add Bedrock invoke permissions including guardrails
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: ['bedrock:InvokeModel', 'bedrock:ApplyGuardrail'],
        resources: [
          `arn:aws:bedrock:${this.region}::foundation-model/anthropic.claude-3-haiku-20240307-v1:0`,
          guardrail.attrGuardrailArn,
        ],
      })
    );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the Lambda function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Create Lambda function with Python runtime and automatic dependency bundling
    const summarizerFunction = new lambdaPython.PythonFunction(
      this,
      'SummarizerHandler',
      {
        runtime: lambda.Runtime.PYTHON_3_12,
        entry: path.join(__dirname, '../lambda'),
        index: 'summarizer_handler.py',
        handler: 'handler',
        role: lambdaRole,
        timeout: cdk.Duration.seconds(30),
        environment: {
          GUARDRAIL_ID: guardrail.attrGuardrailId,
          GUARDRAIL_VERSION: 'DRAFT',
        },
        description:
          'Lambda function to summarize content using Amazon Bedrock with Guardrails',
      }
    );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our Lambda uses Python and dependencies listed in requirements.txt, you need Docker up and running on your machine. Docker is used by the CDK to bundle the dependencies into a deployment package that Lambda can run. For more details, check out &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-lambda-python-alpha-readme.html" rel="noopener noreferrer"&gt;aws-cdk/aws-lambda-python-alpha&lt;/a&gt; module.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.5 Create API Gateway
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Create API Gateway REST API
    const api = new apigateway.RestApi(this, 'SummarizerApi', {
      restApiName: `${envName}-summarizer-API`,
      description: 'API for content summarization',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: [
          'Content-Type',
          'X-Amz-Date',
          'Authorization',
          'X-Api-Key',
        ],
      },
    });

    // Create Lambda integration
    const summarizerIntegration = new apigateway.LambdaIntegration(
      summarizerFunction
    );

    // Add POST method to the API
    api.root.addMethod('POST', summarizerIntegration);

    // Output the API endpoint URL
    new cdk.CfnOutput(this, 'ApiEndpoint', {
      value: api.url,
      description: 'API Gateway endpoint URL for the content summarizer',
      exportName: 'SummarizerApiEndpoint',
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.6 Lambda function handler
&lt;/h4&gt;

&lt;p&gt;In the lambda folder, create requirements.txt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boto3&amp;gt;=1.28.0
botocore&amp;gt;=1.31.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And summarizer_handler.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import boto3
from botocore.exceptions import ClientError

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    'bedrock-runtime', region_name=os.environ.get('AWS_REGION', 'us-east-1')
)

# System prompt to guide the AI to summarize content
SYSTEM_PROMPT = """You are an expert content summarizer. Your task is to create concise summaries that are significantly shorter than the original text.

Guidelines:
- Reduce the content to at least 20-30% of its original length
- Extract only the main ideas and most important key points
- Remove redundant information and examples
- Use clear, direct language
- Maintain objectivity and accuracy
- Format as a coherent paragraph or bullet points as appropriate

IMPORTANT: Your summary must be substantially shorter than the input. Do not repeat or paraphrase the entire content."""


def detect_content_type(content: str) -&amp;gt; str:
    """Simple content type detection based on length and structure"""
    word_count = len(content.split())

    if word_count &amp;lt; 50:
        return 'Short Text'
    elif word_count &amp;lt; 200:
        return 'Medium Text'
    elif word_count &amp;lt; 500:
        return 'Long Text'
    else:
        return 'Extended Text'


def handler(event, context):
    """Lambda handler function"""
    print(f'Received event: {json.dumps(event)}')

    # Handle CORS preflight
    if event.get('httpMethod') == 'OPTIONS':
        return {
            'statusCode': 200,
            'headers': {
                'Access-Control-Allow-Origin': '*',
                'Access-Control-Allow-Headers': 'Content-Type',
                'Access-Control-Allow-Methods': 'POST, OPTIONS',
            },
            'body': json.dumps({}),
        }

    try:
        # Parse request body
        if isinstance(event.get('body'), str):
            body = json.loads(event['body'])
        else:
            body = event.get('body', {})

        content = body.get('content', '').strip()

        if not content:
            return {
                'statusCode': 400,
                'headers': {
                    'Access-Control-Allow-Origin': '*',
                    'Content-Type': 'application/json',
                },
                'body': json.dumps({'error': 'Content is required'}),
            }

        # Detect content type
        content_type = detect_content_type(content)

        # Construct a concise prompt for Claude
        user_prompt = f"""Summarize the following text in 3-5 clear sentences. Focus only on the main points and key takeaways.

Text:
{content}"""

        # Use Claude 3 Haiku - fast, cost-effective, supports guardrails
        model_id = 'anthropic.claude-3-haiku-20240307-v1:0'

        # Get guardrail configuration from environment
        guardrail_id = os.environ.get('GUARDRAIL_ID')
        guardrail_version = os.environ.get('GUARDRAIL_VERSION', 'DRAFT')

        # Prepare the request body for Claude
        request_body = {
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 300,
            'temperature': 0.3,
            'messages': [
                {
                    'role': 'user',
                    'content': user_prompt,
                },
            ],
        }

        # Invoke the model with guardrails
        invoke_params = {
            'modelId': model_id,
            'contentType': 'application/json',
            'accept': 'application/json',
            'body': json.dumps(request_body),
        }

        # Add guardrail if configured
        if guardrail_id:
            invoke_params['guardrailIdentifier'] = guardrail_id
            invoke_params['guardrailVersion'] = guardrail_version
            print(
                f'Using guardrail: {guardrail_id} version {guardrail_version}'
            )

        response = bedrock_runtime.invoke_model(**invoke_params)

        # Parse the response for Claude model
        response_body = json.loads(response['body'].read())
        print(f'Bedrock response: {json.dumps(response_body)}')

        # Extract summary from Claude response
        if 'content' in response_body and len(response_body['content']) &amp;gt; 0:
            summary = response_body['content'][0]['text'].strip()
        else:
            raise Exception(f'Unexpected response format: {response_body}')

        chat_response = {
            'summary': summary,
            'contentType': content_type,
            'wordCount': len(content.split()),
        }

        return {
            'statusCode': 200,
            'headers': {
                'Access-Control-Allow-Origin': '*',
                'Content-Type': 'application/json',
            },
            'body': json.dumps(chat_response),
        }

    except ClientError as error:
        error_code = error.response.get('Error', {}).get('Code', '')
        error_message = error.response.get('Error', {}).get(
            'Message', str(error)
        )

        print(f'Error processing request: {error}')

        # Handle guardrail intervention
        if (
            error_code == 'ValidationException'
            and 'guardrail' in error_message.lower()
        ):
            return {
                'statusCode': 400,
                'headers': {
                    'Access-Control-Allow-Origin': '*',
                    'Content-Type': 'application/json',
                },
                'body': json.dumps(
                    {
                        'error': 'Content blocked by guardrail',
                        'message': 'Sorry, I cannot process this content as it may contain inappropriate material.',
                    }
                ),
            }

        # Handle Bedrock access errors
        if error_code == 'AccessDeniedException':
            return {
                'statusCode': 403,
                'headers': {
                    'Access-Control-Allow-Origin': '*',
                    'Content-Type': 'application/json',
                },
                'body': json.dumps(
                    {
                        'error': 'Bedrock access denied. Please ensure Bedrock is enabled in your AWS account and the model is available in your region.',
                        'details': error_message,
                    }
                ),
            }

        return {
            'statusCode': 500,
            'headers': {
                'Access-Control-Allow-Origin': '*',
                'Content-Type': 'application/json',
            },
            'body': json.dumps(
                {'error': 'Internal server error', 'details': error_message}
            ),
        }

    except Exception as error:
        import traceback
        error_trace = traceback.format_exc()
        print(f'Error processing request: {error}')
        print(f'Traceback: {error_trace}')

        return {
            'statusCode': 500,
            'headers': {
                'Access-Control-Allow-Origin': '*',
                'Content-Type': 'application/json',
            },
            'body': json.dumps(
                {'error': 'Internal server error', 'details': str(error)}
            ),
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
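&lt;p&gt;Before deploying, you can sanity-check the request parsing and the length-based classification locally. A minimal sketch: the thresholds mirror detect_content_type above, here expressed with bisect.&lt;/p&gt;

```python
import bisect
import json

# Same thresholds as detect_content_type in the handler, expressed as
# sorted cut-offs so bisect picks the matching label by word count.
LABELS = ['Short Text', 'Medium Text', 'Long Text', 'Extended Text']
THRESHOLDS = [50, 200, 500]

def detect_content_type(content):
    """Classify content by word count, mirroring the handler above."""
    return LABELS[bisect.bisect_right(THRESHOLDS, len(content.split()))]

# Simulate the API Gateway proxy event the handler receives
event = {'httpMethod': 'POST',
         'body': json.dumps({'content': 'AWS Bedrock summarizes this text.'})}
body = json.loads(event['body'])
content = body.get('content', '').strip()
print(detect_content_type(content))  # Short Text
```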



&lt;h4&gt;
  
  
  2.7 Update aws-cdk-summarizer.ts
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import * as cdk from 'aws-cdk-lib';
import { getAccountId } from '../lib/utils';
import { AwsCdkSummarizerStack } from '../lib/aws-cdk-summarizer-stack';
import { S3BucketsStack } from '../lib/s3-buckets-stack';
const configFolder = '../config/';
const accountFileName = 'aws_account.yaml';

// Define common tags
const commonTags = {
  createdby: 'KateVu',
  createdvia: 'AWS-CDK',
  repo: 'https://github.com/',
};

// Function to apply tags to a stack
function applyTags(stack: cdk.Stack, tags: Record&amp;lt;string, string&amp;gt;): void {
  Object.entries(tags).forEach(([key, value]) =&amp;gt; {
    cdk.Tags.of(stack).add(key, value);
  });
}

//Set up default value
const envName = process.env.ENVIRONMENT_NAME || 'kate';
const accountName = process.env.ACCOUNT_NAME || 'sandpit2';
const region = process.env.REGION || 'ap-southeast-2';
const aws_account_id = process.env.AWS_ACCOUNT_ID || 'none';

//Get aws account id
let accountId = aws_account_id;
if (aws_account_id == 'none') {
  accountId = getAccountId(accountName, configFolder, accountFileName);
}

const app = new cdk.App();

const bucketNames = [`${envName}-bedrock-summarizer-app`];

const s3BucketsStack = new S3BucketsStack(app, 'S3BucketsStack', {
  stackName: `aws-cdk-summarizer-s3-${envName}`,
  bucketNames: bucketNames,
  enableWebsiteHosting: true,
  websiteIndexDocument: 'index.html',
  websiteErrorDocument: 'error.html',
  env: {
    account: accountId,
    region: region,
  },
});

// Deploy error.html to the website bucket
s3BucketsStack.deployWebsite('./frontend');

const awsCdkSummarizerStack = new AwsCdkSummarizerStack(
  app,
  'AwsCdkSummarizerStack',
  {
    /* If you don't specify 'env', this stack will be environment-agnostic.
     * Account/Region-dependent features and context lookups will not work,
     * but a single synthesized template can be deployed anywhere. */

    /* Uncomment the next line to specialize this stack for the AWS Account
     * and Region that are implied by the current CLI configuration. */
    // env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: process.env.CDK_DEFAULT_REGION },

    /* Uncomment the next line if you know exactly what Account and Region you
     * want to deploy the stack to. */
    stackName: `aws-cdk-summarizer-${envName}`,
    region: region,
    accountId: accountId,
    accountName: accountName,
    envName: envName,
    /* For more information, see https://docs.aws.amazon.com/cdk/latest/guide/environments.html */
  }
);

awsCdkSummarizerStack.addDependency(s3BucketsStack);

// Apply tags to both stacks
applyTags(s3BucketsStack, {
  ...commonTags,
  environment: envName,
});

applyTags(awsCdkSummarizerStack, {
  ...commonTags,
  environment: envName,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.8 Deploy the app
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ensure you have valid credentials for the target AWS account&lt;/li&gt;
&lt;li&gt;Export the environment variables, or the app will use the defaults set in aws-cdk-summarizer.ts&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk deploy --all&lt;/code&gt; to deploy both stacks&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2.9 Update the API endpoint in the index.html file
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Update the API endpoint in the index.html file with the API Gateway endpoint (shown as a stack output during deployment)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      // TODO: Replace with your actual API Gateway endpoint after deployment
      const API_ENDPOINT = 'YOUR_API_ENDPOINT_HERE';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Re-upload index.html to S3 bucket&lt;/li&gt;
&lt;/ul&gt;
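&lt;p&gt;You can also call the API directly instead of going through the website. A standard-library sketch: the endpoint URL is a placeholder for your own ApiEndpoint output, and the actual request is commented out so the snippet runs offline.&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical endpoint - replace with the ApiEndpoint output from 'cdk deploy'
API_ENDPOINT = 'https://abc123.execute-api.ap-southeast-2.amazonaws.com/prod/'

payload = json.dumps({'content': 'Paste the text you want summarized here.'}).encode('utf-8')
request = urllib.request.Request(
    API_ENDPOINT,
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Uncomment once the stack is deployed:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())['summary'])
```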




&lt;h2&gt;
  
  
  Test your app
&lt;/h2&gt;

&lt;p&gt;Go to the S3 bucket and check the Properties tab to get the bucket’s static website endpoint&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuda4vdu8il1oo74nfi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuda4vdu8il1oo74nfi4.png" alt=" " width="800" height="220"&gt;&lt;/a&gt;&lt;br&gt;
Go to the website, enter some text, and check the result&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffubdemmffllupe6nfvve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffubdemmffllupe6nfvve.png" alt=" " width="800" height="1003"&gt;&lt;/a&gt;&lt;br&gt;
You can tweak the configuration and the user_prompt in your Lambda function to fine-tune how the summaries are generated&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Prepare the request body for Claude
        request_body = {
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 300,
            'temperature': 0.3,
            'messages': [
                {
                    'role': 'user',
                    'content': user_prompt,
                },
            ],
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;And that’s it: an AI-powered text summarizer running on Amazon Bedrock, protected by Bedrock Guardrails, and served from an S3 bucket.&lt;br&gt;
For this experiment, Kiro has been a great companion, making development and testing much smoother.&lt;br&gt;
From here you can tweak the prompts to turn it into a general chatbot, try different models, or extend the app to handle multiple languages.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-lambda-python-alpha-readme.html" rel="noopener noreferrer"&gt;aws-cdk/aws-lambda-python-alpha module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html" rel="noopener noreferrer"&gt;Detect and filter harmful content by using Amazon Bedrock Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-supported.html" rel="noopener noreferrer"&gt;Supported Regions and models for Amazon Bedrock Guardrails
InvokeModel — Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html" rel="noopener noreferrer"&gt;InvokeModel — Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>awsbedrock</category>
      <category>bedrockguardrails</category>
      <category>awscdk</category>
    </item>
    <item>
      <title>Develop AWS Glue job interactive sessions locally using Jupyter Notebook</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Sat, 08 Nov 2025 10:12:07 +0000</pubDate>
      <link>https://dev.to/katevu/develop-aws-glue-job-interactive-sessions-locally-using-jupiter-notebook-5086</link>
      <guid>https://dev.to/katevu/develop-aws-glue-job-interactive-sessions-locally-using-jupiter-notebook-5086</guid>
      <description>&lt;p&gt;Building and testing AWS Glue scripts locally is more flexible and can speed up your development. In this blog, I walk through how to develop and test AWS Glue jobs locally using Jupyter Notebook.&lt;br&gt;
For a full pipeline implementation with AWS CDK, refer to &lt;a href="https://medium.com/aws-in-plain-english/automatic-trigger-data-pipeline-with-aws-using-aws-cdk-e5935db69044" rel="noopener noreferrer"&gt;Automatic Trigger Data Pipeline with AWS using AWS CDK&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When developing AWS Glue for Spark jobs, there are several approaches available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue console&lt;/li&gt;
&lt;li&gt;Develop and test scripts locally using Jupyter Notebook or a Docker image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which approach to choose depends on your workflow and team setup. However, developing locally often gives you more flexibility, faster testing, and easier debugging.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. AWS Glue console:
&lt;/h2&gt;

&lt;p&gt;To get started, simply open the AWS Glue console and navigate to ETL jobs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0n01z14aba6k7s8degu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0n01z14aba6k7s8degu.png" alt="AWS Glue Console" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, you can choose from three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual editor:&lt;/strong&gt; if you prefer less code, or even no code, the visual editor is a great option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Script editor:&lt;/strong&gt; the script editor is more straightforward and requires less setup than the AWS Glue Studio notebook, in my opinion. However, when deploying with AWS CDK, I encountered a significant challenge: we ended up manually calculating the asset hash to make sure it was different for each deployment. So if you develop the job via the script editor and use an S3 asset for deployment, make sure to isolate the change to your deployment only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Studio:&lt;/strong&gt; provides an interactive development experience via a built-in notebook interface. You can develop and test the scripts interactively without running the whole job.
When you are satisfied with the scripts, you can save and download them either as .ipynb files or as job scripts. With a single click, you can convert them into a Glue job.
In the next section, we will explore how to develop and test the script using Jupyter Notebook.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Develop and test scripts locally
&lt;/h2&gt;

&lt;p&gt;When developing AWS Glue for Spark, you can develop and test your scripts locally using Jupyter Notebook or Docker before deploying them to AWS.&lt;br&gt;
Below is a step-by-step walkthrough using Jupyter Notebook.&lt;br&gt;
The Jupyter Notebook is an interactive environment for running code in the browser (Andreas Müller and Sarah Guido, Introduction to Machine Learning with Python). For more information, refer to &lt;a href="https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html#what-is-the-jupyter-notebook" rel="noopener noreferrer"&gt;What is the Jupyter Notebook&lt;/a&gt;?&lt;/p&gt;
&lt;h3&gt;
  
  
  Pricing consideration
&lt;/h3&gt;

&lt;p&gt;Before getting started, let’s check the pricing for our sessions.&lt;br&gt;
“AWS Glue Studio Job Notebooks and Interactive Sessions: Suppose you use a notebook in AWS Glue Studio to interactively develop your ETL code. An Interactive Session has 5 DPU by default. The price of 1 DPU-hour is $0.44. If you keep the session running for 24 minutes, you will be billed for 5 DPUs * 0.4 hours * $0.44, or $0.88.” (&lt;a href="https://aws.amazon.com/glue/pricing/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glue/pricing/&lt;/a&gt;)&lt;br&gt;
To avoid unnecessary charges, remember to set a timeout and terminate sessions when you are done testing.&lt;/p&gt;
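&lt;p&gt;The quoted arithmetic is easy to turn into a quick estimator:&lt;/p&gt;

```python
# Estimated interactive session cost: DPUs x hours x price per DPU-hour,
# using the figures from the AWS Glue pricing example quoted above.
def session_cost(dpus, minutes, price_per_dpu_hour=0.44):
    """Return the estimated session cost in USD, rounded to cents."""
    return round(dpus * (minutes / 60) * price_per_dpu_hour, 2)

print(session_cost(5, 24))  # the quoted example: 5 DPUs for 24 minutes = 0.88
```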
&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install Jupyter and the AWS Glue interactive sessions Jupyter kernels&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install - upgrade jupyter boto3 aws-glue-sessionsb
install-glue-kernels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run Jupyter Notebook&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will launch the interface in your browser.&lt;br&gt;
Alternatively, many IDEs (such as VSCode or Cursor) support Jupyter extensions, allowing you to run notebooks and view outputs directly within the IDE.&lt;br&gt;
Once Jupyter is open, choose Glue PySpark as the kernel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05u2ogxqvc02kx9epqyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05u2ogxqvc02kx9epqyl.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure session credentials and region&lt;/strong&gt;
In the first notebook cell, configure the session credentials and region, along with other preferences, using Jupyter magic commands. Refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html" rel="noopener noreferrer"&gt;Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks&lt;/a&gt; for more details
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set region and assume provided role

%idle_timeout 20
%region ap-southeast-2
%iam_role &amp;lt;Replace with GlueIngestionGlueJobRoleARN&amp;gt;
# %additional_python_modules ipython
# %extra_py_files 
# %glue_version 5.0
# %worker_type G.1X
# %number_of_workers 5

# Verify identity
import json, boto3
print(json.dumps(boto3.client("sts").get_caller_identity(), indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input arguments&lt;/strong&gt;
For AWS Glue ETL jobs, arguments usually come from getResolvedOptions. You can simulate this locally by defining parameters as below
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Parameters
# Adjust these for your local run; in Glue these come from getResolvedOptions
params = {
    "env_name": "kate",
    "input_bucket": "source-data",
    "output_bucket": "staging-data",
    "error_bucket": "error-staging-data",
    "file_path": "kate/default",
    # Single file for local test from your config
    "file_names": "2023_yellow_taxi_trip_data_light.csv",
    "sns_topic_arn": &amp;lt;sns arn&amp;gt;,
    "JOB_NAME": "local_ingestion_job",
}

params
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
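&lt;p&gt;When the script later runs as a real Glue job, the same values come from getResolvedOptions. One way to keep a single code path is a small fallback helper; this is a sketch with illustrative names, relying on the fact that the awsglue module only exists inside the Glue runtime.&lt;/p&gt;

```python
import sys

def get_params(defaults):
    """Use getResolvedOptions inside Glue; fall back to local defaults elsewhere."""
    try:
        # awsglue is only importable in the Glue runtime, not locally
        from awsglue.utils import getResolvedOptions
        return getResolvedOptions(sys.argv, list(defaults))
    except ImportError:
        return dict(defaults)

params = get_params({'env_name': 'kate', 'JOB_NAME': 'local_ingestion_job'})
print(params['JOB_NAME'])
```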


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Include extra .py files&lt;/strong&gt;
Upload them to an S3 bucket and reference them using the &lt;code&gt;%extra_py_files&lt;/code&gt; magic command.
However, if you are actively working on these files, frequent updates can be tedious: you have to re-upload them every time you change a function.
To make my life a bit easier, I include these modules directly within notebook cells and comment out their import lines in the main script.
For example, include modules for logging, configuration, and utility functions.
&lt;strong&gt;&lt;em&gt;Logging module&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
Centralized logging configuration for AWS Data Pipeline
"""

import logging
import logging.config
from typing import Optional, Dict, Any
import json


def setup_logging(
    level: str = "INFO",
    log_format: Optional[str] = None,
    log_file: Optional[str] = None,
    environment: str = "dev",
) -&amp;gt; logging.Logger:
    """
    Set up centralized logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        log_format: Custom log format string
        log_file: Path to log file (optional)
        environment: Environment name (dev, staging, prod)

    Returns:
        Configured logger instance
    """

    # Default format with structured information
    if log_format is None:
        log_format = (
            "%(asctime)s | %(levelname)-8s | %(name)-20s | "
            "%(funcName)-15s:%(lineno)-4d | %(message)s"
        )

    # Create handlers
    handlers = {
        "stdout": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
        "stderr": {
            "class": "logging.StreamHandler",
            "level": "ERROR",
            "formatter": "simple",
            "stream": "ext://sys.stderr",
        },
    }

    # Add file handler if specified
    if log_file:
        handlers["file"] = {
            "class": "logging.handlers.RotatingFileHandler",
            "level": level,
            "formatter": "detailed",
            "filename": log_file,
            "maxBytes": 10485760,  # 10MB
            "backupCount": 5,
            "encoding": "utf8",
        }

    # Logging configuration dictionary
    logging_config = {
        "version": 1,
        "disable_existing_loggers": False,  # Important to keep this False
        "formatters": {
            "detailed": {"format": log_format, "datefmt": "%Y-%m-%d %H:%M:%S"},
            "simple": {"format": "%(name)-20s - %(asctime)s - %(levelname)s - %(message)s"},
        },
        "handlers": handlers,
        "loggers": {
            # Root logger - Parent of all loggers
            "": {
                "level": level,
                "handlers": list(handlers.keys()),
            },
            # Base application logger
            "glue": {
                "level": level,
                "propagate": True,  # Allow propagation to root
            },
            # AWS SDK loggers (reduce noise)
            "boto3": {"level": "WARNING"},
            "botocore": {"level": "WARNING"},
            "urllib3": {"level": "WARNING"},
            # PySpark loggers
            "pyspark": {"level": "WARNING"},
            "py4j": {"level": "WARNING"},
        },
    }

    # Apply configuration
    logging.config.dictConfig(logging_config)

    # Get logger for the calling module
    logger = logging.getLogger(__name__)

    # Log configuration info
    logger.info(f"Logging configured - Level: {level}, Environment: {environment}")

    if log_file:
        logger.info(f"Log file: {log_file}")

    return logger


def get_logger(name: str) -&amp;gt; logging.Logger:
    """
    Get a logger instance for a specific module.

    Args:
        name: Logger name (usually __name__)

    Returns:
        Logger instance
    """
    return logging.getLogger(name)


def log_function_call(logger: logging.Logger, func_name: str, **kwargs):
    """
    Log function entry with parameters.

    Args:
        logger: Logger instance
        func_name: Function name
        **kwargs: Function parameters to log
    """
    params = {
        k: v for k, v in kwargs.items() if k not in ["password", "secret", "token"]
    }
    logger.debug(f"Entering {func_name} with params: {params}")


def log_performance(logger: logging.Logger, operation: str, duration: float, **metrics):
    """
    Log performance metrics.

    Args:
        logger: Logger instance
        operation: Operation name
        duration: Duration in seconds
        **metrics: Additional metrics to log
    """
    logger.info(f"Performance - {operation}: {duration:.3f}s")
    if metrics:
        for key, value in metrics.items():
            logger.info(f"  {key}: {value}")


def log_dataframe_info(logger: logging.Logger, df_name: str, df):
    """
    Log DataFrame information for debugging.

    Args:
        logger: Logger instance
        df_name: DataFrame name/description
        df: PySpark DataFrame
    """
    try:
        row_count = df.count()
        col_count = len(df.columns)
        logger.info(
            f"DataFrame '{df_name}' - Rows: {row_count:,}, Columns: {col_count}"
        )
        logger.debug(f"DataFrame '{df_name}' columns: {df.columns}")
    except Exception as e:
        logger.warning(f"Could not get DataFrame info for '{df_name}': {e}")


def log_s3_operation(
    logger: logging.Logger, operation: str, bucket: str, key: str, **kwargs
):
    """
    Log S3 operations with consistent format.

    Args:
        logger: Logger instance
        operation: S3 operation (read, write, delete, etc.)
        bucket: S3 bucket name
        key: S3 object key
        **kwargs: Additional operation details
    """
    details = f" | {kwargs}" if kwargs else ""
    logger.info(f"S3 {operation.upper()} - s3://{bucket}/{key}{details}")


def log_error_with_context(
    logger: logging.Logger, error: Exception, context: Dict[str, Any]
):
    """
    Log errors with additional context information.

    Args:
        logger: Logger instance
        error: Exception that occurred
        context: Additional context information
    """
    logger.error(f"Error: {str(error)}")
    logger.error(f"Context: {json.dumps(context, indent=2, default=str)}")
    logger.exception("Full traceback:")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
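&lt;p&gt;As a quick illustration of the helpers above, &lt;code&gt;log_function_call&lt;/code&gt; drops common secret-like parameter names before logging anything. A standalone sketch (the &lt;code&gt;return&lt;/code&gt; is added here purely to make the filtering easy to check; the module version returns nothing):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")

def log_function_call(logger, func_name, **kwargs):
    # Same filtering idea as the logging module above: never log credentials.
    params = {k: v for k, v in kwargs.items() if k not in ["password", "secret", "token"]}
    logger.debug(f"Entering {func_name} with params: {params}")
    return params  # returned only so this sketch is easy to verify

shown = log_function_call(logging.getLogger("glue.demo"), "connect",
                          host="db.internal", password="hunter2")
print(shown)  # {'host': 'db.internal'}
```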


&lt;p&gt;Initialise the logger in a single cell and use it throughout your notebook. Additionally, give the logger some extra love so we can avoid &lt;code&gt;ValueError: I/O operation on closed file&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# # Initialize logging and config
setup_logging(level="INFO", environment=params["env_name"])
logger = get_logger('glue.notebook')
# # Give logger some extra love so Jupyter Notebook when working with Pyspark will not hit `ValueError: I/O Operation on Closed File`
#https://stackoverflow.com/questions/31599940/how-to-print-current-logging-configuration-used-by-the-python-logging-module
root = logging.getLogger()
root.handlers[0].stream.write = print
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
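&lt;p&gt;Why keep &lt;code&gt;disable_existing_loggers&lt;/code&gt; set to &lt;code&gt;False&lt;/code&gt; and let the &lt;code&gt;glue&lt;/code&gt; logger propagate? Module-level loggers then need no handlers of their own; their records bubble up to the root handler. A tiny standalone sketch (handler and logger names are illustrative):&lt;/p&gt;

```python
import logging
import logging.config

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"simple": {"format": "%(name)s - %(levelname)s - %(message)s"}},
    "handlers": {"stdout": {"class": "logging.StreamHandler", "formatter": "simple"}},
    "loggers": {
        "": {"level": "INFO", "handlers": ["stdout"]},       # root logger
        "glue": {"level": "INFO", "propagate": True},        # app namespace
    },
})

child = logging.getLogger("glue.notebook")
# The child defines no handlers of its own; records propagate to the root.
print(child.handlers == [])
print(child.getEffectiveLevel() == logging.INFO)
```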



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Config module&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Dict, Any

@dataclass
class JobConfig:
    """Configuration class for AWS Glue ETL jobs.

    This class centralizes all configuration parameters needed for Glue jobs,
    making it easier to manage and modify job settings.

    Attributes:
        env_name (str): Environment name (e.g., 'dev', 'prod')
        input_bucket (str): S3 bucket for input data
        output_bucket (str): S3 bucket for processed data
        error_bucket (str): S3 bucket for error files
        file_path (str): Base path in S3 buckets
        sns_topic_arn (str): ARN of SNS topic for notifications
        job_name (str): Name of the Glue job
        correlation_id (Optional[str]): Unique ID for tracing requests
        current_time (datetime): Timestamp for job execution
        spark_configs (Dict[str, Any]): Optional Spark configurations
    """

    # Required parameters
    env_name: str
    input_bucket: str
    output_bucket: str
    error_bucket: str
    file_path: str
    sns_topic_arn: str
    job_name: str

    # Optional parameters with defaults
    correlation_id: Optional[str] = None
    current_time: Optional[datetime] = None
    spark_configs: Optional[Dict[str, Any]] = None

    def __post_init__(self):
        """Validate configuration and fill in defaults after initialization."""
        if not all([self.env_name, self.input_bucket, self.output_bucket, 
                   self.error_bucket, self.sns_topic_arn]):
            raise ValueError("Missing required configuration parameters")

        # Capture the run timestamp per instance; a class-level
        # `datetime.utcnow()` default would be evaluated only once at import
        # time and shared by every instance.
        if self.current_time is None:
            self.current_time = datetime.utcnow()

        # Set default Spark configurations if none provided
        if self.spark_configs is None:
            self.spark_configs = {
                "spark.sql.adaptive.enabled": "true",
                "spark.sql.adaptive.coalescePartitions.enabled": "true",
                "spark.sql.shuffle.partitions": "200",
                # "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
                "spark.sql.broadcastTimeout": "600"
            }

    def get_s3_paths(self, file_name: str) -&amp;gt; Dict[str, str]:
        """Generate S3 paths for a given file.

        Args:
            file_name (str): Name of the file being processed

        Returns:
            Dict[str, str]: Dictionary containing input, output, and error paths
        """
        return {
            "input": f"s3://{self.input_bucket}/{self.env_name}/staging_{file_name}/",
            "output": f"s3://{self.output_bucket}/{self.env_name}/transform_{file_name}/",
            "error": f"s3://{self.error_bucket}/{self.env_name}/error_{file_name}"
        }

    def generate_correlation_id(self, file_name: str) -&amp;gt; str:
        """Generate a unique correlation ID for request tracing.

        Args:
            file_name (str): Name of the file being processed

        Returns:
            str: Unique correlation ID
        """
        if not self.correlation_id:
            self.correlation_id = f"{self.env_name}-{file_name}-{int(datetime.utcnow().timestamp())}"
        return self.correlation_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
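&lt;p&gt;To show how the config is consumed, here is a trimmed-down sketch of &lt;code&gt;JobConfig&lt;/code&gt;, reduced to just the fields that path generation needs (the bucket names are made up):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class JobConfig:
    # Trimmed to the fields needed for S3 path generation
    env_name: str
    input_bucket: str
    output_bucket: str
    error_bucket: str

    def get_s3_paths(self, file_name):
        # Same path layout as the full config module above
        return {
            "input": f"s3://{self.input_bucket}/{self.env_name}/staging_{file_name}/",
            "output": f"s3://{self.output_bucket}/{self.env_name}/transform_{file_name}/",
            "error": f"s3://{self.error_bucket}/{self.env_name}/error_{file_name}",
        }

config = JobConfig("dev", "in-bucket", "out-bucket", "err-bucket")
print(config.get_s3_paths("trips")["input"])  # s3://in-bucket/dev/staging_trips/
```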



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Utils module&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import json
import time
from datetime import datetime
from pyspark.sql.functions import col
from pyspark.sql.types import NumericType
from botocore.exceptions import ClientError
import boto3
from tenacity import retry, stop_after_attempt, wait_exponential
from pyspark.sql import SparkSession

# from config import JobConfig
# from logging_config import (
#     get_logger,
#     log_function_call,
#     log_performance,
#     log_s3_operation,
#     log_error_with_context,
#     log_dataframe_info,
# )

# # Get module-specific logger
# logger = get_logger("glue.utils")


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def move_to_error_bucket(s3_client, source: dict, destination: dict):
    """Move file to error bucket with retry mechanism."""

    s3_client.copy_object(
        Bucket=destination["bucket"],
        CopySource={"Bucket": source["bucket"], "Key": source["key"]},
        Key=destination["key"],
    )


def notify_sns(sns_client, topic_arn, message):
    """Send a notification to an SNS topic."""
    log_function_call(logger, "notify_sns", topic_arn=topic_arn)
    try:
        sns_client.publish(TopicArn=topic_arn, Message=message)
        logger.info(
            "SNS notification sent successfully",
            extra={"topic_arn": topic_arn, "operation": "sns_publish"},
        )
    except ClientError as e:
        log_error_with_context(
            logger,
            e,
            {
                "operation": "sns_publish",
                "topic_arn": topic_arn,
                "message_length": len(message),
            },
        )
        sys.exit(1)


def check_files_exist(s3_client, bucket, env_name, file_path, file_names):
    """Check if files exist in S3 bucket."""

    log_function_call(
        logger,
        "check_files_exist",
        bucket=bucket,
        env_name=env_name,
        file_path=file_path,
        file_count=len(file_names),
    )

    for file_name in file_names:
        try:
            s3_path = f"{file_path}/{file_name}"
            s3_client.head_object(Bucket=bucket, Key=s3_path)
            log_s3_operation(logger, "check", bucket, s3_path, exists=True)
        except s3_client.exceptions.ClientError as e:
            log_error_with_context(
                logger,
                e,
                {
                    "operation": "check_file_exists",
                    "bucket": bucket,
                    "file_path": s3_path,
                    "env_name": env_name,
                },
            )
            sys.exit(1)


def delete_directory_in_s3(s3_client, bucket, directory_path):
    """Delete all files in S3 directory."""
    log_function_call(
        logger, "delete_directory_in_s3", bucket=bucket, directory_path=directory_path
    )

    try:
        objects = s3_client.list_objects_v2(Bucket=bucket, Prefix=directory_path)
        if "Contents" in objects:
            deleted_count = 0
            for obj in objects["Contents"]:
                s3_client.delete_object(Bucket=bucket, Key=obj["Key"])
                log_s3_operation(logger, "delete", bucket, obj["Key"])
                deleted_count += 1

            logger.info(
                "Directory cleanup completed",
                extra={
                    "bucket": bucket,
                    "directory": directory_path,
                    "files_deleted": deleted_count,
                },
            )
        else:
            logger.info(
                "No files to delete",
                extra={"bucket": bucket, "directory": directory_path},
            )
    except Exception as e:
        log_error_with_context(
            logger,
            e,
            {
                "operation": "delete_directory",
                "bucket": bucket,
                "directory_path": directory_path,
            },
        )
        sys.exit(1)


def save_quality_report(s3_client, bucket, env_name, file_name, quality_report):
    """Save data quality report to S3."""
    log_function_call(
        logger,
        "save_quality_report",
        bucket=bucket,
        env_name=env_name,
        file_name=file_name,
    )

    try:
        report_key = f"{env_name}/quality_reports/quality_report_{file_name}_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.json"

        s3_client.put_object(
            Bucket=bucket,
            Key=report_key,
            Body=json.dumps(quality_report, indent=2),
            ContentType="application/json",
        )

        log_s3_operation(
            logger,
            "write",
            bucket,
            report_key,
            size=len(json.dumps(quality_report)),
            content_type="application/json",
        )
    except Exception as e:
        log_error_with_context(
            logger,
            e,
            {
                "operation": "save_quality_report",
                "bucket": bucket,
                "env_name": env_name,
                "file_name": file_name,
            },
        )


def validate_dataframe(df, validation_rules: list[dict]) -&amp;gt; tuple[bool, list[str]]:
    """Validate DataFrame against defined rules."""
    errors = []

    for rule in validation_rules:
        if rule["type"] == "not_empty" and df.isEmpty():
            errors.append("DataFrame is empty")
        elif rule["type"] == "required_columns":
            missing = [col for col in rule["columns"] if col not in df.columns]
            if missing:
                errors.append(f"Missing required columns: {missing}")

    return len(errors) == 0, errors


def get_aws_clients():
    """Create and return AWS clients."""
    return {
        "s3": boto3.client("s3"),
        "sns": boto3.client("sns"),
    }


def optimize_spark_session(spark: SparkSession, config: JobConfig) -&amp;gt; SparkSession:
    """Apply Spark optimizations based on config."""
    for key, value in config.spark_configs.items():
        spark.conf.set(key, value)

    # Set dynamic partition pruning
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

    return spark


def check_dependencies(config: JobConfig) -&amp;gt; bool:
    """Verify all required services and resources are available."""
    try:
        s3 = boto3.client("s3")
        sns = boto3.client("sns")

        # Check S3 buckets
        for bucket in [config.input_bucket, config.output_bucket, config.error_bucket]:
            s3.head_bucket(Bucket=bucket)

        # Check SNS topic
        sns.get_topic_attributes(TopicArn=config.sns_topic_arn)

        return True
    except Exception as e:
        logger.error(f"Dependency check failed: {str(e)}")
        return False


def perform_data_quality_checks(df, file_name):
    """Perform data quality checks on DataFrame."""
    log_function_call(logger, "perform_data_quality_checks", file_name=file_name)
    start_time = time.time()

    # Log initial DataFrame info
    log_dataframe_info(logger, f"input_dataframe_{file_name}", df)

    quality_report = {
        "file_name": file_name,
        "total_rows": df.count(),
        "total_columns": len(df.columns),
        "quality_issues": [],
        "quality_score": 100.0,
        "is_valid": True,
    }

    logger.info(f"Starting data quality checks for {file_name}")

    # Check 1: Empty dataset
    if quality_report["total_rows"] == 0:
        quality_report["quality_issues"].append("Dataset is empty")
        quality_report["quality_score"] -= 50
        quality_report["is_valid"] = False
        return quality_report

    # Check 2: Null value analysis
    null_counts = {}
    for column in df.columns:
        null_count = df.filter(col(column).isNull()).count()
        null_percentage = (null_count / quality_report["total_rows"]) * 100
        null_counts[column] = {
            "count": null_count,
            "percentage": round(null_percentage, 2),
        }

        # Flag columns with high null percentage
        if null_percentage &amp;gt; 50:
            quality_report["quality_issues"].append(
                f"Column '{column}' has {null_percentage:.2f}% null values"
            )
            quality_report["quality_score"] -= 10
        elif null_percentage &amp;gt; 25:
            quality_report["quality_issues"].append(
                f"Column '{column}' has {null_percentage:.2f}% null values (warning)"
            )
            quality_report["quality_score"] -= 5

    quality_report["null_analysis"] = null_counts

    # Check 3: Duplicate rows
    duplicate_count = quality_report["total_rows"] - df.dropDuplicates().count()
    if duplicate_count &amp;gt; 0:
        duplicate_percentage = (duplicate_count / quality_report["total_rows"]) * 100
        quality_report["quality_issues"].append(
            f"Found {duplicate_count} duplicate rows ({duplicate_percentage:.2f}%)"
        )
        quality_report["duplicate_count"] = duplicate_count
        if duplicate_percentage &amp;gt; 10:
            quality_report["quality_score"] -= 15
        else:
            quality_report["quality_score"] -= 5

    # Check 4: Data type consistency
    schema_issues = []
    for field in df.schema.fields:
        column_name = field.name
        expected_type = field.dataType

        # Check for mixed data types in string columns
        if str(expected_type) == "StringType":
            # Check if column contains only numeric values (might need to be numeric)
            try:
                numeric_count = df.filter(
                    col(column_name).rlike("^[0-9]+\\.?[0-9]*$")
                ).count()
                if numeric_count == quality_report["total_rows"]:
                    schema_issues.append(
                        f"Column '{column_name}' contains only numeric values but is string type"
                    )
            except Exception:
                pass

    if schema_issues:
        quality_report["quality_issues"].extend(schema_issues)
        quality_report["quality_score"] -= len(schema_issues) * 3

    # Check 5: Column name validation
    column_issues = []
    for column in df.columns:
        # Check for spaces in column names
        if " " in column:
            column_issues.append(f"Column '{column}' contains spaces")
        # Check for special characters
        if not column.replace("_", "").replace("-", "").isalnum():
            column_issues.append(f"Column '{column}' contains special characters")

    if column_issues:
        quality_report["quality_issues"].extend(column_issues)
        quality_report["quality_score"] -= len(column_issues) * 2

    # Check 6: Numeric column validation
    numeric_columns = [
        f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)
    ]
    for column in numeric_columns:
        try:
            # Check for negative values where they might not be expected
            negative_count = df.filter(col(column) &amp;lt; 0).count()
            if negative_count &amp;gt; 0:
                quality_report["quality_issues"].append(
                    f"Column '{column}' has {negative_count} negative values"
                )

            # Check for extreme outliers (values beyond 3 standard deviations)
            stats = df.select(column).describe().collect()
            if len(stats) &amp;gt;= 3:  # mean, stddev available
                try:
                    mean_val = float(stats[1][1])  # mean
                    stddev_val = float(stats[2][1])  # stddev
                    if stddev_val &amp;gt; 0:
                        outlier_count = df.filter(
                            (col(column) &amp;gt; mean_val + 3 * stddev_val)
                            | (col(column) &amp;lt; mean_val - 3 * stddev_val)
                        ).count()
                        if outlier_count &amp;gt; 0:
                            outlier_percentage = (
                                outlier_count / quality_report["total_rows"]
                            ) * 100
                            if outlier_percentage &amp;gt; 5:
                                quality_report["quality_issues"].append(
                                    f"Column '{column}' has {outlier_count} potential outliers ({outlier_percentage:.2f}%)"
                                )
                except (ValueError, IndexError):
                    pass
        except Exception as e:
            logger.warning(
                f"Could not perform numeric validation on column '{column}': {e}"
            )

    # Final quality assessment
    if quality_report["quality_score"] &amp;lt; 70:
        quality_report["is_valid"] = False
    elif quality_report["quality_score"] &amp;lt; 85:
        quality_report["quality_issues"].append(
            "Data quality is below recommended threshold"
        )

    quality_report["quality_score"] = max(0, quality_report["quality_score"])

    logger.info(f"Data quality check completed for {file_name}")
    logger.info(f"Quality Score: {quality_report['quality_score']:.1f}/100")
    logger.info(f"Issues Found: {len(quality_report['quality_issues'])}")

    # Log performance metrics
    duration = time.time() - start_time
    log_performance(
        logger,
        "data_quality_check",
        duration,
        file_name=file_name,
        rows_processed=quality_report["total_rows"],
        issues_found=len(quality_report["quality_issues"]),
        quality_score=quality_report["quality_score"],
    )

    return quality_report


def track_job_metrics(metrics: dict) -&amp;gt; None:
    """Track job metrics for monitoring.

    Args:
        metrics (dict): Dictionary containing job metrics including:
            - duration: Total job duration
            - processed: Number of files processed
            - success_rate: Processing success rate
            - total_rows: Total number of rows processed
            - spark_metrics (optional): Spark-specific metrics
    """
    metric_data = {
        "processing_time": metrics.get("duration"),
        "files_processed": metrics.get("processed"),
        "success_rate": metrics.get("success_rate"),
        "total_rows": metrics.get("total_rows"),
    }

    logger.info("Job metrics", extra={"metrics": metric_data}) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
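&lt;p&gt;Because &lt;code&gt;validate_dataframe&lt;/code&gt; only touches &lt;code&gt;isEmpty()&lt;/code&gt; and &lt;code&gt;columns&lt;/code&gt;, you can sanity-check your validation rules locally without a Spark session by passing a tiny stand-in object (hypothetical, for illustration only):&lt;/p&gt;

```python
class FakeDataFrame:
    """Minimal stand-in exposing the two members validate_dataframe touches."""
    def __init__(self, columns, rows):
        self.columns = columns
        self._rows = rows

    def isEmpty(self):
        return len(self._rows) == 0

def validate_dataframe(df, validation_rules):
    """Same logic as the utils module above."""
    errors = []
    for rule in validation_rules:
        if rule["type"] == "not_empty" and df.isEmpty():
            errors.append("DataFrame is empty")
        elif rule["type"] == "required_columns":
            missing = [c for c in rule["columns"] if c not in df.columns]
            if missing:
                errors.append(f"Missing required columns: {missing}")
    return len(errors) == 0, errors

df = FakeDataFrame(["VendorID", "fare_amount"], rows=[(1, 10.0)])
ok, errs = validate_dataframe(df, [
    {"type": "not_empty"},
    {"type": "required_columns", "columns": ["VendorID", "tip_amount"]},
])
print(ok, errs)  # False ["Missing required columns: ['tip_amount']"]
```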



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example execution log output&lt;/strong&gt;
When running the job, your output might look like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glue.notebook        - 2025-11-07 11:37:25,573 - INFO - S3 CHECK - s3://***/2023_yellow_taxi_trip_data_light.csv | {'exists': True}

glue.notebook        - 2025-11-07 11:37:25,623 - INFO - No files to delete

glue.notebook        - 2025-11-07 11:37:25,668 - INFO - No files to delete

glue.notebook        - 2025-11-07 11:37:25,668 - INFO - Processing file 1/1: 2023_yellow_taxi_trip_data_light.csv

glue.notebook        - 2025-11-07 11:37:41,147 - INFO - DataFrame columns: ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee']

glue.notebook        - 2025-11-07 11:37:49,241 - INFO - Performance - process_file_2023_yellow_taxi_trip_data_light.csv: 23.257s

glue.notebook        - 2025-11-07 11:37:49,242 - INFO -   correlation_id: kate-2023_yellow_taxi_trip_data_light.csv-1762515445

glue.notebook        - 2025-11-07 11:37:49,242 - INFO -   rows_processed: 4

glue.notebook        - 2025-11-07 11:37:49,242 - INFO -   read_duration: 15.443103075027466

glue.notebook        - 2025-11-07 11:37:49,242 - INFO -   write_duration: 6.300567388534546

glue.notebook        - 2025-11-07 11:37:49,242 - INFO - Job completed successfully - Processed 1/1 files in 25.20s

glue.notebook        - 2025-11-07 11:37:49,242 - INFO - Job metrics

glue.notebook        - 2025-11-07 11:37:49,330 - INFO - Spark session stopped
%stop_session
Stopping session: ccc5b160-7f7e-4c93-ab19-1eb0966a796e
Stopped session.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At the end of your notebook, include the following command to stop the session and avoid unnecessary costs.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%stop_session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the very first cell, configure an idle timeout so the session terminates after a period of inactivity, for example &lt;code&gt;%idle_timeout 20&lt;/code&gt;. This setting ensures your session is terminated after 20 minutes of inactivity, protecting you from unexpected billing.&lt;/p&gt;
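&lt;p&gt;A typical first cell can combine several session magics. The values below are only examples; the full list is in the interactive sessions magics documentation linked in the references:&lt;/p&gt;

```
%idle_timeout 20
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
```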

&lt;h3&gt;
  
  
  Thoughts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Convenient for testing and debugging: you can rerun only the modified cells instead of the entire job.&lt;/li&gt;
&lt;li&gt;Supports developing and testing locally.&lt;/li&gt;
&lt;li&gt;Managing extra .py files can be troublesome, especially if you update them frequently.&lt;/li&gt;
&lt;li&gt;Additional setup might be required, for example ETL job arguments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html" rel="noopener noreferrer"&gt;Developing and testing AWS Glue job scripts locally&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html" rel="noopener noreferrer"&gt;Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/aws-in-plain-english/automatic-trigger-data-pipeline-with-aws-using-aws-cdk-e5935db69044" rel="noopener noreferrer"&gt;Automatic Trigger Data Pipeline with AWS using AWS CDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_s3_assets/Asset.html" rel="noopener noreferrer"&gt;Asset — AWS Cloud Development Kit 2.221.1 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html#what-is-the-jupyter-notebook" rel="noopener noreferrer"&gt;What is the Jupyter Notebook?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>awsglue</category>
      <category>jupiternotebook</category>
    </item>
    <item>
      <title>Centralised Logging for AWS Glue Jobs with Python</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Mon, 27 Oct 2025 11:13:48 +0000</pubDate>
      <link>https://dev.to/katevu/centralised-logging-for-aws-glue-jobs-with-python-353b</link>
      <guid>https://dev.to/katevu/centralised-logging-for-aws-glue-jobs-with-python-353b</guid>
<description>&lt;p&gt;When I was working with AWS Glue jobs, I ran into a frustrating problem. I had a few Glue jobs and a utils module, and I’d set up logging in every module, so the same code appeared in multiple places. Things got even worse when I needed to update the logger, because I had to change the same logic everywhere. By then things had turned bitter, and I was angrily copying and pasting the same code all over.&lt;br&gt;
That’s when I realized I needed a centralized logging setup that I could reuse across all my Glue jobs. Centralized logging keeps things consistent, secure, easier to maintain, and makes debugging a lot simpler.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Why Centralize Logging?&lt;/strong&gt;&lt;br&gt;
Centralized logging brings several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Consistency: All jobs share the same log format, levels, and filtering — no surprises when you check different logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintainability: Update once, and every job gets the change automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Structured Context: You can define a clear structure for your log messages and build a reusable log function to capture the details you need. In my case, I needed specific info for S3 operations — like bucket name, key, and action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility: With dictionary-based configuration, updating formats or handlers later is super simple.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Traceability: Module-level loggers give you hierarchical control while still sharing the same handlers and formatters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: You can set log levels differently for each environment. For example, log only INFO in production but allow DEBUG in dev. That way, sensitive data stays out of your production logs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
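&lt;p&gt;For the security point, the &lt;code&gt;environment&lt;/code&gt; parameter makes it easy to pick the level per environment before calling &lt;code&gt;setup_logging&lt;/code&gt;. A minimal sketch (the mapping below is an example, not part of the module):&lt;/p&gt;

```python
def level_for(environment):
    # DEBUG only in dev; INFO elsewhere keeps sensitive details out of prod logs
    return "DEBUG" if environment == "dev" else "INFO"

# setup_logging(level=level_for("prod"), environment="prod")
print(level_for("dev"), level_for("prod"))  # DEBUG INFO
```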




&lt;p&gt;&lt;strong&gt;Setting Up the Centralized Logging Configuration&lt;/strong&gt;&lt;br&gt;
Here’s how I structured my logging_config module:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define formatters, handlers, and loggers in setup_logging()
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def setup_logging(
    level: str = "INFO",
    log_format: Optional[str] = None,
    log_file: Optional[str] = None,
    environment: str = "dev",
) -&amp;gt; logging.Logger:
    """
    Set up centralized logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        log_format: Custom log format string
        log_file: Path to log file (optional)
        environment: Environment name (dev, staging, prod)

    Returns:
        Configured logger instance
    """

    # Default format with structured information
    if log_format is None:
        log_format = (
            "%(asctime)s | %(levelname)-8s | %(name)-20s | "
            "%(funcName)-15s:%(lineno)-4d | %(message)s"
        )

    # Create handlers
    handlers = {
        "stdout": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
        "stderr": {
            "class": "logging.StreamHandler",
            "level": "ERROR",
            "formatter": "simple",
            "stream": "ext://sys.stderr",
        },
    }

    # Add file handler if specified
    if log_file:
        handlers["file"] = {
            "class": "logging.handlers.RotatingFileHandler",
            "level": level,
            "formatter": "detailed",
            "filename": log_file,
            "maxBytes": 10485760,  # 10MB
            "backupCount": 5,
            "encoding": "utf8",
        }

    # Logging configuration dictionary
    logging_config = {
        "version": 1,
        "disable_existing_loggers": False,  # Important to keep this False
        "formatters": {
            "detailed": {"format": log_format, "datefmt": "%Y-%m-%d %H:%M:%S"},
            "simple": {"format": "%(name)-20s - %(levelname)s - %(message)s"},
        },
        "handlers": handlers,
        "loggers": {
            # Root logger - Parent of all loggers
            "": {
                "level": level,
                "handlers": list(handlers.keys()),
            },
            # AWS SDK loggers (reduce noise)
            "boto3": {"level": "WARNING"},
            "botocore": {"level": "WARNING"},
            "urllib3": {"level": "WARNING"},
            # PySpark loggers
            "pyspark": {"level": "WARNING"},
            "py4j": {"level": "WARNING"},
        },
    }

    # Apply configuration
    logging.config.dictConfig(logging_config)

    # Get logger for the calling module
    logger = logging.getLogger(__name__)

    # Log configuration info
    logger.info(f"Logging configured - Level: {level}, Environment: {environment}")
    if log_file:
        logger.info(f"Log file: {log_file}")

    return logger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
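The usage examples further down call a `get_logger` helper that isn’t shown here. A minimal version (assuming it simply wraps `logging.getLogger`, so module loggers inherit the root handlers configured by `setup_logging()`) could look like:

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Return a named, module-level logger.

    The logger carries no handlers of its own; records propagate up to
    the root logger configured by setup_logging(), so every module shares
    the same handlers and formatters while keeping its own name.
    """
    return logging.getLogger(name)
```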



&lt;ul&gt;
&lt;li&gt;Define purpose specific logging functions
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def log_function_call(logger: logging.Logger, func_name: str, **kwargs):
    """
    Log function entry with parameters.

    Args:
        logger: Logger instance
        func_name: Function name
        **kwargs: Function parameters to log
    """
    params = {
        k: v for k, v in kwargs.items() if k not in ["password", "secret", "token"]
    }
    logger.debug(f"Entering {func_name} with params: {params}")


def log_performance(logger: logging.Logger, operation: str, duration: float, **metrics):
    """
    Log performance metrics.

    Args:
        logger: Logger instance
        operation: Operation name
        duration: Duration in seconds
        **metrics: Additional metrics to log
    """
    logger.info(f"Performance - {operation}: {duration:.3f}s")
    if metrics:
        for key, value in metrics.items():
            logger.info(f"  {key}: {value}")


def log_dataframe_info(logger: logging.Logger, df_name: str, df):
    """
    Log DataFrame information for debugging.

    Args:
        logger: Logger instance
        df_name: DataFrame name/description
        df: PySpark DataFrame
    """
    try:
        row_count = df.count()
        col_count = len(df.columns)
        logger.info(
            f"DataFrame '{df_name}' - Rows: {row_count:,}, Columns: {col_count}"
        )
        logger.debug(f"DataFrame '{df_name}' columns: {df.columns}")
    except Exception as e:
        logger.warning(f"Could not get DataFrame info for '{df_name}': {e}")


def log_s3_operation(
    logger: logging.Logger, operation: str, bucket: str, key: str, **kwargs
):
    """
    Log S3 operations with consistent format.

    Args:
        logger: Logger instance
        operation: S3 operation (read, write, delete, etc.)
        bucket: S3 bucket name
        key: S3 object key
        **kwargs: Additional operation details
    """
    details = f" | {kwargs}" if kwargs else ""
    logger.info(f"S3 {operation.upper()} - s3://{bucket}/{key}{details}")


def log_error_with_context(
    logger: logging.Logger, error: Exception, context: Dict[str, Any]
):
    """
    Log errors with additional context information.

    Args:
        logger: Logger instance
        error: Exception that occurred
        context: Additional context information
    """
    logger.error(f"Error: {str(error)}")
    logger.error(f"Context: {json.dumps(context, indent=2, default=str)}")
    logger.exception("Full traceback:")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Using the logger in AWS Glue Jobs
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize root logger
setup_logging(level="INFO", environment=args["env_name"])

# Get module level logger
logger = get_logger("glue.ingestion")

# Or
logger = get_logger("glue.utils")

# Example:
logger.info(f"Writing processed data to: {output_s3_path}")
log_s3_operation(
    logger,
    "write",
    output_bucket,
    f"{env_name}/staging_{file_name.split('.')[0]}/",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, logs clearly show where they came from:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj4xsxwzy4e3pu9f0uoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj4xsxwzy4e3pu9f0uoe.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Logger vs print()&lt;/strong&gt;&lt;br&gt;
print() is easy to use and requires no setup. For quick checks, it works fine. But once your pipelines grow, run in parallel, or deploy across environments, it becomes harder to track and manage.&lt;br&gt;
Here’s why I chose logger over print() for my case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control: Use log levels (DEBUG, INFO, ERROR, etc.) to filter output without touching code everywhere.&lt;/li&gt;
&lt;li&gt;Consistency: All jobs follow the same format with timestamps, module names, and job context, making logs readable and traceable.&lt;/li&gt;
&lt;li&gt;Security: You can debug in development without risking sensitive information appearing in production logs by setting log levels — something print() would require extra effort to manage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://docs.python.org/3/howto/logging.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/howto/logging.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.python.org/3/howto/logging-cookbook.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/howto/logging-cookbook.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>awsglue</category>
    </item>
    <item>
      <title>Automatic Trigger Data Pipeline with AWS using AWS CDK</title>
      <dc:creator>Kate Vu</dc:creator>
      <pubDate>Tue, 08 Jul 2025 13:24:18 +0000</pubDate>
      <link>https://dev.to/katevu/automatic-trigger-data-pipeline-with-aws-using-aws-cdk-23pl</link>
      <guid>https://dev.to/katevu/automatic-trigger-data-pipeline-with-aws-using-aws-cdk-23pl</guid>
      <description>&lt;p&gt;This blog describes how to build an automated data pipeline to ingest, simply transform data. The pipeline will use AWS Step Functions to orchestrate, and AWS Glue to transform the data. AWS SNS will be used to notify the user if there is an error or when the run is finished.&lt;br&gt;
I usually build infrastructure via AWS CDK with TypeScript, but for this one, I’ll use Python — with great help from GitHub Copilot along the way.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwnsfvsj58huzmyfgg4m.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwnsfvsj58huzmyfgg4m.webp" alt="Datapipeline diagram" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Resources:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS StepFunction&lt;/li&gt;
&lt;li&gt;AWS Glue&lt;/li&gt;
&lt;li&gt;AWS Athena&lt;/li&gt;
&lt;li&gt;AWS SNS&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create the CDK project:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir aws-cdk-glue &amp;amp;&amp;amp; cd aws-cdk-glue
cdk init app --language python
source .venv/bin/activate
python3 -m pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Create S3 buckets&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;Let’s quickly create a list of S3 buckets we’ll need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
from aws_cdk import Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class S3BucketsStack(Stack):
    def __init__(
        self, scope: Construct, construct_id: str, bucket_names: list, **kwargs
    ) -&amp;gt; None:
        super().__init__(scope, construct_id, **kwargs)

        # Create S3 buckets from the provided list of bucket names
        for bucket_name in bucket_names:
            s3.Bucket(
                self,
                f"S3Bucket-{bucket_name}",
                bucket_name=bucket_name,
                versioned=True,  # Enable versioning for the bucket
                removal_policy=RemovalPolicy.DESTROY,  # Delete the bucket on stack deletion
                auto_delete_objects=False,  # Prevent accidental deletion of objects
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, define the list of bucket names and create them in your app.py file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the S3 buckets stack
bucket_names = [
    "kate-source-data",
    "kate-staging-data",
    "kate-error-staging-data",
    "kate-transform-data",
    "kate-error-transform-data",
]
s3_buckets_stack = S3BucketsStack(
    app,
    construct_id="S3BucketsStack",
    bucket_names=bucket_names,
    env=environment,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the command below to deploy the stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cdk deploy S3BucketsStack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then head over to the AWS Console to verify that all the buckets have been created.&lt;/p&gt;
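If you prefer to script the check rather than click through the console, a small helper can diff the expected names against what exists (a sketch; the `missing_buckets` function is mine, and the boto3 call in the comment assumes your AWS credentials are configured):

```python
def missing_buckets(expected, existing):
    """Return the expected bucket names that have not been created yet."""
    return sorted(set(expected) - set(existing))


# Example with boto3 (requires AWS credentials):
#   import boto3
#   names = [b["Name"] for b in boto3.client("s3").list_buckets()["Buckets"]]
#   print(missing_buckets(bucket_names, names))
```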

&lt;p&gt;3. Generate data&lt;/p&gt;

&lt;p&gt;To generate test data, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python ./generate_data.py 1000 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create two CSV files in your current directory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customer_data.csv&lt;/code&gt; — contains 1,000 customer records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transaction_data.csv&lt;/code&gt; — contains 1,000 transaction records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These files will be used as source data in the pipeline.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import csv
import random
import argparse
import uuid
import os  # Import os for folder creation


def generate_customer_data(num_records):
    first_names = ["John", "Jane", "Alice", "Bob", "Charlie", "Diana"]
    last_names = ["Smith", "Doe", "Johnson", "Williams", "Brown", "Jones"]
    genders = ["Male", "Female"]

    data = []
    for i in range(1, num_records + 1):
        first_name = random.choice(first_names)
        last_name = random.choice(last_names)
        gender = random.choice(genders)
        data.append(
            {
                "id": i,
                "first_name": first_name,
                "last_name": last_name,
                "gender": gender,
            }
        )

    return data


def save_to_csv(data, folder, filename):
    # Ensure the folder exists
    os.makedirs(folder, exist_ok=True)
    filepath = os.path.join(folder, filename)
    with open(filepath, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)


def generate_transaction_data(customer_ids, num_records):
    transaction_types = ["Purchase", "Refund", "Exchange"]
    data = []
    for _ in range(num_records):
        transaction_id = str(uuid.uuid4())  # Generate a unique transaction ID
        customer_id = random.choice(customer_ids)
        transaction_type = random.choice(transaction_types)
        amount = round(random.uniform(10.0, 500.0), 2)
        data.append(
            {
                "transaction_id": transaction_id,
                "customer_id": customer_id,
                "transaction_type": transaction_type,
                "amounttt": amount,
            }
        )

    return data


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate customer and transaction data."
    )
    parser.add_argument(
        "num_customer_records", type=int, help="Number of customer records to generate"
    )
    parser.add_argument(
        "num_transaction_records",
        type=int,
        help="Number of transaction records to generate",
    )
    args = parser.parse_args()

    # Define the folder to save files
    folder = "test_data"

    # Generate customer data
    customer_data = generate_customer_data(args.num_customer_records)
    save_to_csv(customer_data, folder, "customer_data.csv")

    # Extract customer IDs
    customer_ids = [customer["id"] for customer in customer_data]

    # Generate transaction data
    transaction_data = generate_transaction_data(
        customer_ids, args.num_transaction_records
    )
    save_to_csv(transaction_data, folder, "transaction_data.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. Create PySpark script to ingest the data&lt;/p&gt;

&lt;p&gt;In this step, we’ll create a PySpark script to handle the data ingestion. The script will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if all required files exist in the source data bucket.
If any file is missing, it will send a notification to the SNS topic and stop the job.&lt;/li&gt;
&lt;li&gt;If all required files are present:

&lt;ul&gt;
&lt;li&gt;Read the files from S3&lt;/li&gt;
&lt;li&gt;Add ingestion_start_time and ingestion_end_time columns&lt;/li&gt;
&lt;li&gt;Write the data to the output path in Parquet format&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If there’s an error while processing the files:

&lt;ul&gt;
&lt;li&gt;Write the error file to the error bucket&lt;/li&gt;
&lt;li&gt;Notify SNS topic
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import logging
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from awsglue.utils import getResolvedOptions
import boto3
from botocore.exceptions import ClientError

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def check_files_exist(s3_client, bucket, env_name, file_path, file_names):
    """
    Check if the specified files exist in the given S3 bucket.

    :param s3_client: Boto3 S3 client
    :param bucket: Name of the S3 bucket
    :param file_names: List of file names to check
    """
    for file_name in file_names:
        try:
            s3_client.head_object(Bucket=bucket, Key=f"{file_path}/{file_name}")
            logger.info(f"File {file_path}/{file_name} exists in bucket {bucket}.")
        except s3_client.exceptions.ClientError as e:
            logger.error(
                f"File {file_path}/{file_name} does not exist in bucket {bucket}: {e}. Exiting."
            )
            sys.exit(1)


def delete_directory_in_s3(s3_client, bucket, directory_path):
    """
    Delete all files in the specified directory in the S3 bucket.

    :param s3_client: Boto3 S3 client
    :param bucket: Name of the S3 bucket
    :param directory_path: Path of the directory to delete
    """
    try:
        objects = s3_client.list_objects_v2(Bucket=bucket, Prefix=directory_path)
        if "Contents" in objects:
            for obj in objects["Contents"]:
                s3_client.delete_object(Bucket=bucket, Key=obj["Key"])
                logger.info(f"Deleted file: {obj['Key']} from bucket {bucket}")
        else:
            logger.info(f"No files found in {directory_path} to delete.")
    except Exception as e:
        logger.error(f"Error deleting directory {directory_path}: {e}")
        sys.exit(1)


def notify_sns(sns_client, topic_arn, message):
    """
    Send a notification to an SNS topic.

    :param sns_client: Boto3 SNS client
    :param topic_arn: ARN of the SNS topic
    :param message: Message to send
    """
    try:
        sns_client.publish(TopicArn=topic_arn, Message=message)
        logger.info(f"Notification sent to SNS topic: {topic_arn}")
    except ClientError as e:
        logger.error(f"Failed to send notification to SNS topic: {e}")
        sys.exit(1)


def process_file(
    spark,
    s3_client,
    sns_client,
    sns_topic_arn,
    input_bucket,
    output_bucket,
    error_bucket,
    file_path,
    file_name,
    env_name,
    current_time,
):
    """
    Process a single file: read from S3, transform, and write to S3.

    :param spark: SparkSession object
    :param s3_client: Boto3 S3 client
    :param sns_client: Boto3 SNS client
    :param sns_topic_arn: ARN of the SNS topic
    :param input_bucket: Name of the input S3 bucket
    :param output_bucket: Name of the output S3 bucket
    :param error_bucket: Name of the error S3 bucket
    :param file_name: Name of the file to process
    :param env_name: Environment name (e.g., dev, prod)
    """
    input_s3_path = f"s3://{input_bucket}/{file_path}/{file_name}"
    output_s3_path = (
        f"s3://{output_bucket}/{env_name}/staging_{file_name.split('.')[0]}/"
    )
    error_s3_path = f"s3://{error_bucket}/{env_name}/error_{file_name}"

    logger.info(f"Processing file: {file_name}")
    logger.info(f"Reading from: {input_s3_path}")

    try:
        # Get the current UTC time for ingestion start
        ingestion_start_time = current_time.strftime("%Y-%m-%d %H:%M:%S")

        # Read CSV file from S3
        df = spark.read.csv(input_s3_path, header=True, inferSchema=True)
        logger.info("Finished reading file from S3")
        if not df.columns:
            raise RuntimeError(
                "The DataFrame is empty. Cannot proceed with processing."
            )

        # Add ingestion_start_time column to the DataFrame
        df = df.withColumn("ingestion_start_time", lit(ingestion_start_time))

        # Get the current UTC time for ingestion finish
        ingestion_finish_time = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

        # Add ingestion_finish_time column to the DataFrame
        df = df.withColumn("ingestion_finish_time", lit(ingestion_finish_time))

        # Write the DataFrame to S3 in Parquet format
        logger.info(f"Writing to: {output_s3_path}")
        df.write.parquet(output_s3_path, mode="overwrite")
        logger.info(f"Successfully processed and saved file: {file_name}")

    except Exception as e:
        logger.error(f"Error processing file {file_name}: {e}")
        logger.info(f"Copying original file to error bucket: {error_s3_path}")
        try:
            s3_client.copy_object(
                Bucket=error_bucket,
                CopySource={"Bucket": input_bucket, "Key": f"{file_path}/{file_name}"},
                Key=f"{env_name}/error_{file_name}",
            )
            logger.info(f"Successfully copied file {file_name} to error bucket.")
        except ClientError as copy_error:
            logger.error(
                f"Failed to copy file {file_name} to error bucket: {copy_error}"
            )

        # Notify SNS about the error
        error_message = f"Error processing file {file_name} in environment {env_name}. Original file copied to error bucket."
        notify_sns(sns_client, sns_topic_arn, error_message)


def main():
    # Get arguments passed to the Glue job
    args = getResolvedOptions(
        sys.argv,
        [
            "env_name",
            "input_bucket",
            "output_bucket",
            "error_bucket",
            "file_path",
            "file_names",
            "sns_topic_arn",  # Added SNS topic ARN argument
            "JOB_NAME",
        ],
    )

    # Extract input and output S3 paths from arguments
    input_bucket = args["input_bucket"]
    output_bucket = args["output_bucket"]
    error_bucket = args["error_bucket"]
    file_path = args["file_path"]
    file_names = args["file_names"].split(
        ","
    )  # Expecting a comma-separated list of file names
    env_name = args["env_name"]
    sns_topic_arn = args["sns_topic_arn"]  # Extract SNS topic ARN

    # Initialize S3 and SNS clients
    s3_client = boto3.client("s3")
    sns_client = boto3.client("sns")

    # Check if files exist in the input bucket
    check_files_exist(s3_client, input_bucket, env_name, file_path, file_names)

    # Delete the entire directory in the output bucket before processing files
    output_directory_path = f"{env_name}/"
    delete_directory_in_s3(s3_client, output_bucket, output_directory_path)

    # Delete the error bucket folder before processing files
    error_directory_path = f"{env_name}/"
    delete_directory_in_s3(s3_client, error_bucket, error_directory_path)

    # Initialize Spark session
    spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

    # Process each file
    current_time = datetime.utcnow()
    for file_name in file_names:
        process_file(
            spark,
            s3_client,
            sns_client,
            sns_topic_arn,
            input_bucket,
            output_bucket,
            error_bucket,
            file_path,
            file_name,
            env_name,
            current_time,
        )

    # Stop the Spark session
    spark.stop()
    logger.info("Spark session stopped.")


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5. Create PySpark script to transform the data&lt;br&gt;
For this example, the transformation is very simple: just correcting the name of a column in the transaction_data file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import logging
from datetime import datetime
from pyspark.sql import SparkSession
from awsglue.utils import getResolvedOptions
from pyspark.sql.functions import col
import boto3
from botocore.exceptions import ClientError

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

def notify_sns(sns_client, topic_arn, message):
    """
    Send a notification to an SNS topic.

    :param sns_client: Boto3 SNS client
    :param topic_arn: ARN of the SNS topic
    :param message: Message to send
    """
    try:
        sns_client.publish(TopicArn=topic_arn, Message=message)
        logger.info(f"Notification sent to SNS topic: {topic_arn}")
    except ClientError as e:
        logger.error(f"Failed to send notification to SNS topic: {e}")

def process_file(
    spark,
    input_bucket,
    output_bucket,
    error_bucket,
    env_name,
    file_name,
    sns_client,
    sns_topic_arn,
):
    """
    Process individual files based on their type.

    :param spark: SparkSession object
    :param input_bucket: Name of the input S3 bucket
    :param output_bucket: Name of the output S3 bucket
    :param error_bucket: Name of the error S3 bucket
    :param env_name: Environment name (e.g., dev, prod)
    :param file_name: Name of the file to process
    :param sns_client: Boto3 SNS client
    :param sns_topic_arn: ARN of the SNS topic
    """
    input_path = f"s3://{input_bucket}/{env_name}/staging_{file_name}"
    output_path = f"s3://{output_bucket}/{env_name}/transformation_{file_name}"
    error_path = f"s3://{error_bucket}/{env_name}/{file_name}_error.log"

    logger.info(f"Processing file: {file_name}")
    logger.info(f"Reading data from: {input_path}")

    try:
        # Read Parquet file from S3
        df = spark.read.parquet(input_path)

        if file_name == "customer_data":
            # Copy customer_data as is
            logger.info("Copying customer_data without transformation.")
        elif file_name == "transaction_data":
            # Rename column 'amounttt' to 'amount' for transaction_data
            logger.info("Renaming column 'amounttt' to 'amount' for transaction_data.")
            df = df.withColumnRenamed("amounttt", "amount")
        else:
            logger.warning(f"Unknown file type: {file_name}. Skipping transformation.")

        # Write the processed data to S3 in Parquet format
        logger.info(f"Writing processed data to: {output_path}")
        df.write.parquet(output_path, mode="overwrite")

        logger.info(f"Successfully processed file: {file_name}")
    except Exception as e:
        logger.error(f"Error processing file {file_name}: {e}")
        logger.info(f"Copying original file to error bucket: {error_path}")
        try:
            s3_client = boto3.client("s3")
            s3_client.copy_object(
                Bucket=error_bucket,
                CopySource={
                    "Bucket": input_bucket,
                    "Key": f"{env_name}/staging_{file_name}",
                },
                Key=f"{env_name}/{file_name}_error.log",
            )
            logger.info(f"Successfully copied file {file_name} to error bucket.")
        except ClientError as copy_error:
            logger.error(
                f"Failed to copy file {file_name} to error bucket: {copy_error}"
            )

        # Notify SNS about the error
        error_message = f"Error processing file {file_name} in environment {env_name}. Original file copied to error bucket."
        notify_sns(sns_client, sns_topic_arn, error_message)

def main():
    # Get arguments passed to the Glue job
    args = getResolvedOptions(
        sys.argv,
        [
            "JOB_NAME",
            "env_name",
            "input_bucket",
            "output_bucket",
            "error_bucket",
            "file_names",
            "sns_topic_arn",  # Added SNS topic ARN argument
        ],
    )

    # Extract arguments
    env_name = args["env_name"]
    input_bucket = args["input_bucket"]
    output_bucket = args["output_bucket"]
    error_bucket = args["error_bucket"]
    file_names = args["file_names"].split(",")  # Comma-separated list of file names
    sns_topic_arn = args["sns_topic_arn"]

    # Initialize Spark session
    spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

    # Initialize SNS client
    sns_client = boto3.client("sns")

    # Loop through files and process them
    for file_name in file_names:
        process_file(
            spark,
            input_bucket,
            output_bucket,
            error_bucket,
            env_name,
            file_name,
            sns_client,
            sns_topic_arn,
        )

    # Stop the Spark session
    spark.stop()
    logger.info("Spark session stopped.")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6. Create Glue jobs for ingestion and transformation&lt;br&gt;
First, create a new construct to handle this. The construct will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an IAM role for the Glue jobs with permissions to:

&lt;ul&gt;
&lt;li&gt;Read from the source data bucket&lt;/li&gt;
&lt;li&gt;Write to the destination and error buckets&lt;/li&gt;
&lt;li&gt;Publish messages to the SNS topic&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create Glue jobs
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from aws_cdk import (
    aws_glue as glue,
    aws_iam as iam,
    aws_s3_assets as s3_assets,
)
from constructs import Construct
import os.path as path


class GlueContruct(Construct):

    def __init__(
        self,
        scope: Construct,
        id: str,
        env_name: str,
        input_bucket: str,
        output_bucket: str,
        error_bucket: str,
        file_names: list,
        script_file_path: str,
        glue_job_prefix: str,
        sns_topic_arn: str,
        **kwargs,
    ) -&amp;gt; None:
        super().__init__(scope, id, **kwargs)
        # Define an IAM role for the Glue job
        glue_role = iam.Role(
            self,
            "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name(
                    "service-role/AWSGlueServiceRole"
                )
            ],
        )

        # Add permissions to read from the input bucket and write to the output bucket
        glue_role.add_to_policy(
            iam.PolicyStatement(
                actions=["s3:GetObject", "s3:ListBucket"],
                resources=[
                    f"arn:aws:s3:::{input_bucket}/{env_name}/*",
                    f"arn:aws:s3:::{input_bucket}/{env_name}",
                ],
            )
        )

        glue_role.add_to_policy(
            iam.PolicyStatement(
                actions=[
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                ],
                resources=[
                    f"arn:aws:s3:::{output_bucket}/{env_name}/*",
                    f"arn:aws:s3:::{output_bucket}/{env_name}",
                    f"arn:aws:s3:::{error_bucket}/{env_name}/*",
                    f"arn:aws:s3:::{error_bucket}/{env_name}",
                ],
            )
        )

        # Add permissions to publish messages to the SNS topic
        glue_role.add_to_policy(
            iam.PolicyStatement(
                actions=["sns:Publish"],
                resources=[sns_topic_arn],
            )
        )

        # Upload the Glue script to an S3 bucket using an S3 asset
        glue_script_asset = s3_assets.Asset(
            self,
            "GlueScriptAsset",
            path=path.join(
                path.dirname(__file__), script_file_path
            ),  # Replace with the local path to your script
        )

        glue_script_asset.grant_read(glue_role)

        # Define the Glue job
        self.glue_job = glue.CfnJob(
            self,
            glue_job_prefix + env_name,
            name=f"{glue_job_prefix}-{env_name}",
            role=glue_role.role_arn,
            command={
                "name": "glueetl",
                "scriptLocation": glue_script_asset.s3_object_url,
                "pythonVersion": "3",
            },
            default_arguments={
                "--env_name": env_name,
                "--input_bucket": input_bucket,
                "--output_bucket": output_bucket,
                "--error_bucket": error_bucket,
                "--file_names": ",".join(file_names),
                "--file_path": "test",  # Assuming files are in a 'data' folder for now
                "--sns_topic_arn": sns_topic_arn,
            },
            max_retries=0,
            timeout=10,
            glue_version="5.0",
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
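&lt;p&gt;Inside the PySpark script, these &lt;code&gt;default_arguments&lt;/code&gt; arrive via &lt;code&gt;sys.argv&lt;/code&gt; and are usually read with &lt;code&gt;awsglue.utils.getResolvedOptions&lt;/code&gt;. As a minimal, dependency-free sketch of that parsing (the real helper also validates and handles Glue’s reserved arguments; the bucket and environment values below are placeholders):&lt;/p&gt;

```python
def resolve_options(argv, option_names):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    pick out "--name value" pairs for the requested option names."""
    resolved = {}
    for name in option_names:
        flag = f"--{name}"
        if flag in argv:
            resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Example: the arguments the construct passes to the ingestion job
argv = [
    "script.py",
    "--env_name", "dev",
    "--input_bucket", "my-input-bucket",
    "--file_names", "customer_data.csv,transaction_data.csv",
]
args = resolve_options(argv, ["env_name", "input_bucket", "file_names"])
file_names = args["file_names"].split(",")  # undo the ",".join(...) from the construct
```

&lt;p&gt;In the actual Glue job you would call &lt;code&gt;getResolvedOptions(sys.argv, [...])&lt;/code&gt; instead of this stand-in.&lt;/p&gt;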


&lt;p&gt;Now we can use the new construct to create two Glue jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion job: runs the PySpark script that validates and ingests the data&lt;/li&gt;
&lt;li&gt;Transformation job: runs the script that corrects the column name
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Create the Glue ingestion construct
        glue_ingestion = GlueContruct(
            self,
            "GlueIngestion",
            env_name=env_name,
            input_bucket=account_config["ingestion"]["input_bucket"],
            output_bucket=account_config["ingestion"]["output_bucket"],
            error_bucket=account_config["ingestion"]["error_bucket"],
            file_names=account_config["ingestion"]["file_names"],
            script_file_path="../../scripts/glue/ingestion.py",  # Path to your Glue script
            glue_job_prefix="IngestionJob",
            sns_topic_arn=sns_topic.topic_arn,  # Pass the SNS topic ARN
        )

        glue_transformation = GlueContruct(
            self,
            "GlueTransformation",
            env_name=env_name,
            input_bucket=account_config["transformation"]["input_bucket"],
            output_bucket=account_config["transformation"]["output_bucket"],
            error_bucket=account_config["transformation"]["error_bucket"],
            file_names=account_config["transformation"]["file_names"],
            script_file_path="../../scripts/glue/transformation.py",  # Path to your Glue script
            glue_job_prefix="TransformationJob",
            sns_topic_arn=sns_topic.topic_arn,  # Pass the SNS topic ARN
        )
        # Output Glue job names
        add_output(self, "GlueIngestionJobName", glue_ingestion.glue_job.name)
        add_output(self, "GlueTransformationJobName", glue_transformation.glue_job.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
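&lt;p&gt;The calls above assume an &lt;code&gt;account_config&lt;/code&gt; dictionary loaded elsewhere in the app, for example from a per-environment JSON file. A hypothetical shape that satisfies the keys used here (all bucket and file names are placeholders):&lt;/p&gt;

```python
# Hypothetical per-environment configuration; bucket and file names are placeholders.
account_config = {
    "ingestion": {
        "input_bucket": "my-source-bucket",
        "output_bucket": "my-staging-bucket",
        "error_bucket": "my-error-bucket",
        "file_names": ["customer_data.csv", "transaction_data.csv"],
    },
    "transformation": {
        "input_bucket": "my-staging-bucket",
        "output_bucket": "my-transformation-bucket",
        "error_bucket": "my-error-bucket",
        "file_names": ["customer_data.csv", "transaction_data.csv"],
    },
}

# Every key the construct instantiations read must be present in each section
required_keys = {"input_bucket", "output_bucket", "error_bucket", "file_names"}
```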


&lt;p&gt;7. Create the database, tables, and crawlers&lt;br&gt;
Since we want to query the data later using Amazon Athena, we need to create the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS Glue database&lt;/li&gt;
&lt;li&gt;AWS Glue tables&lt;/li&gt;
&lt;li&gt;Glue crawlers to crawl the data and update the AWS Glue Data Catalog.
Crawlers can create tables automatically, which is convenient. However, creating tables manually gives you more control over the metadata, which can be useful if your AWS account has Lake Formation enabled and you want to manage access through the data lake.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from aws_cdk import aws_glue as glue
from aws_cdk import aws_iam as iam
from constructs import Construct
import aws_cdk.aws_lakeformation as lakeformation
from aws_cdk_glue.utils.utils import add_output  # Import the add_output function


def create_glue_role(
    scope: Construct,
    id: str,
    env_name: str,
    output_bucket: str,
    account_id: str,
    region: str,
) -&amp;gt; iam.Role:
    """Create an IAM role for the Glue crawler with necessary permissions."""
    crawler_role = iam.Role(
        scope,
        id,
        assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        managed_policies=[
            iam.ManagedPolicy.from_aws_managed_policy_name(
                "service-role/AWSGlueServiceRole"
            )
        ],
    )

    # Add permissions to access the staging bucket
    crawler_role.add_to_policy(
        iam.PolicyStatement(
            actions=[
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
            ],
            resources=[
                f"arn:aws:s3:::{output_bucket}",
                f"arn:aws:s3:::{output_bucket}/{env_name}/*",
            ],
        )
    )

    # Add permissions to access Athena databases
    crawler_role.add_to_policy(
        iam.PolicyStatement(
            actions=[
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:CreateTable",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:BatchGetPartition",
                "glue:BatchCreatePartition",
            ],
            resources=[
                f"arn:aws:glue:{region}:{account_id}:catalog",
                f"arn:aws:glue:{region}:{account_id}:database/{env_name}_database",
            ],
        )
    )

    return crawler_role


def create_glue_table(
    scope: Construct,
    id: str,
    table_name: str,  # Pass table name as a parameter
    env_name: str,
    output_bucket: str,
    account_id: str,
    glue_database: glue.CfnDatabase,
) -&amp;gt; glue.CfnTable:
    """Create a Glue table for the Athena database."""
    glue_table = glue.CfnTable(
        scope,
        id,
        catalog_id=account_id,  # Assign account_id to catalog_id
        database_name=glue_database.ref,  # Reference the Glue database
        table_input={
            "name": table_name,  # Use the passed table name
            "storageDescriptor": {
                "location": f"s3://{output_bucket}/{env_name}/{table_name}/",  # Path to the data in S3
                "inputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",  # Input format for the table
                "outputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",  # Output format for the table
                "serdeInfo": {
                    "serializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",  # SerDe library
                    "parameters": {"classification": "Parquet"},
                },
            },
            "tableType": "EXTERNAL_TABLE",  # Define the table type as external
        },
    )
    return glue_table


class GlueTable(Construct):
    def __init__(
        self,
        scope: Construct,
        id: str,
        env_name: str,
        staging_bucket: str,
        staging_file_names: list,
        transformation_bucket: str,
        transformation_file_names: list,
        account_id: str,
        region: str,
        **kwargs,
    ) -&amp;gt; None:
        super().__init__(scope, id, **kwargs)
        database_name = f"{env_name}_database"

        tag_key = "kate"
        tag_values = ["test"]

        # Define the Glue database
        glue_database = glue.CfnDatabase(
            self,
            "GlueDatabase",
            catalog_id=account_id,
            database_input={"name": database_name},
            database_name=database_name,
        )

        lf_tag_pair_property = lakeformation.CfnTagAssociation.LFTagPairProperty(
            catalog_id=account_id, tag_key=tag_key, tag_values=tag_values
        )

        tag_association = lakeformation.CfnTagAssociation(
            self,
            "TagAssociation",
            lf_tags=[lf_tag_pair_property],
            resource=lakeformation.CfnTagAssociation.ResourceProperty(
                database=lakeformation.CfnTagAssociation.DatabaseResourceProperty(
                    catalog_id=account_id, name=database_name
                )
            ),
        )
        tag_association.node.add_dependency(glue_database)

        crawler_role_staging: iam.Role = create_glue_role(
            self,
            f"GlueCrawlerRoleStaging-{env_name}",
            env_name,
            staging_bucket,
            account_id,
            region,
        )

        # Grant permissions for database
        grant_staging_crawler_database_access = lakeformation.CfnPermissions(
            self,
            "LFDatabasePermissions",
            data_lake_principal={
                "dataLakePrincipalIdentifier": crawler_role_staging.role_arn
            },
            resource=lakeformation.CfnPermissions.ResourceProperty(
                database_resource=lakeformation.CfnPermissions.DatabaseResourceProperty(
                    catalog_id=account_id, name=database_name
                ),
            ),
            permissions=["ALTER", "DROP", "DESCRIBE", "CREATE_TABLE"],
        )

        grant_staging_crawler_database_access.node.add_dependency(
            glue_database, tag_association
        )

        for file_name in staging_file_names:
            # Remove file type from file name
            table_name = f"staging_{os.path.splitext(file_name)[0]}"  # Extract the base name without file extension

            # Create a Glue table for each file name
            glue_table = create_glue_table(
                self,
                f"StagingGlueTable-{table_name}",
                table_name=table_name,  # Use the file name as the table name
                env_name=env_name,
                output_bucket=staging_bucket,
                account_id=account_id,
                glue_database=glue_database,
            )
            # Grant permissions for tables
            grant_table_access = lakeformation.CfnPermissions(
                self,
                f"StagingGlueTableLFTablePermissions-{file_name}",
                data_lake_principal={
                    "dataLakePrincipalIdentifier": crawler_role_staging.role_arn
                },
                resource=lakeformation.CfnPermissions.ResourceProperty(
                    table_resource=lakeformation.CfnPermissions.TableResourceProperty(
                        database_name=database_name, name=table_name
                    )
                ),
                permissions=["SELECT", "ALTER", "DROP", "INSERT", "DESCRIBE"],
            )
            grant_table_access.node.add_dependency(glue_database, tag_association)

            # Output the Glue table name using add_output
            add_output(
                self,
                f"StagingGlueTableName-{table_name}",
                glue_table.ref,
            )

        # Define the Glue crawler
        glue_crawler_staging = glue.CfnCrawler(
            self,
            "GlueCrawlerStaging",
            name=f"{env_name}_staging_crawler",
            role=crawler_role_staging.role_arn,  # Use the created IAM role
            database_name=glue_database.ref,
            targets={"s3Targets": [{"path": f"s3://{staging_bucket}/{env_name}/"}]},
        )

        # Output the Glue crawler name using add_output
        add_output(
            self,
            "GlueStagingCrawlerName",
            glue_crawler_staging.ref,
        )

        # Transformation Process
        crawler_role_transformation: iam.Role = create_glue_role(
            self,
            f"GlueCrawlerRoleTransformation-{env_name}",
            env_name,
            transformation_bucket,
            account_id,
            region,
        )

        grant_transformation_crawler_database_access = lakeformation.CfnPermissions(
            self,
            "LFDatabasePermissionsTransformation",
            data_lake_principal={
                "dataLakePrincipalIdentifier": crawler_role_transformation.role_arn
            },
            resource=lakeformation.CfnPermissions.ResourceProperty(
                database_resource=lakeformation.CfnPermissions.DatabaseResourceProperty(
                    catalog_id=account_id, name=database_name
                ),
            ),
            permissions=["ALTER", "DROP", "DESCRIBE", "CREATE_TABLE"],
        )

        grant_transformation_crawler_database_access.node.add_dependency(
            glue_database, tag_association
        )

        for file_name in transformation_file_names:
            table_name = f"transformation_{os.path.splitext(file_name)[0]}"

            glue_table = create_glue_table(
                self,
                f"TransformationGlueTable-{table_name}",
                table_name=table_name,
                env_name=env_name,
                output_bucket=transformation_bucket,
                account_id=account_id,
                glue_database=glue_database,
            )

            grant_table_access = lakeformation.CfnPermissions(
                self,
                f"TransformationGlueTableLFTablePermissions-{file_name}",
                data_lake_principal={
                    "dataLakePrincipalIdentifier": crawler_role_transformation.role_arn
                },
                resource=lakeformation.CfnPermissions.ResourceProperty(
                    table_resource=lakeformation.CfnPermissions.TableResourceProperty(
                        database_name=database_name, name=table_name
                    )
                ),
                permissions=["SELECT", "ALTER", "DROP", "INSERT", "DESCRIBE"],
            )
            grant_table_access.node.add_dependency(glue_database, tag_association)

            # Output the Glue table name using add_output
            add_output(
                self,
                f"TransformationGlueTableName-{table_name}",
                glue_table.ref,
            )

        glue_crawler_transformation = glue.CfnCrawler(
            self,
            "GlueCrawlerTransformation",
            name=f"{env_name}_transformation_crawler",
            role=crawler_role_transformation.role_arn,
            database_name=glue_database.ref,
            targets={
                "s3Targets": [{"path": f"s3://{transformation_bucket}/{env_name}/"}]
            },
        )

        # Output the Glue crawler name using add_output
        add_output(
            self,
            "GlueTransformationCrawlerName",
            glue_crawler_transformation.ref,
        )

        self.glue_crawler_staging = glue_crawler_staging
        self.glue_crawler_transformation = glue_crawler_transformation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
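&lt;p&gt;The loops above derive each table name from a file name with &lt;code&gt;os.path.splitext&lt;/code&gt;. Pulled out as a small helper, the naming scheme looks like this:&lt;/p&gt;

```python
import os

def table_name_for(file_name, layer):
    """Build the Glue table name used above: layer prefix, underscore,
    then the file name without its extension."""
    return f"{layer}_{os.path.splitext(file_name)[0]}"

staging_table = table_name_for("customer_data.csv", "staging")
transformation_table = table_name_for("transaction_data.csv", "transformation")
```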


&lt;p&gt;If Lake Formation is enabled in your AWS account, make sure your CDK execution role has the necessary permissions to create and access Glue resources.&lt;br&gt;
In my case, I’m using LF-Tags to grant the required permissions to the CDK execution role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2vwmjiaijvjkat2jqw9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2vwmjiaijvjkat2jqw9.webp" alt="DataLake_grant_permission" width="800" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And ensure the Glue crawler role has the appropriate Lake Formation permissions.&lt;/p&gt;

&lt;p&gt;8. Orchestrate the Flow Using Step Functions&lt;br&gt;
Now we can orchestrate the entire pipeline using AWS Step Functions. We want the state machine to complete only when every step has run successfully.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For Glue jobs, it’s straightforward. Just add the following parameter to your task definition to ensure the Step Function waits for the job to complete before moving to the next step.
&lt;code&gt;integration_pattern=sfn.IntegrationPattern.RUN_JOB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For crawlers, unfortunately, we cannot use the RUN_JOB pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;integration_pattern (Optional[IntegrationPattern]) – AWS Step Functions integrates with services directly in the Amazon States Language. You can control these AWS services using service integration patterns. Depending on the AWS Service, the Service Integration Pattern availability will vary. Default: - IntegrationPattern.REQUEST_RESPONSE for most tasks. IntegrationPattern.RUN_JOB for the following exceptions: BatchSubmitJob, EmrAddStep, EmrCreateCluster, EmrTerminationCluster, and EmrContainersStartJobRun.(&lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_stepfunctions_tasks/GlueStartJobRun.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_stepfunctions_tasks/GlueStartJobRun.html&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
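&lt;p&gt;Conceptually, the Wait/GetCrawler/Choice combination polls the crawler until it reports the READY state. The same logic in plain Python, with a stubbed &lt;code&gt;get_crawler_state&lt;/code&gt; standing in for the &lt;code&gt;glue:GetCrawler&lt;/code&gt; call (the state sequence below is simulated):&lt;/p&gt;

```python
def wait_for_crawler(get_crawler_state, max_polls=10):
    """Mirror the Wait/GetCrawler/Choice loop in the state machine:
    keep polling until the crawler reports READY or FAILED."""
    for _ in range(max_polls):
        state = get_crawler_state()  # in AWS: glue.get_crawler(Name=...)["Crawler"]["State"]
        if state == "READY":
            return "SUCCEEDED"
        if state == "FAILED":
            return "FAILED"
        # any other state (e.g. RUNNING, STOPPING): the real state machine
        # waits 30 seconds here before polling again
    return "TIMED_OUT"

# Simulate a crawler that runs for two polls and then finishes
states = iter(["RUNNING", "STOPPING", "READY"])
result = wait_for_crawler(lambda: next(states))
```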

&lt;p&gt;To handle this, we can use a Choice state and additional logic to wait until the crawler has actually completed. (&lt;a href="https://repost.aws/questions/QUTgHzHs6bSN6n_79bCYohaQ/glue-crawler-in-state-machine-shows-as-complete-before-glue-data-catalog-is-updated" rel="noopener noreferrer"&gt;https://repost.aws/questions/QUTgHzHs6bSN6n_79bCYohaQ/glue-crawler-in-state-machine-shows-as-complete-before-glue-data-catalog-is-updated&lt;/a&gt;)&lt;br&gt;
The Step Functions flow will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth81oexuhysd4ut1f0em.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth81oexuhysd4ut1f0em.webp" alt=" " width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Define the ingestion Glue job task
        ingestion_glue_task = tasks.GlueStartJobRun(
            self,
            "IngestionGlueJob",
            glue_job_name=ingestion_glue_job_name,
            integration_pattern=sfn.IntegrationPattern.RUN_JOB,  # Wait for job completion
            arguments=sfn.TaskInput.from_object(
                {
                    "--file_path": sfn.JsonPath.string_at(
                        "$.file_path"
                    ),  # Pass file_path from input
                }
            ),
        )

        # Define the transformation Glue job task
        transformation_glue_task = tasks.GlueStartJobRun(
            self,
            "TransformationGlueJob",
            glue_job_name=transformation_glue_job_name,
            integration_pattern=sfn.IntegrationPattern.RUN_JOB,  # Wait for job completion
        )

        # Define the Glue crawler staging task
        glue_crawler_staging_task = tasks.CallAwsService(
            self,
            "GlueCrawlerStagingTask",
            service="glue",
            action="startCrawler",
            parameters={"Name": glue_crawler_staging_name},
            iam_resources=[
                f"arn:aws:glue:{region}:{account}:crawler/{glue_crawler_staging_name}"
            ],
        )

        # Define the Glue crawler transformation task
        glue_crawler_transformation_task = tasks.CallAwsService(
            self,
            "GlueCrawlerTransformationTask",
            service="glue",
            action="startCrawler",
            parameters={"Name": glue_crawler_transformation_name},
            iam_resources=[
                f"arn:aws:glue:{region}:{account}:crawler/{glue_crawler_transformation_name}"
            ],
        )

        # Define the SNS publish task
        sns_publish_task = tasks.CallAwsService(
            self,
            "SNSPublishTask",
            service="sns",
            action="publish",
            parameters={
                "TopicArn": sns_topic_arn,
                "Message": f"Step Function {env_name}-DataPipelineStateMachine has completed successfully.",
            },
            iam_resources=[f"arn:aws:sns:{region}:{account}:*"],
        )

        # Wait state for the staging crawler
        wait_staging = sfn.Wait(
            self,
            "WaitForStagingCrawler",
            time=sfn.WaitTime.duration(Duration.seconds(30)),
        )

        # GetCrawler state for the staging crawler
        get_staging_crawler = tasks.CallAwsService(
            self,
            "GetStagingCrawlerState",
            service="glue",
            action="getCrawler",
            parameters={"Name": glue_crawler_staging_name},
            iam_resources=[
                f"arn:aws:glue:{region}:{account}:crawler/{glue_crawler_staging_name}"
            ],
        )

        # Success and fail states for the staging crawler
        staging_success = sfn.Succeed(self, "StagingCrawlerSuccess")
        staging_failed = sfn.Fail(self, "StagingCrawlerFailed")

        # Choice state for the staging crawler
        staging_crawler_complete = sfn.Choice(self, "StagingCrawlerComplete")
        staging_crawler_complete.when(
            sfn.Condition.string_equals("$.Crawler.State", "READY"), staging_success
        )
        staging_crawler_complete.when(
            sfn.Condition.string_equals("$.Crawler.State", "FAILED"), staging_failed
        )
        staging_crawler_complete.otherwise(wait_staging)

        # Wait state for the transformation crawler
        wait_transformation = sfn.Wait(
            self,
            "WaitForTransformationCrawler",
            time=sfn.WaitTime.duration(Duration.seconds(30)),
        )

        # GetCrawler state for the transformation crawler
        get_transformation_crawler = tasks.CallAwsService(
            self,
            "GetTransformationCrawlerState",
            service="glue",
            action="getCrawler",
            parameters={"Name": glue_crawler_transformation_name},
            iam_resources=[
                f"arn:aws:glue:{region}:{account}:crawler/{glue_crawler_transformation_name}"
            ],
        )

        # Success and fail states for the transformation crawler
        transformation_success = sfn.Succeed(self, "TransformationCrawlerSuccess")
        transformation_failed = sfn.Fail(self, "TransformationCrawlerFailed")

        # Choice state for the transformation crawler
        transformation_crawler_complete = sfn.Choice(
            self, "TransformationCrawlerComplete"
        )
        transformation_crawler_complete.when(
            sfn.Condition.string_equals("$.Crawler.State", "READY"),
            transformation_success,
        )
        transformation_crawler_complete.when(
            sfn.Condition.string_equals("$.Crawler.State", "FAILED"),
            transformation_failed,
        )
        transformation_crawler_complete.otherwise(wait_transformation)

        # Run transformation Glue job and Glue crawler staging in parallel
        parallel_tasks = sfn.Parallel(self, "ParallelTasks")
        parallel_tasks.branch(
            transformation_glue_task.next(glue_crawler_transformation_task)
            .next(wait_transformation)
            .next(get_transformation_crawler)
            .next(transformation_crawler_complete)
        )
        parallel_tasks.branch(
            glue_crawler_staging_task.next(wait_staging)
            .next(get_staging_crawler)
            .next(staging_crawler_complete)
        )

        # Chain the ingestion Glue job, parallel tasks, transformation crawler, and SNS publish task
        definition = ingestion_glue_task.next(parallel_tasks).next(sns_publish_task)

        # Create the Step Function
        self.state_machine = sfn.StateMachine(
            self,
            f"{env_name}-DataPipelineStateMachine",
            definition=definition,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make the pipeline fully event-driven, we’ll add a trigger that starts the Step Function automatically whenever new files are added to the source bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up an S3 Event Notification: Configure the source S3 bucket to send an event to a Lambda function whenever a new object is created.&lt;/li&gt;
&lt;li&gt;Lambda function: Check whether all required files (e.g., customer_data.csv, transaction_data.csv) are present in the bucket. If they are, start the Step Function.
&lt;/li&gt;
&lt;/ul&gt;
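&lt;p&gt;A possible sketch of the &lt;code&gt;trigger.handler&lt;/code&gt; Lambda; the environment variables match the ones set on the function in the snippet that follows, and the pure file-presence check is factored out so it can be tested without AWS:&lt;/p&gt;

```python
import json
import os

def all_required_files_present(existing_keys, required_files, prefix):
    """True when every required file exists under the environment prefix."""
    return all(f"{prefix}/{name}" in existing_keys for name in required_files)

def handler(event, context):
    # boto3 is imported lazily so the pure helper above stays testable offline
    import boto3

    bucket = os.environ["BUCKET_NAME"]
    required = os.environ["FILE_NAMES"].split(",")
    # The object key looks like "dev/customer_data.csv"; the first segment is the env prefix
    prefix = event["Records"][0]["s3"]["object"]["key"].split("/")[0]

    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/")
    keys = {obj["Key"] for obj in listing.get("Contents", [])}

    if all_required_files_present(keys, required, prefix):
        boto3.client("stepfunctions").start_execution(
            stateMachineArn=os.environ["STEP_FUNCTION_ARN"],
            input=json.dumps({"file_path": prefix}),  # read as $.file_path by the ingestion task
        )

# The pure check can be exercised directly:
ready = all_required_files_present(
    {"dev/customer_data.csv", "dev/transaction_data.csv"},
    ["customer_data.csv", "transaction_data.csv"],
    "dev",
)
```

&lt;p&gt;Note that &lt;code&gt;list_objects_v2&lt;/code&gt; returns at most 1,000 keys per call; for larger buckets you would paginate.&lt;/p&gt;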

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Reference the existing S3 bucket
        input_bucket = s3.Bucket.from_bucket_name(
            self,
            "ExistingInputBucket",
            bucket_name=input_bucket_name,
        )

        # Create a Lambda function to trigger the Step Function
        trigger_lambda = _lambda.Function(
            self,
            "TriggerLambda",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="trigger.handler",
            code=_lambda.Code.from_asset("./scripts/lambda/"),  # Path to Lambda code
            environment={
                "STEP_FUNCTION_ARN": self.state_machine.state_machine_arn,
                "REGION": region,
                "ACCOUNT": account,
                "BUCKET_NAME": input_bucket.bucket_name,
                "FILE_NAMES": ",".join(
                    file_names
                ),  # Convert list to comma-separated string
            },
        )

        # Grant permissions to the Lambda function
        input_bucket.grant_read(trigger_lambda)
        self.state_machine.grant_start_execution(trigger_lambda)

        # Add S3 event notification to invoke the Lambda function only for files in the env_name folder
        input_bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3_notifications.LambdaDestination(trigger_lambda),
            s3.NotificationKeyFilter(
                prefix=f"{env_name}/"
            ),  # Trigger only for files in env_name folder
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;9. Deploy and Verify resources in AWS&lt;br&gt;
Now it’s time to deploy everything to your AWS environment. Run the following command:&lt;br&gt;
&lt;code&gt;cdk deploy DataPipelineStack&lt;/code&gt;&lt;br&gt;
Once the deployment is complete, go to the AWS CloudFormation Console to verify that all resources have been created successfully.&lt;/p&gt;

&lt;p&gt;10. Upload data to S3 bucket&lt;br&gt;
Once the resources are deployed, upload the sample data files (customer_data.csv and transaction_data.csv) to the source S3 bucket.&lt;br&gt;
You can do this either via the AWS Console or by using the AWS CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s8e2c3w6glzdcnsqhaj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s8e2c3w6glzdcnsqhaj.webp" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;br&gt;
&lt;code&gt;aws s3 cp /path/to/source s3://bucket-name/ --recursive&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Make sure the files are placed under the correct prefix, which should match the environment name.&lt;br&gt;
Once uploaded, the Lambda function will check for all required files and trigger the Step Function if everything is in place.&lt;/p&gt;

&lt;p&gt;11. Monitor the Pipeline and Query Data with Athena&lt;br&gt;
After uploading the data, you can monitor the entire process through AWS Step Functions and check the status of each step.&lt;br&gt;
To stay informed, you can subscribe an email address or mobile phone number to the SNS topic.&lt;br&gt;
That way, you’ll receive a notification when the Step Function completes successfully or if there’s an error during execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucensfyudi80duklazo0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucensfyudi80duklazo0.webp" alt="Emal Notification" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, once the process is complete and the Glue catalog is updated, you can query the output data using Amazon Athena:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mxrktav53d01zw5f89h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mxrktav53d01zw5f89h.webp" alt="Athena" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick Notes and Thoughts&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you see folders named like *_$folder$ in your output S3 bucket, don’t worry: these are placeholders created by Hadoop when writing to a path that doesn’t yet exist.
To avoid permission errors, make sure your Glue job role has the right permissions to create folders in the target S3 location.&lt;/li&gt;
&lt;li&gt;On a side note:
I love using Python; it’s one of my favourite languages.
But when it comes to AWS CDK, I often feel that TypeScript is more “native.”
Maybe that’s because TypeScript was the first language supported by the CDK?
The documentation definitely feels more complete in TypeScript.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;TypeScript was the first language supported by the AWS CDK, and much of the AWS CDK example code is written in TypeScript. This guide includes a topic specifically to show how to adapt TypeScript AWS CDK code for use with the other supported languages. For more information, see Comparing AWS CDK in TypeScript with other languages. (&lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/languages.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/v2/guide/languages.html&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/work-with.html#work-with-cdk-compare" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/v2/guide/work-with.html#work-with-cdk-compare&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>datapipeline</category>
      <category>awscdk</category>
      <category>awsglue</category>
      <category>awscommunitybuilders</category>
    </item>
  </channel>
</rss>
