A Practical Guide to Building a Vision AI Model Serving Pipeline with AWS CDK


This article is based on a VAIaaS (Vision AI as a Service) project we developed for production deployment. I’ll share our experience building a Vision AI model serving pipeline using a combination of CDK, API Gateway, Lambda, and SageMaker.


1. Introduction

Developing a Vision AI model and deploying that model as an actual service are two entirely different challenges. Even a model that achieves high accuracy in a research environment runs into numerous engineering challenges on its way to becoming a stable, scalable API service. From infrastructure provisioning, automated model deployment, and request-pipeline design to cost optimization and operational monitoring, a unified DevOps framework is essential for managing all of these aspects systematically. When AWS is your cloud infrastructure, AWS CDK is an excellent choice for this purpose.

In this article, I’ll share an architecture and CDK examples for serving models—specifically Vision models—on the AWS cloud. This isn’t just a simple tutorial; it honestly details the problems encountered in a production environment and the process of solving them.


2. Overall Architecture of the Vision AI Service

Below is the overall architecture diagram for the tentative “VAIaaS (Vision AI as a Service)” project, designed to serve various types of Vision AI models in an end-to-end manner.

The request flow proceeds step by step:

1. A client sends image data to the POST /v1/analyze endpoint, and Amazon API Gateway receives the request.
2. The Lambda Authorizer validates the JWT token or API key, after which the Router Lambda routes the request.
3. The Pre-processing Lambda uploads the image to S3 and performs preprocessing (resizing, normalization) to match the model's input format.
4. The preprocessed data is sent to the SageMaker real-time endpoint, where inference is executed.
5. The results are converted into a client-friendly JSON format by the Post-processing Lambda and returned.
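As a rough sketch of the routing step in this flow, the Router Lambda might look like the following. This is a sketch, not the project's actual code: the S3 key layout, environment variable names, and handler shape are assumptions, and the AWS clients are injected as parameters so the logic can be unit-tested with stubs (in a real Lambda they would be module-level boto3 clients).

```python
import json
import os
import uuid


def build_input_key(request_id: str) -> str:
    """Assumed key layout: raw uploads live under raw/<request-id>.jpg."""
    return f"raw/{request_id}.jpg"


def handler(event, context, s3_client, sagemaker_client):
    """Hypothetical Router Lambda: store the image, invoke the endpoint, return JSON."""
    request_id = str(uuid.uuid4())
    key = build_input_key(request_id)

    # 1. Persist the raw image for auditing / asynchronous reprocessing
    s3_client.put_object(
        Bucket=os.environ["INPUT_BUCKET"],
        Key=key,
        Body=event["body"],
    )

    # 2. Invoke the SageMaker real-time endpoint with the image payload
    response = sagemaker_client.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT"],
        ContentType="application/x-image",
        Body=event["body"],
    )

    # 3. Wrap the inference result in a client-friendly envelope
    return {
        "statusCode": 200,
        "body": json.dumps({
            "requestId": request_id,
            "result": json.loads(response["Body"].read()),
        }),
    }
```

In the real pipeline the preprocessing and postprocessing steps sit around the `invoke_endpoint` call; they are collapsed here to keep the routing shape visible.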

CDK Stack Structure

The entire infrastructure is divided into four CDK Stacks.

| Stack | Key Resources | Role |
| --- | --- | --- |
| StorageStack | S3 (Input/Output/Model) | Storage for data and model artifacts |
| ModelStack | SageMaker Model, EndpointConfig, Endpoint | AI model hosting |
| ApiStack | API Gateway, Lambda ×3 | Request processing pipeline |
| ObservabilityStack | CloudWatch, X-Ray | Monitoring and tracing |

The reason for separating the stacks is to enable independent deployment and rollback. When updating a model, only the ModelStack needs to be redeployed, and when modifying API logic, only the ApiStack needs to be updated. This is a critical design decision in a production environment.


3. StorageStack: Configuring the Data Layer

// lib/storage-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

export class StorageStack extends cdk.Stack {
  // Expose it publicly so that it can be referenced from other stacks
  public readonly inputBucket: s3.Bucket;
  public readonly outputBucket: s3.Bucket;
  public readonly modelBucket: s3.Bucket;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Input Image Buckets: Cost Optimization with Lifecycle Policies
    this.inputBucket = new s3.Bucket(this, 'InputBucket', {
      bucketName: `vaiaas-input-${this.account}-${this.region}`,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
      lifecycleRules: [
        {
          // Processed images are automatically deleted after 7 days
          expiration: cdk.Duration.days(7),
          prefix: 'processed/',
        },
      ],
      cors: [
        {
          allowedMethods: [s3.HttpMethods.PUT, s3.HttpMethods.POST],
          allowedOrigins: ['*'],
          allowedHeaders: ['*'],
        },
      ],
    });

    // Result Bucket: Cost Optimization with Intelligent-Tiering
    this.outputBucket = new s3.Bucket(this, 'OutputBucket', {
      bucketName: `vaiaas-output-${this.account}-${this.region}`,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
      intelligentTieringConfigurations: [
        {
          name: 'EntireBucket',
          archiveAccessTierTime: cdk.Duration.days(90),
          deepArchiveAccessTierTime: cdk.Duration.days(180),
        },
      ],
    });

    // Model Artifact Bucket: Enable Version Control
    this.modelBucket = new s3.Bucket(this, 'ModelBucket', {
      bucketName: `vaiaas-models-${this.account}-${this.region}`,
      versioned: true,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });
  }
}

Practical Tip: Including ${this.account}-${this.region} in your S3 bucket name can help prevent naming conflicts during multi-region deployments. Additionally, removalPolicy: RETAIN is an important setting that ensures your data is preserved even if you accidentally delete the stack.


4. ModelStack: Configuring SageMaker Endpoints

Configuring SageMaker endpoints is one of the most complex parts of the CDK codebase. The model artifact path, container image, instance type, and autoscaling policy must all be managed through code.

// lib/model-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

interface ModelStackProps extends cdk.StackProps {
  modelBucket: s3.Bucket;
  modelVersion: string; // e.g., 'v1.2.0'
}

export class ModelStack extends cdk.Stack {
  public readonly endpointName: string;

  constructor(scope: Construct, id: string, props: ModelStackProps) {
    super(scope, id, props);

    this.endpointName = `vaiaas-endpoint-${props.modelVersion.replace(/\./g, '-')}`;

    // SageMaker execution role
    const sagemakerRole = new iam.Role(this, 'SageMakerRole', {
      assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSageMakerFullAccess'),
      ],
    });

    // Grant read permissions for the model bucket
    props.modelBucket.grantRead(sagemakerRole);

    // Define SageMaker model
    // Use PyTorch inference container (AWS Deep Learning Container)
    const model = new sagemaker.CfnModel(this, 'VisionAIModel', {
      modelName: `vaiaas-model-${props.modelVersion.replace(/\./g, '-')}`,
      executionRoleArn: sagemakerRole.roleArn,
      primaryContainer: {
        // AWS-provided PyTorch inference containers
        image: `763104351884.dkr.ecr.${this.region}.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker`,
        modelDataUrl: `s3://${props.modelBucket.bucketName}/models/${props.modelVersion}/model.tar.gz`,
        environment: {
          SAGEMAKER_PROGRAM: 'inference.py',
          SAGEMAKER_SUBMIT_DIRECTORY: '/opt/ml/model/code',
          MODEL_VERSION: props.modelVersion,
        },
      },
    });

    // Endpoint Configuration: GPU Instance + Data Capture
    const endpointConfig = new sagemaker.CfnEndpointConfig(this, 'EndpointConfig', {
      endpointConfigName: `vaiaas-config-${props.modelVersion.replace(/\./g, '-')}`,
      productionVariants: [
        {
          variantName: 'AllTraffic',
          modelName: model.attrModelName, // GetAtt reference, so CloudFormation creates the Model first
          instanceType: 'ml.g4dn.xlarge', // NVIDIA T4 GPU
          initialInstanceCount: 1,
          initialVariantWeight: 1,
        },
      ],
      // Inference Data Capture: Utilization for Model Quality Monitoring
      dataCaptureConfig: {
        enableCapture: true,
        initialSamplingPercentage: 10,
        destinationS3Uri: `s3://${props.modelBucket.bucketName}/data-capture/`,
        captureOptions: [
          { captureMode: 'Input' },
          { captureMode: 'Output' },
        ],
      },
    });

    // Real-time endpoint deployment
    const endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', {
      endpointName: this.endpointName,
      endpointConfigName: endpointConfig.attrEndpointConfigName, // GetAtt reference for correct ordering
    });

    // Auto Scaling Settings (Application Auto Scaling)
    // Scales between 1 and 4 instances; Application Auto Scaling uses its
    // service-linked role for SageMaker, so no RoleARN is specified here
    const scalingTarget = new cdk.CfnResource(this, 'ScalingTarget', {
      type: 'AWS::ApplicationAutoScaling::ScalableTarget',
      properties: {
        MaxCapacity: 4,
        MinCapacity: 1,
        ResourceId: `endpoint/${this.endpointName}/variant/AllTraffic`,
        ScalableDimension: 'sagemaker:variant:DesiredInstanceCount',
        ServiceNamespace: 'sagemaker',
      },
    });
    scalingTarget.addDependency(endpoint);

    // A scalable target alone does nothing: attach a target-tracking policy
    // that keeps invocations per instance near the chosen target value
    new cdk.CfnResource(this, 'ScalingPolicy', {
      type: 'AWS::ApplicationAutoScaling::ScalingPolicy',
      properties: {
        PolicyName: 'InvocationsTargetTracking',
        PolicyType: 'TargetTrackingScaling',
        ScalingTargetId: scalingTarget.ref,
        TargetTrackingScalingPolicyConfiguration: {
          TargetValue: 100, // tune this from load-test results
          PredefinedMetricSpecification: {
            PredefinedMetricType: 'SageMakerVariantInvocationsPerInstance',
          },
        },
      },
    });

    new cdk.CfnOutput(this, 'EndpointNameOutput', {
      value: this.endpointName,
      exportName: 'VaiaasSageMakerEndpointName',
    });
  }
}

Practical Tip: ml.g4dn.xlarge is an instance equipped with an NVIDIA T4 GPU, offering the best price-performance ratio for Vision AI inference. We initially used ml.g4dn.2xlarge, but after conducting actual load tests, we confirmed that ml.g4dn.xlarge was sufficient and switched to it. Be sure to measure your actual model’s memory usage and inference time before selecting an instance type.
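To make that comparison concrete, here is the kind of back-of-the-envelope arithmetic involved. The helper below is a sketch; the latency figures in the test are hypothetical, and only the roughly $0.736/hour on-demand rate for ml.g4dn.xlarge reflects what we actually paid.

```python
def cost_per_1k_inferences(hourly_price_usd: float, avg_latency_ms: float) -> float:
    """Rough cost of 1,000 sequential inferences on a single instance.

    Assumes one request at a time (no batching or concurrency), so treat
    the result as an upper bound useful only for comparing instance types.
    """
    seconds_per_inference = avg_latency_ms / 1000.0
    # price per second, times seconds needed for 1,000 inferences
    return hourly_price_usd / 3600.0 * seconds_per_inference * 1000.0
```

With hypothetical numbers, an instance costing twice as much would need to cut latency by more than half to win on cost per inference, which is the shape of the trade-off that led us back to ml.g4dn.xlarge.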


5. ApiStack: Request Processing Pipeline

The API layer consists of three Lambda functions. Each Lambda function is designed according to the Single Responsibility Principle (SRP), allowing them to be deployed and tested independently.

// lib/api-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

interface ApiStackProps extends cdk.StackProps {
  inputBucket: s3.Bucket;
  outputBucket: s3.Bucket;
  sagemakerEndpointName: string;
}

export class ApiStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: ApiStackProps) {
    super(scope, id, props);

    // Lambda Common Layer (common utilities, latest version of boto3, etc.)
    const commonLayer = new lambda.LayerVersion(this, 'CommonLayer', {
      code: lambda.Code.fromAsset('lambda/layers/common'),
      compatibleRuntimes: [lambda.Runtime.PYTHON_3_11],
      description: 'Common utilities and dependencies',
    });

    // 1. Lambda Authorizer: JWT Validation
    const authorizerFn = new lambda.Function(this, 'AuthorizerFunction', {
      runtime: lambda.Runtime.PYTHON_3_11,
      code: lambda.Code.fromAsset('lambda/authorizer'),
      handler: 'index.handler',
      environment: {
        JWT_SECRET_ARN: 'arn:aws:secretsmanager:...',
      },
      timeout: cdk.Duration.seconds(5),
    });

    const authorizer = new apigateway.TokenAuthorizer(this, 'JwtAuthorizer', {
      handler: authorizerFn,
      resultsCacheTtl: cdk.Duration.minutes(5), // Reducing latency through authentication result caching
    });

    // 2. Router Lambda: Request Routing and S3 Uploads
    const routerFn = new lambda.Function(this, 'RouterFunction', {
      runtime: lambda.Runtime.PYTHON_3_11,
      code: lambda.Code.fromAsset('lambda/router'),
      handler: 'index.handler',
      layers: [commonLayer],
      environment: {
        INPUT_BUCKET: props.inputBucket.bucketName,
        OUTPUT_BUCKET: props.outputBucket.bucketName,
        SAGEMAKER_ENDPOINT: props.sagemakerEndpointName,
      },
      timeout: cdk.Duration.seconds(30),
      memorySize: 512,
      tracing: lambda.Tracing.ACTIVE, // Enable X-Ray Tracking
    });

    // Granting S3 and SageMaker permissions
    props.inputBucket.grantReadWrite(routerFn);
    props.outputBucket.grantWrite(routerFn);
    routerFn.addToRolePolicy(new iam.PolicyStatement({
      actions: ['sagemaker:InvokeEndpoint'],
      resources: [`arn:aws:sagemaker:${this.region}:${this.account}:endpoint/${props.sagemakerEndpointName}`],
    }));

    // 3. API Gateway Configuration
    const api = new apigateway.RestApi(this, 'VaiaaasApi', {
      restApiName: 'VAIaaS API',
      description: 'Vision AI as a Service REST API',
      deployOptions: {
        stageName: 'v1',
        tracingEnabled: true,
        metricsEnabled: true,
        loggingLevel: apigateway.MethodLoggingLevel.INFO,
        // Throttling: 100 requests per second, with a burst of 200 requests
        throttlingRateLimit: 100,
        throttlingBurstLimit: 200,
      },
      // CORS Configuration
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: ['POST', 'OPTIONS'],
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    // POST /v1/analyze endpoint
    const analyzeResource = api.root.addResource('analyze');
    analyzeResource.addMethod(
      'POST',
      new apigateway.LambdaIntegration(routerFn, {
        timeout: cdk.Duration.seconds(29), // The maximum timeout for the API Gateway is 29 seconds
      }),
      {
        authorizer,
        authorizationType: apigateway.AuthorizationType.CUSTOM,
        // Request validation: Content-Type header required
        requestValidator: new apigateway.RequestValidator(this, 'RequestValidator', {
          restApi: api,
          validateRequestBody: true,
          validateRequestParameters: true,
        }),
      }
    );

    new cdk.CfnOutput(this, 'ApiUrl', {
      value: api.url,
      description: 'VAIaaS API Gateway URL',
    });
  }
}
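The TokenAuthorizer above delegates validation to a Python handler. A minimal HS256 signature check using only the standard library might look like the following; this is a sketch, and a production authorizer should also verify claims such as `exp`, `iss`, and `aud`, and fetch the secret from Secrets Manager rather than hard-coding it.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    # JWT segments use unpadded base64url; restore the padding before decoding
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the JWT claims if the HS256 signature is valid, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        # Not a three-part token
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison to avoid timing side channels
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        return None
    return json.loads(b64url_decode(payload_b64))
```

On success the authorizer handler would translate the claims into an IAM policy document for API Gateway; on failure it denies the request, and the 5-minute result cache configured above keeps repeated validations off the hot path.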

6. Issues Encountered in Practice

Issue 1: SageMaker Endpoint Cold Start

SageMaker real-time endpoints do not have cold start issues because the instances are always running. However, during the initial deployment, it takes 5–15 minutes for the endpoint to transition from the Creating state to the InService state. This wait time is included during CDK deployment, which extends the total deployment time.

As a solution, we stopped blocking the deployment on the endpoint reaching InService and adopted a method of polling the endpoint status from a separate step in the deployment pipeline.
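The polling step can be as simple as a loop around `describe_endpoint`. In this sketch the describe call is injected so the loop can be tested without AWS; in the pipeline it would be `boto3.client("sagemaker").describe_endpoint`.

```python
import time


def wait_for_in_service(describe_endpoint, endpoint_name: str,
                        poll_seconds: int = 30, max_attempts: int = 40) -> str:
    """Poll until the endpoint leaves Creating/Updating; return the final status.

    `describe_endpoint` mirrors the boto3 call of the same name and must
    return a dict containing an "EndpointStatus" key.
    """
    for _ in range(max_attempts):
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status not in ("Creating", "Updating"):
            return status  # InService, Failed, ...
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} did not stabilize in time")
```

The pipeline treats any terminal status other than InService as a failed deployment, which keeps a broken endpoint from silently receiving traffic.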

Issue 2: Timeout When Calling SageMaker from Lambda

The maximum integration timeout for API Gateway is 29 seconds. In some cases, Vision AI inference for high-resolution images exceeded this limit.

We implemented two parallel approaches as a solution. First, we resized images to the model’s input size (e.g., 640×640) in a pre-processing Lambda function to reduce inference time. Second, we introduced an asynchronous processing pattern for requests with long processing times. The client receives a request ID immediately and later polls for the results, which are processed asynchronously via SQS.
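The asynchronous pattern boils down to three operations: submit (returns a request ID immediately), complete (the worker writes the result), and poll. Below is a toy in-memory model of that contract, with a dict standing in for S3/DynamoDB and the SQS-driven worker reduced to a method call; it shows the state machine, not the real infrastructure.

```python
import uuid


class AsyncAnalyzeApi:
    """Toy model of the async request flow: submit -> worker -> poll."""

    def __init__(self):
        # Real system: job state in DynamoDB or S3, keyed by request ID
        self.jobs = {}

    def submit(self, image_bytes: bytes) -> str:
        """Accept an image and return a request ID without waiting for inference."""
        request_id = str(uuid.uuid4())
        self.jobs[request_id] = {"status": "PENDING", "result": None}
        # Real system: upload image_bytes to S3 and enqueue the job on SQS here
        return request_id

    def worker_complete(self, request_id: str, result: dict) -> None:
        """Called by the SQS consumer once SageMaker inference finishes."""
        self.jobs[request_id] = {"status": "DONE", "result": result}

    def poll(self, request_id: str) -> dict:
        """Client-facing status check for a previously submitted request."""
        return self.jobs.get(request_id, {"status": "NOT_FOUND", "result": None})
```

The client-visible contract is the important part: a 202-style immediate response with a request ID, then polling until the status flips to DONE.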

Issue 3: SageMaker Endpoint Cost Optimization

The ml.g4dn.xlarge instance costs approximately $0.736 per hour. Running it 24 hours a day results in a monthly cost of about $530. The issue was that costs were incurred even during nighttime hours when there was no traffic.

We considered SageMaker Serverless Inference as a solution, but since it does not support GPUs, we had to switch to CPU inference. Ultimately, we analyzed traffic patterns and implemented a scheduling strategy that sets the MinCapacity for autoscaling to 0 during nighttime hours. This reduced monthly costs by approximately 40%.
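The scheduling decision itself is a small piece of logic. The sketch below shows the shape of it; the night window is illustrative, and whether your endpoint type actually permits a minimum of zero instances is something to verify against your SageMaker configuration before relying on it.

```python
def min_capacity_for_hour(hour_utc: int, night_start: int = 22, night_end: int = 7) -> int:
    """Choose the autoscaling MinCapacity for a given UTC hour.

    Hours in [night_start, 24) or [0, night_end) count as night and get
    a floor of 0; everything else keeps at least one warm instance.
    """
    in_night = hour_utc >= night_start or hour_utc < night_end
    return 0 if in_night else 1


def monthly_cost(hourly_price: float, hours_per_day: float, days: int = 30) -> float:
    """Straight-line monthly cost estimate for one always-on instance."""
    return hourly_price * hours_per_day * days
```

Plugging in the $0.736/hour rate: 24 hours a day comes to about $530/month, while serving only ~15 daytime hours lands near $331, in the ballpark of the ~40% saving mentioned above.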


7. Deployment and Operations

CDK Deployment Command

# Deploy the entire stack (for the initial deployment)
cdk bootstrap aws://ACCOUNT_ID/REGION
cdk deploy --all

# Deploy a specific stack (when updating a model)
# modelVersion is read as a CDK context value, not a CloudFormation parameter
cdk deploy ModelStack --context modelVersion=v1.3.0

# Review changes before deployment
cdk diff ApiStack

Model Update Workflow

When deploying a new model version, use the Blue/Green deployment strategy. When updating an endpoint, SageMaker prepares new instances while keeping the existing ones running, and switches traffic once the new instances are ready. In the CDK, you can implement this by creating a new EndpointConfig and updating the endpointConfigName of the Endpoint.
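In script form, the switch is a single `update_endpoint` call pointing at the newly created config. The helper below mirrors the dot-to-dash naming rule used in the ModelStack; the client is injected for testability, and would normally be `boto3.client("sagemaker")`.

```python
def config_name_for_version(model_version: str) -> str:
    """Mirror the CDK naming rule: dots become dashes, since SageMaker
    resource names only allow alphanumerics and hyphens."""
    return "vaiaas-config-" + model_version.replace(".", "-")


def update_endpoint_blue_green(sm_client, endpoint_name: str, new_version: str) -> str:
    """Point an existing endpoint at the EndpointConfig for new_version.

    SageMaker provisions the new variant's instances first and only then
    shifts traffic, so the old model keeps serving during the rollout.
    """
    new_config = config_name_for_version(new_version)
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config,
    )
    return new_config
```

After the call, the same `describe_endpoint` polling used at deploy time tells you when the endpoint has finished Updating and the new version is live.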

Monitoring Dashboard

Monitor the following metrics on the CloudWatch dashboard.

| Metric | Threshold | Alarm Condition |
| --- | --- | --- |
| SageMaker ModelLatency | 5,000 ms | P99 exceeds threshold |
| SageMaker Invocations | - | Track calls per minute |
| Lambda Duration | 25,000 ms | P95 exceeds threshold |
| API Gateway 5XXError | 1% | Error rate exceeds threshold |
| SageMaker CPUUtilization | 80% | Auto-scaling trigger |

Note that CloudWatch reports ModelLatency in microseconds, so a 5,000 ms threshold is written as 5,000,000 in the alarm definition.

Conclusion

While building a pipeline for Vision AI serving using AWS CDK, the most significant insight I gained was that managing infrastructure as code is not merely a matter of convenience, but a matter of service quality. Thanks to CDK, model updates, infrastructure changes, and environment replication all went through a code review process, which greatly improved operational stability.

In the next post, I plan to cover how to integrate SageMaker Model Monitor into this pipeline to detect model drift in a production environment.

I hope this post is helpful to those looking to build services using Vision AI models. Please leave any questions or feedback in the comments!

