<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yeonggyoo Jeon</title>
    <description>The latest articles on DEV Community by Yeonggyoo Jeon (@loganjeon).</description>
    <link>https://dev.to/loganjeon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F720167%2F8d98da2c-c0f0-4356-b8b0-bb0f55f75a43.jpeg</url>
      <title>DEV Community: Yeonggyoo Jeon</title>
      <link>https://dev.to/loganjeon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/loganjeon"/>
    <language>en</language>
    <item>
      <title>The Struggle to Optimize the Performance of the NVIDIA Triton Inference Server Running on AWS ECS</title>
      <dc:creator>Yeonggyoo Jeon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 15:19:35 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-struggle-to-optimize-the-performance-of-the-nvidia-triton-inference-server-running-on-aws-ecs-42i3</link>
      <guid>https://dev.to/aws-builders/the-struggle-to-optimize-the-performance-of-the-nvidia-triton-inference-server-running-on-aws-ecs-42i3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Why is it so slow even though I have a GPU?”&lt;/strong&gt; I’d like to share my three-week struggle, which began with this single question.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;While developing the Vision AI service, I chose &lt;strong&gt;Nvidia Triton Inference Server&lt;/strong&gt; as the framework for model serving. Its features—such as multi-framework support, dynamic batching, and ensemble pipelines—were excellent, and I was particularly drawn to its ability to fully leverage NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;For the deployment environment, I chose &lt;strong&gt;AWS ECS&lt;/strong&gt; over SageMaker. I was already familiar with ECS from previous experience, and there was a requirement to expose Triton’s gRPC endpoints directly. However, once we actually deployed Triton on ECS, we encountered some unexpected issues.&lt;/p&gt;

&lt;p&gt;This post documents the &lt;strong&gt;three main issues we faced and how we resolved them&lt;/strong&gt; during that process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment Environment Overview
&lt;/h2&gt;

&lt;p&gt;First, here is a brief overview of the overall architecture.&lt;/p&gt;

&lt;p&gt;[Triton on ECS Architecture]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpn2fxj59lpw14f4ulw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpn2fxj59lpw14f4ulw7.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main components are as follows. We deploy the Triton container as an ECS service on an ECS cluster composed of GPU instances (&lt;code&gt;g4dn.xlarge&lt;/code&gt;, NVIDIA T4). The model files are stored in S3 and loaded from S3 when Triton starts. We route HTTP (&lt;code&gt;:8000&lt;/code&gt;) and gRPC (&lt;code&gt;:8001&lt;/code&gt;) traffic through ALB and monitor GPU metrics using Prometheus and Grafana.&lt;/p&gt;

&lt;p&gt;The key settings for the ECS Task Definition are as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“containerDefinitions”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;“name”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“triton-server”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;‘image’:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“nvcr.io/nvidia/tritonserver:&lt;/span&gt;&lt;span class="mf"&gt;23.10&lt;/span&gt;&lt;span class="err"&gt;-py&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“command”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;‘tritonserver’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--model-repository=s&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;://my-bucket/model_repository”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--allow-grpc=&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--grpc-port=&lt;/span&gt;&lt;span class="mi"&gt;8001&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“--allow-http=&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--http-port=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--allow-metrics=&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“--metrics-port=&lt;/span&gt;&lt;span class="mi"&gt;8002&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;“portMappings”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“containerPort”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“protocol”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“tcp”&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“containerPort”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;‘protocol’:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“tcp”&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“containerPort”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“protocol”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“tcp”&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;“resourceRequirements”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;“type”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“GPU”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;‘value’:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;“logConfiguration”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;“logDriver”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“awslogs”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;‘options’:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;“awslogs-group”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“/ecs/triton-server”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;“awslogs-region”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“ap-northeast&lt;/span&gt;&lt;span class="mi"&gt;-2&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“awslogs-stream-prefix”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“triton”&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
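&lt;p&gt;Before registering the task definition, it can save a debugging round trip to confirm that the file is valid JSON and actually requests a GPU; curly quotes introduced by copy-pasting are a common cause of registration failures. A minimal sketch (the embedded JSON is an abbreviated version of the definition above; in practice you would load the file passed to &lt;code&gt;aws ecs register-task-definition&lt;/code&gt;):&lt;/p&gt;

```python
import json

# Abbreviated copy of the task definition above, embedded for illustration;
# in practice, read the file you pass to `aws ecs register-task-definition`.
raw = """
{
  "containerDefinitions": [
    {
      "name": "triton-server",
      "image": "nvcr.io/nvidia/tritonserver:23.10-py3",
      "resourceRequirements": [{ "type": "GPU", "value": "1" }]
    }
  ]
}
"""

task_def = json.loads(raw)  # fails loudly on curly quotes or trailing commas
container = task_def["containerDefinitions"][0]

# Without an explicit GPU requirement, ECS schedules the task with no GPU at all
assert any(r["type"] == "GPU" for r in container["resourceRequirements"])
print("valid task definition for:", container["name"])
```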






&lt;h2&gt;
  
  
  Issue 1: GPU Not Detected
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Symptoms
&lt;/h3&gt;

&lt;p&gt;The ECS task appeared to be running normally, but the following warning continued to appear in the Triton logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;W0115 09:23:41.123456 1 backend_manager.cc:295]

Unable to load backend ‘tensorrt’: 
  failed to load library /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so: 
  libcuda.so.1: cannot open shared object file: No such file or directory
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model had loaded, but &lt;strong&gt;inference was running on the CPU&lt;/strong&gt; instead of the GPU. When I ran &lt;code&gt;nvidia-smi&lt;/code&gt; inside the container, there was no output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;The issue lay in the instance configuration of the ECS cluster. Although we were using a GPU instance (&lt;code&gt;g4dn.xlarge&lt;/code&gt;), we were using a standard ECS-Optimized AMI instead of an &lt;strong&gt;ECS-Optimized GPU AMI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To use GPUs in ECS, both of the following conditions must be met.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ECS-Optimized GPU AMI&lt;/td&gt;
&lt;td&gt;An AMI with NVIDIA drivers and &lt;code&gt;nvidia-container-toolkit&lt;/code&gt; pre-installed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;resourceRequirements&lt;/code&gt; in Task Definition&lt;/td&gt;
&lt;td&gt;GPU resources must be explicitly requested for ECS to allocate a GPU to the container&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since the standard ECS AMI does not have NVIDIA drivers installed, the container was unable to recognize the GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolution
&lt;/h3&gt;

&lt;p&gt;In the AWS Console, I replaced the Auto Scaling Group AMI for the ECS cluster with &lt;strong&gt;&lt;code&gt;ami-xxxxxxxx&lt;/code&gt; (ECS-Optimized GPU AMI)&lt;/strong&gt;. Here’s how to check the latest GPU AMI ID using the AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the latest ECS-Optimized GPU AMI ID for the current region&lt;/span&gt;
aws ssm get-parameters &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--names&lt;/span&gt; /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; “Parameters[0].Value” &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After replacing the AMI and restarting the instance, &lt;code&gt;nvidia-smi&lt;/code&gt; recognized the GPU normally, and Triton successfully loaded the TensorRT backend.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; When using GPUs in ECS, you must use an &lt;strong&gt;ECS-Optimized GPU AMI&lt;/strong&gt;. While it is possible to manually install NVIDIA drivers on a standard ECS AMI, this is not recommended because the drivers may be reset during AMI updates.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Issue 2: Throughput is Only One-Third of Expectations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Symptoms
&lt;/h3&gt;

&lt;p&gt;After resolving the GPU recognition issue, I ran a load test. The results measured by &lt;code&gt;perf_analyzer&lt;/code&gt; were shocking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# perf_analyzer execution results (Dynamic Batching OFF)
&lt;/span&gt;&lt;span class="py"&gt;Concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8&lt;/span&gt;
  &lt;span class="py"&gt;Throughput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;22.3 infer/sec&lt;/span&gt;
  &lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;178ms&lt;/span&gt;

&lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;312ms&lt;/span&gt;
  &lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p99&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;445ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inference throughput of 22 requests per second was only one-third of the expected value (~70 req/s). Upon checking the GPU utilization, it was &lt;strong&gt;only 18% on average&lt;/strong&gt;. The GPU was sitting idle despite being available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;The problem was that &lt;strong&gt;Dynamic Batching was disabled&lt;/strong&gt;. Since each inference request was being sent to the GPU individually, we were not utilizing the GPU’s parallel processing capabilities at all.&lt;/p&gt;

&lt;p&gt;GPUs are specialized for parallel matrix operations. The time difference between running inference with &lt;code&gt;batch=1&lt;/code&gt; and &lt;code&gt;batch=8&lt;/code&gt; is not as significant as one might think. In other words, by grouping multiple requests and processing them at once, we can simultaneously increase both GPU utilization and throughput.&lt;/p&gt;
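&lt;p&gt;The intuition can be sketched with a toy cost model: a forward pass costs a roughly fixed launch overhead plus a small per-item cost, so the overhead is amortized across the batch. The numbers below are illustrative assumptions, not measurements from our service:&lt;/p&gt;

```python
# Toy cost model for batched GPU inference (assumed numbers, not measurements):
# one forward pass costs a fixed overhead plus a small per-item cost.
FIXED_OVERHEAD_MS = 30.0  # kernel launches, host-device transfers, etc.
PER_ITEM_MS = 2.0         # marginal cost of one extra image in the batch

def throughput(batch_size: int) -> float:
    """Inferences per second when the GPU always runs full batches."""
    batch_time_ms = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    return batch_size / (batch_time_ms / 1000.0)

for b in (1, 4, 8):
    print(f"batch={b}: {throughput(b):6.1f} infer/sec")
```

&lt;p&gt;Under this model a batch of 8 takes 46 ms instead of 8 separate 32 ms passes, which is exactly the gap Dynamic Batching exploits.&lt;/p&gt;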

&lt;p&gt;[Comparison of Dynamic Batching Effects]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z4rok9su9khuqlzuw38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z4rok9su9khuqlzuw38.png" width="800" height="1232"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;We added the Dynamic Batching configuration to the model’s &lt;code&gt;config.pbtxt&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;model_repository&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vision_model&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;config.pbtxt&lt;/span&gt;

&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;vision_model&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;onnxruntime&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;
&lt;span class="n"&gt;max_batch_size&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;

&lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;input_image&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;

&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TYPE_FP32&lt;/span&gt;
    &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;output_detections&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
    &lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TYPE_FP32&lt;/span&gt;
    &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Enable&lt;/span&gt; &lt;span class="n"&gt;Dynamic&lt;/span&gt; &lt;span class="n"&gt;Batching&lt;/span&gt;
&lt;span class="n"&gt;dynamic_batching&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;preferred_batch_size&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;max_queue_delay_microseconds&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Execute&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;within&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;limits&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="n"&gt;GPU&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;instance_group&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KIND_GPU&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_queue_delay_microseconds: 5000&lt;/code&gt; means the system waits up to 5ms to fill a batch. If this value is too large, latency increases; if it is too small, batch efficiency decreases. For our service, 5ms was the optimal balance between throughput and latency.&lt;/p&gt;
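&lt;p&gt;The worst case of that trade-off is easy to bound: a request arriving just after a batch was flushed waits in the queue until either the preferred batch fills or the delay expires, whichever comes first. A small sketch (the arrival intervals are assumed examples):&lt;/p&gt;

```python
MAX_QUEUE_DELAY_MS = 5.0  # max_queue_delay_microseconds: 5000

def worst_case_added_latency_ms(arrival_interval_ms: float, preferred_batch: int) -> float:
    """Upper bound on extra queueing latency introduced by dynamic batching.

    With one request every `arrival_interval_ms`, the batch fills after
    (preferred_batch - 1) further arrivals, but Triton flushes it no later
    than the configured queue delay, whichever happens first.
    """
    time_to_fill_ms = (preferred_batch - 1) * arrival_interval_ms
    return min(time_to_fill_ms, MAX_QUEUE_DELAY_MS)

# Light load: the 5 ms cap, not the batch size, bounds the wait
print(worst_case_added_latency_ms(13.0, 8))  # 5.0
# Heavy load: the batch fills before the delay expires
print(worst_case_added_latency_ms(0.5, 8))   # 3.5
```

&lt;p&gt;In other words, enabling batching here adds at most 5 ms of queueing latency per request.&lt;/p&gt;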

&lt;p&gt;The results after changing the settings are as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# perf_analyzer execution results (Dynamic Batching ON)
&lt;/span&gt;&lt;span class="py"&gt;Concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8&lt;/span&gt;
  &lt;span class="py"&gt;Throughput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;76.1 infer/sec (+241%)&lt;/span&gt;
  &lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;98ms (-45%)&lt;/span&gt;
  &lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;187ms (-40%)&lt;/span&gt;

&lt;span class="err"&gt;Latency&lt;/span&gt; &lt;span class="py"&gt;p99&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;234ms (-47%)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throughput improved by roughly &lt;strong&gt;3.4 times&lt;/strong&gt;, from 22 req/s to 76 req/s, and GPU utilization also increased from 18% to 72%.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Enable Dynamic Batching by default when deploying Triton for the first time. Set &lt;code&gt;preferred_batch_size&lt;/code&gt; based on the model’s &lt;code&gt;max_batch_size&lt;/code&gt; and actual traffic patterns, and adjust &lt;code&gt;max_queue_delay_microseconds&lt;/code&gt; to meet your service’s latency SLA.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Issue 3: ECS tasks periodically crash due to OOM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Symptoms
&lt;/h3&gt;

&lt;p&gt;After enabling Dynamic Batching and running it for a few days, we observed that ECS tasks were restarting periodically. Checking the CloudWatch logs revealed that the cause was &lt;strong&gt;OOMKilled (Out of Memory)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CloudWatch Logs
[ERROR] Container 'triton-server' failed with exit code 137 (OOMKilled)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What was strange was that it was not GPU memory but &lt;strong&gt;CPU memory (RAM)&lt;/strong&gt; that was running low. The &lt;code&gt;g4dn.xlarge&lt;/code&gt; instance provides 16GB of RAM, but the Triton container was using over 12GB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;The issue lay in how Triton’s &lt;strong&gt;CUDA Unified Memory&lt;/strong&gt; operates. By default, Triton uses CPU memory as a fallback when GPU memory is insufficient. It also caches model weights in CPU memory during model loading.&lt;/p&gt;

&lt;p&gt;In the case of our model, the ONNX file size was approximately 800MB, and Triton was excessively using CPU memory by copying it multiple times internally. In particular, enabling Dynamic Batching resulted in additional intermediate buffers being allocated for batch processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;We resolved the issue by combining three approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Adjusting Memory Limits in the ECS Task Definition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We explicitly set the soft limit (&lt;code&gt;memoryReservation&lt;/code&gt;) and hard limit (&lt;code&gt;memory&lt;/code&gt;) for the ECS Task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“name”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“triton-server”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“memory”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14336&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“memoryReservation”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10240&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“resourceRequirements”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“type”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“GPU”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;‘value’:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
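&lt;p&gt;As a sanity check on those numbers: the hard limit of 14336 MiB deliberately leaves about 2 GiB of the instance's 16 GiB of RAM for the OS and the ECS agent (in practice slightly less is usable, since the kernel reserves some memory):&lt;/p&gt;

```python
# Memory budget for the Triton container on a g4dn.xlarge (16 GiB RAM), in MiB
instance_ram_mib = 16 * 1024  # 16384
hard_limit_mib = 14336        # "memory": container is OOM-killed above this
soft_limit_mib = 10240        # "memoryReservation": used for task placement

headroom_mib = instance_ram_mib - hard_limit_mib
assert instance_ram_mib >= hard_limit_mib > soft_limit_mib
print(f"headroom for OS and ECS agent: {headroom_mib} MiB")  # 2048 MiB
```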



&lt;p&gt;&lt;strong&gt;2. Limiting Triton’s GPU Memory Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We explicitly specified the GPU memory pool size using the &lt;code&gt;--cuda-memory-pool-byte-size&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tritonserver &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3://my-bucket/model_repository &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cuda-memory-pool-byte-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0:3221225472 &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# Allocate 3GB to GPU 0&lt;/span&gt;

&lt;span class="nt"&gt;--pinned-memory-pool-byte-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1073741824 &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# 1GB of pinned memory&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-grpc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
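&lt;p&gt;Those flag values are raw byte counts, so it is worth computing them explicitly rather than eyeballing ten-digit numbers:&lt;/p&gt;

```python
GIB = 1024 ** 3  # one gibibyte in bytes

cuda_pool_bytes = 3 * GIB    # value for --cuda-memory-pool-byte-size=0:...
pinned_pool_bytes = 1 * GIB  # value for --pinned-memory-pool-byte-size=...

assert cuda_pool_bytes == 3221225472
assert pinned_pool_bytes == 1073741824
print(f"--cuda-memory-pool-byte-size=0:{cuda_pool_bytes}")
print(f"--pinned-memory-pool-byte-size={pinned_pool_bytes}")
```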



&lt;p&gt;&lt;strong&gt;3. Add memory optimization settings to the model's config.pbtxt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;model_repository&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vision_model&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;config.pbtxt&lt;/span&gt;
&lt;span class="n"&gt;optimization&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;cuda&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;graphs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Reduce&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt; &lt;span class="n"&gt;overhead&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt; &lt;span class="n"&gt;Graphs&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Disable&lt;/span&gt; &lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt; &lt;span class="n"&gt;GPU&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="n"&gt;is&lt;/span&gt; &lt;span class="n"&gt;insufficient&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt; &lt;span class="n"&gt;explicit&lt;/span&gt; &lt;span class="n"&gt;failure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;instance_group&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KIND_GPU&lt;/span&gt;
    &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After applying these three measures, memory usage stabilized, and restarts caused by OOM completely disappeared.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; When deploying Triton on ECS, you must &lt;strong&gt;monitor CPU memory usage&lt;/strong&gt; as well as GPU memory. It is particularly important to explicitly control memory usage with the &lt;code&gt;--cuda-memory-pool-byte-size&lt;/code&gt; and &lt;code&gt;--pinned-memory-pool-byte-size&lt;/code&gt; options when serving large models.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Final Performance Comparison
&lt;/h2&gt;

&lt;p&gt;The final performance results after resolving all three issues are summarized below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Initial (Issue State)&lt;/th&gt;
&lt;th&gt;Final (After Optimization)&lt;/th&gt;
&lt;th&gt;Improvement Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput (req/s)&lt;/td&gt;
&lt;td&gt;22.3&lt;/td&gt;
&lt;td&gt;76.1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+241%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;178ms&lt;/td&gt;
&lt;td&gt;98ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-45%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency&lt;/td&gt;
&lt;td&gt;445ms&lt;/td&gt;
&lt;td&gt;234ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-47%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU Utilization&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+300%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OOM Restarts&lt;/td&gt;
&lt;td&gt;2–3 times per week&lt;/td&gt;
&lt;td&gt;0 times&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fully Resolved&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Triton Inference Server is a powerful tool, but using it effectively in an ECS environment means avoiding a few pitfalls. The three issues covered in this article (failed GPU detection, dynamic batching left disabled, and CPU-memory OOM) are all mentioned in the official documentation, yet they are easy to overlook until you actually run into them.&lt;/p&gt;

&lt;p&gt;In particular, &lt;strong&gt;Dynamic Batching is one of Triton’s most powerful features&lt;/strong&gt;, so it’s unfortunate that it is disabled by default. If you are deploying Triton for the first time, please be sure to check this setting.&lt;/p&gt;
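&lt;p&gt;For reference, enabling it is a small addition to &lt;code&gt;config.pbtxt&lt;/code&gt;; the batch sizes and queue delay below are illustrative values, not the ones used in this article:&lt;/p&gt;

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]       # illustrative; tune per model
  max_queue_delay_microseconds: 100    # illustrative latency-vs-batching trade-off
}
```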

&lt;p&gt;In the next post, we will cover how to use Triton’s &lt;strong&gt;Model Ensemble&lt;/strong&gt; to handle the preprocessing-inference-postprocessing pipeline on the server side.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/optimization.html" rel="noopener noreferrer"&gt;NVIDIA Triton Inference Server — Optimization Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/batcher.html" rel="noopener noreferrer"&gt;NVIDIA Triton Inference Server — Dynamic Batching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/running-gpu-based-container-applications-with-amazon-ecs-anywhere/" rel="noopener noreferrer"&gt;Running GPU-based container applications with Amazon ECS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/managed-instances-gpu.html" rel="noopener noreferrer"&gt;Use GPUs with Amazon ECS Managed Instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/debugging_guide.html" rel="noopener noreferrer"&gt;NVIDIA Triton Inference Server — Debugging Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>tritoninferenceserver</category>
      <category>ecs</category>
      <category>serving</category>
    </item>
    <item>
      <title>A Practical Guide to Building a Vision AI Model Serving Pipeline with AWS CDK</title>
      <dc:creator>Yeonggyoo Jeon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:50:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-practical-guide-to-building-a-vision-ai-model-serving-pipeline-with-aws-cdk-31i</link>
      <guid>https://dev.to/aws-builders/a-practical-guide-to-building-a-vision-ai-model-serving-pipeline-with-aws-cdk-31i</guid>
      <description>&lt;p&gt;&lt;strong&gt;This article is based on a VAIaaS (Vision AI as a Service) project we developed for production deployment.&lt;/strong&gt; I’ll share our experience building a Vision AI model serving pipeline using a combination of CDK, API Gateway, Lambda, and SageMaker.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Developing a Vision AI model and deploying that model as an actual service are two entirely different challenges. Even a model that achieves high accuracy in a research environment faces numerous engineering challenges when transitioning it into a stable and scalable API service. From infrastructure provisioning, automated model deployment, and request processing pipeline design to cost optimization and operational monitoring—a unified DevOps framework is essential to manage all of these aspects systematically. When using AWS as your cloud infrastructure, &lt;strong&gt;AWS CDK&lt;/strong&gt; is an excellent choice for this purpose.&lt;/p&gt;

&lt;p&gt;In this article, I’ll share an architecture and CDK examples for serving models—specifically Vision models—on the AWS cloud. This isn’t just a simple tutorial; it honestly details the problems encountered in a production environment and the process of solving them.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Overall Architecture of the Vision AI Service
&lt;/h2&gt;

&lt;p&gt;Below is the overall architecture diagram for the tentatively named “VAIaaS (Vision AI as a Service)” project, designed to serve various types of Vision AI models end to end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftci3d9ewyip80tvi7f51.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftci3d9ewyip80tvi7f51.jpg" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The request flow is described step-by-step as follows.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A client sends image data to the &lt;code&gt;POST /v1/analyze&lt;/code&gt; endpoint, and Amazon API Gateway receives the request.&lt;/li&gt;
&lt;li&gt;The Lambda Authorizer validates the JWT token or API key, after which the Router Lambda routes the request.&lt;/li&gt;
&lt;li&gt;The Pre-processing Lambda uploads the image to S3 and preprocesses it (resizing, normalization) to match the model’s input format.&lt;/li&gt;
&lt;li&gt;The preprocessed data is sent to the SageMaker real-time endpoint, where inference is executed.&lt;/li&gt;
&lt;li&gt;The Post-processing Lambda converts the results into a client-friendly JSON format and returns them.&lt;/li&gt;
&lt;/ol&gt;
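&lt;p&gt;The flow can be sketched as a chain of stages. The function names and data shapes below are hypothetical simplifications of the actual Lambdas, and the SageMaker invocation is injected as a callback so the pipeline logic can be exercised in isolation:&lt;/p&gt;

```typescript
// Minimal sketch of the analyze pipeline: preprocess, infer, postprocess.
// "infer" stands in for the SageMaker real-time endpoint invocation.
interface AnalyzeResult {
  label: string;
  confidence: number;
}

function preprocess(imageBytes: Uint8Array): Uint8Array {
  // Real code would resize and normalize; here the bytes pass through.
  return imageBytes;
}

function postprocess(rawScores: number[]): AnalyzeResult {
  const labels = ['cat', 'dog', 'other']; // placeholder class names
  const best = rawScores.indexOf(Math.max(...rawScores));
  return { label: labels[best], confidence: rawScores[best] };
}

function analyze(
  imageBytes: Uint8Array,
  infer: (input: Uint8Array) => number[],
): AnalyzeResult {
  const modelInput = preprocess(imageBytes);
  const rawScores = infer(modelInput);
  return postprocess(rawScores);
}

// Stubbed inference for demonstration; production code calls SageMaker.
const result = analyze(new Uint8Array([1, 2, 3]), () => [0.1, 0.8, 0.1]);
console.log(result); // { label: 'dog', confidence: 0.8 }
```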

&lt;h3&gt;
  
  
  CDK Stack Structure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzdoz7cdqzqm5gn7e6m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzdoz7cdqzqm5gn7e6m2.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The entire infrastructure is divided into four CDK Stacks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Key Resources&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;StorageStack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;S3 (Input/Output/Model)&lt;/td&gt;
&lt;td&gt;Storage for data and model artifacts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ModelStack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SageMaker Model, EndpointConfig, Endpoint&lt;/td&gt;
&lt;td&gt;AI model hosting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ApiStack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API Gateway, Lambda x3&lt;/td&gt;
&lt;td&gt;Request processing pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ObservabilityStack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CloudWatch, X-Ray&lt;/td&gt;
&lt;td&gt;Monitoring and tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason for separating the stacks is to enable &lt;strong&gt;independent deployment and rollback&lt;/strong&gt;. When updating a model, only the &lt;code&gt;ModelStack&lt;/code&gt; needs to be redeployed, and when modifying API logic, only the &lt;code&gt;ApiStack&lt;/code&gt; needs to be updated. This is a critical design decision in a production environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. StorageStack: Configuring the Data Layer
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib/storage-stack.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-s3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StorageStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Expose it publicly so that it can be referenced from other stacks&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;inputBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;outputBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;modelBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Input Image Buckets: Cost Optimization with Lifecycle Policies&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;InputBucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-input-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lifecycleRules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;// Processed images are automatically deleted after 7 days&lt;/span&gt;
          &lt;span class="na"&gt;expiration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;processed/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;allowedMethods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HttpMethods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HttpMethods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;allowedOrigins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;allowedHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Result Bucket: Cost Optimization with Intelligent-Tiering&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputBucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OutputBucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-output-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;intelligentTieringConfigurations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EntireBucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;archiveAccessTierTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;deepArchiveAccessTierTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Model Artifact Bucket: Enable Version Control&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelBucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ModelBucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-models-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;versioned&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tip:&lt;/strong&gt; Including &lt;code&gt;${this.account}-${this.region}&lt;/code&gt; in your S3 bucket name can help prevent naming conflicts during multi-region deployments. Additionally, &lt;code&gt;removalPolicy: RETAIN&lt;/code&gt; is an important setting that ensures your data is preserved even if you accidentally delete the stack.&lt;/p&gt;
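&lt;p&gt;A small sketch of that naming pattern (the prefix and the validation are illustrative; S3 requires lowercase names of 3–63 characters):&lt;/p&gt;

```typescript
// Compose a globally unique bucket name and check a basic S3 constraint.
function bucketName(prefix: string, account: string, region: string): string {
  const name = `${prefix}-${account}-${region}`.toLowerCase();
  // S3 bucket names must be at most 63 characters long.
  if (name.length > 63) {
    throw new Error(`bucket name too long: ${name}`);
  }
  return name;
}

console.log(bucketName('vaiaas-input', '123456789012', 'ap-northeast-2'));
// → vaiaas-input-123456789012-ap-northeast-2
```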




&lt;h2&gt;
  
  
  4. ModelStack: Configuring SageMaker Endpoints
&lt;/h2&gt;

&lt;p&gt;Configuring SageMaker endpoints is one of the most complex parts of working with CDK: the model artifact path, container image, instance type, and autoscaling policy must all be managed in code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib/model-stack.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;sagemaker&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sagemaker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-iam&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-s3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ModelStackProps&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;modelBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// e.g., 'v1.2.0'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;endpointName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModelStackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpointName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-endpoint-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// SageMaker execution role&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sagemakerRole&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SageMakerRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;assumedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ServicePrincipal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sagemaker.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;managedPolicies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ManagedPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromAwsManagedPolicyName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AmazonSageMakerFullAccess&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Grant read permissions for the model bucket&lt;/span&gt;
    &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grantRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sagemakerRole&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define SageMaker model&lt;/span&gt;
    &lt;span class="c1"&gt;// Use PyTorch inference container (AWS Deep Learning Container)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VisionAIModel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-model-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;executionRoleArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sagemakerRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;roleArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;primaryContainer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// AWS-provided PyTorch inference containers&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`763104351884.dkr.ecr.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;modelDataUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`s3://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/models/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/model.tar.gz`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;SAGEMAKER_PROGRAM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;inference.py&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;SAGEMAKER_SUBMIT_DIRECTORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/opt/ml/model/code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;MODEL_VERSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Endpoint Configuration: GPU Instance + Data Capture&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpointConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnEndpointConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EndpointConfig&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;endpointConfigName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`vaiaas-config-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;productionVariants&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;variantName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AllTraffic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelName&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ml.g4dn.xlarge&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// NVIDIA T4 GPU&lt;/span&gt;
          &lt;span class="na"&gt;initialInstanceCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;initialVariantWeight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="c1"&gt;// Inference Data Capture: Utilization for Model Quality Monitoring&lt;/span&gt;
      &lt;span class="na"&gt;dataCaptureConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;enableCapture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;initialSamplingPercentage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;destinationS3Uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`s3://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/data-capture/`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;captureOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;captureMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;captureMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Output&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Real-time endpoint deployment&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;endpointName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpointName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;endpointConfigName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;endpointConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpointConfigName&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Auto Scaling Settings (Application Auto Scaling)&lt;/span&gt;
    &lt;span class="c1"&gt;// Automatically adjusts between a minimum of 1 and a maximum of 4 instances based on traffic&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scalingTarget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ScalingTarget&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::ApplicationAutoScaling::ScalableTarget&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;MaxCapacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;MinCapacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ResourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`endpoint/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpointName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/variant/AllTraffic`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ScalableDimension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sagemaker:variant:DesiredInstanceCount&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sagemaker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;RoleARN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sagemakerRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;roleArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;scalingTarget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EndpointNameOutput&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpointName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;exportName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VaiaasSageMakerEndpointName&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tip:&lt;/strong&gt; &lt;code&gt;ml.g4dn.xlarge&lt;/code&gt; comes with a single NVIDIA T4 GPU and, in our load tests, offered the best price-performance ratio for Vision AI inference. We initially ran &lt;code&gt;ml.g4dn.2xlarge&lt;/code&gt;, but the tests showed that &lt;code&gt;ml.g4dn.xlarge&lt;/code&gt; was sufficient, so we switched. &lt;strong&gt;Be sure to measure your actual model’s memory usage and inference time before selecting an instance type.&lt;/strong&gt;&lt;/p&gt;
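&lt;p&gt;When sizing from load-test data, tail percentiles matter more than averages. Below is a minimal Python sketch of the kind of summary we relied on; the sample latencies are hypothetical, and in practice they would come from timing &lt;code&gt;invoke_endpoint&lt;/code&gt; calls against the candidate instance type.&lt;/p&gt;

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Hypothetical per-request latencies (ms) collected from a load-test run
latencies = [38, 41, 44, 47, 52, 55, 61, 74, 90, 210]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p)} ms")
```

&lt;p&gt;A p95/p99 far above p50 (as in the hypothetical numbers above) usually points at GPU memory pressure or cold batches rather than a generally undersized instance.&lt;/p&gt;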




&lt;h2&gt;
  
  
  5. ApiStack: Request Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;The API layer consists of three Lambda functions. Each function follows the Single Responsibility Principle (SRP), so it can be deployed and tested independently.&lt;br&gt;
&lt;/p&gt;
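&lt;p&gt;The Lambda source itself is out of scope for this post, but the router's flow can be sketched in Python against the environment variables the stack below defines. This is a minimal sketch, not the actual implementation: the request fields &lt;code&gt;filename&lt;/code&gt;/&lt;code&gt;image&lt;/code&gt;, the &lt;code&gt;requests/&lt;/code&gt; key layout, and the inference payload format are all assumptions.&lt;/p&gt;

```python
import base64
import json
import os
import uuid

def build_object_key(request_id: str, filename: str) -> str:
    """Deterministic S3 layout so inputs can be traced back to a request.
    The requests/<id>/<filename> scheme is an assumption for this sketch."""
    return f"requests/{request_id}/{filename}"

def handler(event, context):
    import boto3  # imported lazily so the pure helper can be unit-tested without the AWS SDK
    s3 = boto3.client("s3")
    sm_runtime = boto3.client("sagemaker-runtime")

    body = json.loads(event["body"])
    request_id = str(uuid.uuid4())
    key = build_object_key(request_id, body["filename"])

    # Keep the original image so failed or drifting predictions can be replayed
    s3.put_object(
        Bucket=os.environ["INPUT_BUCKET"],
        Key=key,
        Body=base64.b64decode(body["image"]),
    )

    # Synchronous real-time inference against the SageMaker endpoint
    response = sm_runtime.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT"],
        ContentType="application/json",
        Body=json.dumps({"s3_key": key}),
    )
    result = json.loads(response["Body"].read())

    return {
        "statusCode": 200,
        "body": json.dumps({"requestId": request_id, "result": result}),
    }
```

&lt;p&gt;The authorizer and the remaining function follow the same shape: each owns exactly one step of the pipeline, which is what makes them independently deployable.&lt;/p&gt;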

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib/api-stack.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-apigateway&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-lambda&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-iam&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-s3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ApiStackProps&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;inputBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;outputBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sagemakerEndpointName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApiStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ApiStackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Lambda Common Layer (common utilities, latest version of boto3, etc.)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;commonLayer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CommonLayer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda/layers/common&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;compatibleRuntimes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PYTHON_3_11&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Common utilities and dependencies&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// 1. Lambda Authorizer: JWT Validation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authorizerFn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AuthorizerFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PYTHON_3_11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda/authorizer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;JWT_SECRET_ARN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:secretsmanager:...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TokenAuthorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;JwtAuthorizer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;authorizerFn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;resultsCacheTtl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Reducing latency through authentication result caching&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// 2. Router Lambda: Request Routing and S3 Uploads&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routerFn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RouterFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PYTHON_3_11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda/router&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;commonLayer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;INPUT_BUCKET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;OUTPUT_BUCKET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucketName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;SAGEMAKER_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sagemakerEndpointName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;memorySize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Tracing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ACTIVE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Enable X-Ray Tracking&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Granting S3 and SageMaker permissions&lt;/span&gt;
    &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grantReadWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;routerFn&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grantWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;routerFn&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;routerFn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addToRolePolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sagemaker:InvokeEndpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;`arn:aws:sagemaker:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:endpoint/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sagemakerEndpointName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. API Gateway Configuration&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RestApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VaiaaasApi&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;restApiName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VAIaaS API&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Vision AI as a Service REST API&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;deployOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;stageName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;tracingEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;metricsEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;loggingLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MethodLoggingLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// Throttling: 100 requests per second, with a burst of 200 requests&lt;/span&gt;
        &lt;span class="na"&gt;throttlingRateLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;throttlingBurstLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="c1"&gt;// CORS Configuration&lt;/span&gt;
      &lt;span class="na"&gt;defaultCorsPreflightOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;allowOrigins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Cors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ALL_ORIGINS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;allowMethods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPTIONS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="na"&gt;allowHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// POST /v1/analyze endpoint&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;analyzeResource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analyze&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;analyzeResource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LambdaIntegration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;routerFn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// The maximum timeout for the API Gateway is 29 seconds&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;authorizationType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AuthorizationType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CUSTOM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// Request validation: Content-Type header required&lt;/span&gt;
        &lt;span class="na"&gt;requestValidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;apigateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RequestValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RequestValidator&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;restApi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;validateRequestBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;validateRequestParameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ApiUrl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VAIaaS API Gateway URL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Issues Encountered in Practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: SageMaker Endpoint Cold Start
&lt;/h3&gt;

&lt;p&gt;SageMaker real-time endpoints do not have cold start issues because the instances are always running. However, &lt;strong&gt;during the initial deployment, it takes 5–15 minutes for the endpoint to transition from the &lt;code&gt;Creating&lt;/code&gt; state to the &lt;code&gt;InService&lt;/code&gt; state&lt;/strong&gt;. This wait time is included during CDK deployment, which extends the total deployment time.&lt;/p&gt;

&lt;p&gt;As a solution, we set the CDK’s &lt;code&gt;waitForDeployment&lt;/code&gt; option to &lt;code&gt;false&lt;/code&gt; and adopted a method of polling the endpoint status in a separate deployment pipeline.&lt;/p&gt;
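&lt;p&gt;The polling step can be sketched as follows. This is an illustrative helper, not the actual pipeline code: the injected &lt;code&gt;fetchStatus&lt;/code&gt; function stands in for a call to the AWS SDK's &lt;code&gt;DescribeEndpoint&lt;/code&gt; API, which returns the endpoint's current &lt;code&gt;EndpointStatus&lt;/code&gt;.&lt;/p&gt;

```typescript
// Sketch of the pipeline's polling step (illustrative, not the exact code).
// In production, fetchStatus would call the AWS SDK DescribeEndpoint API
// and return EndpointStatus; it is injected here so the wait logic stands alone.
async function waitForInService(
  fetchStatus: Function,   // returns a Promise resolving to the endpoint status
  intervalMs: number,
  maxAttempts: number,
) {
  let attempt = 0;
  while (attempt !== maxAttempts) {
    attempt = attempt + 1;
    const status = await fetchStatus();
    if (status === 'InService') {
      return attempt;                       // endpoint is ready
    }
    if (status === 'Failed') {
      throw new Error('Endpoint creation failed');
    }
    // wait before polling again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for InService');
}
```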

&lt;h3&gt;
  
  
  Issue 2: Timeout When Calling SageMaker from Lambda
&lt;/h3&gt;

&lt;p&gt;The maximum integration timeout for API Gateway is &lt;strong&gt;29 seconds&lt;/strong&gt;. In some cases, Vision AI inference for high-resolution images exceeded this limit.&lt;/p&gt;

&lt;p&gt;We implemented two parallel approaches as a solution. First, we resized images to the model’s input size (e.g., 640×640) in a pre-processing Lambda function to reduce inference time. Second, we introduced an &lt;strong&gt;asynchronous processing pattern&lt;/strong&gt; for requests with long processing times. The client receives a request ID immediately and later polls for the results, which are processed asynchronously via SQS.&lt;/p&gt;
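&lt;p&gt;The asynchronous pattern can be sketched with an in-memory job store. The function names here are illustrative; in the real service the hand-off between submission and the worker happens through SQS, and results are persisted rather than held in memory:&lt;/p&gt;

```typescript
import { randomUUID } from 'crypto';

// Minimal sketch of the asynchronous request pattern: submit() returns a
// request ID immediately, a worker records the result later, and the client
// polls with the ID until the status flips to 'done'.
type Job = { status: string; result?: string };
const jobs: { [id: string]: Job } = {};

function submit(): string {
  const id = randomUUID();
  jobs[id] = { status: 'pending' };   // enqueueing to SQS would happen here
  return id;                          // returned to the client right away
}

function complete(id: string, result: string) {
  jobs[id] = { status: 'done', result };   // written by the async worker
}

function poll(id: string) {
  return jobs[id];                    // the client checks the status later
}
```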

&lt;h3&gt;
  
  
  Issue 3: SageMaker Endpoint Cost Optimization
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ml.g4dn.xlarge&lt;/code&gt; instance costs approximately $0.736 per hour. Running it 24 hours a day results in a monthly cost of about $530. The issue was that costs were incurred even during nighttime hours when there was no traffic.&lt;/p&gt;

&lt;p&gt;We considered &lt;strong&gt;SageMaker Serverless Inference&lt;/strong&gt; as a solution, but since it does not support GPUs, we had to switch to CPU inference. Ultimately, we analyzed traffic patterns and implemented a scheduling strategy that sets the &lt;code&gt;MinCapacity&lt;/code&gt; for autoscaling to 0 during nighttime hours. This reduced monthly costs by approximately 40%.&lt;/p&gt;
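&lt;p&gt;As a sanity check on those numbers, the cost arithmetic works out as follows; the 10-hour nightly scale-to-zero window is an assumption chosen for illustration, matching the roughly 40% saving:&lt;/p&gt;

```typescript
// Back-of-the-envelope check of the nightly scale-to-zero saving.
// The 10-hour window is an illustrative assumption, not a measured figure.
const hourlyCost = 0.736;                        // ml.g4dn.xlarge, USD per hour
const alwaysOnMonthly = hourlyCost * 24 * 30;    // about $530 per month

function monthlyCostWithNightScaleDown(nightHours: number) {
  return hourlyCost * (24 - nightHours) * 30;    // only daytime hours billed
}

const scheduled = monthlyCostWithNightScaleDown(10);   // about $309
const savings = 1 - scheduled / alwaysOnMonthly;       // about 0.42
```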




&lt;h2&gt;
  
  
  7. Deployment and Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CDK Deployment Command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy the entire stack (for the initial deployment)&lt;/span&gt;
cdk bootstrap aws://ACCOUNT_ID/REGION
cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt;

&lt;span class="c"&gt;# Deploy a specific stack (when updating a model)&lt;/span&gt;
cdk deploy ModelStack &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="nv"&gt;modelVersion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v1.3.0

&lt;span class="c"&gt;# Review changes before deployment&lt;/span&gt;
cdk diff ApiStack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model Update Workflow
&lt;/h3&gt;

&lt;p&gt;When deploying a new model version, use the &lt;strong&gt;Blue/Green deployment&lt;/strong&gt; strategy. When updating an endpoint, SageMaker prepares new instances while keeping the existing ones running, and switches traffic once the new instances are ready. In the CDK, you can implement this by creating a new &lt;code&gt;EndpointConfig&lt;/code&gt; and updating the &lt;code&gt;endpointConfigName&lt;/code&gt; of the &lt;code&gt;Endpoint&lt;/code&gt;.&lt;/p&gt;
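&lt;p&gt;A minimal sketch of that switch using CDK L1 constructs is shown below. The construct IDs and the &lt;code&gt;modelV2&lt;/code&gt;/&lt;code&gt;endpoint&lt;/code&gt; variables are placeholders, and the fragment assumes it runs inside a stack's constructor with those resources defined elsewhere:&lt;/p&gt;

```typescript
// Sketch of a blue/green model update with CDK L1 constructs (placeholder
// names; assumes this runs inside a Stack, with modelV2 a CfnModel and
// endpoint a CfnEndpoint defined elsewhere in the same stack).
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';

const configV2 = new sagemaker.CfnEndpointConfig(this, 'EndpointConfigV2', {
  productionVariants: [{
    modelName: modelV2.attrModelName,
    variantName: 'AllTraffic',
    initialInstanceCount: 1,
    initialVariantWeight: 1.0,
    instanceType: 'ml.g4dn.xlarge',
  }],
});

// Pointing the existing endpoint at the new config triggers SageMaker's
// blue/green update: new instances come up before traffic is switched.
endpoint.endpointConfigName = configV2.attrEndpointConfigName;
```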

&lt;h3&gt;
  
  
  Monitoring Dashboard
&lt;/h3&gt;

&lt;p&gt;Monitor the following metrics on the CloudWatch dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Alarm Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker &lt;code&gt;ModelLatency&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;5,000ms&lt;/td&gt;
&lt;td&gt;When P99 is exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker &lt;code&gt;Invocations&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Track calls per minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda &lt;code&gt;Duration&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;25,000ms&lt;/td&gt;
&lt;td&gt;When P95 is exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway &lt;code&gt;5XXError&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;When error rate is exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker &lt;code&gt;CPUUtilization&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;Auto-scaling trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
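&lt;p&gt;For example, the &lt;code&gt;ModelLatency&lt;/code&gt; alarm from the table could be defined in CDK roughly as follows. The endpoint and variant names are placeholders, and note that SageMaker emits &lt;code&gt;ModelLatency&lt;/code&gt; in microseconds, so 5,000 ms is 5,000,000 on the metric:&lt;/p&gt;

```typescript
// Sketch of the ModelLatency P99 alarm (endpoint/variant names are
// placeholders; assumes this runs inside a Stack). SageMaker reports
// ModelLatency in microseconds.
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Duration } from 'aws-cdk-lib';

const modelLatencyP99 = new cloudwatch.Metric({
  namespace: 'AWS/SageMaker',
  metricName: 'ModelLatency',
  dimensionsMap: { EndpointName: 'vaiaas-endpoint', VariantName: 'AllTraffic' },
  statistic: 'p99',
  period: Duration.minutes(1),
});

new cloudwatch.Alarm(this, 'ModelLatencyP99Alarm', {
  metric: modelLatencyP99,
  threshold: 5000000,   // 5,000 ms expressed in microseconds
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});
```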




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While building a pipeline for Vision AI serving using AWS CDK, the most significant insight I gained was that &lt;strong&gt;managing infrastructure as code is not merely a matter of convenience, but a matter of service quality&lt;/strong&gt;. Thanks to CDK, model updates, infrastructure changes, and environment replication all went through a code review process, which greatly improved operational stability.&lt;/p&gt;

&lt;p&gt;In the next post, I plan to cover how to integrate &lt;strong&gt;SageMaker Model Monitor&lt;/strong&gt; into this pipeline to detect model drift in a production environment.&lt;/p&gt;

&lt;p&gt;I hope this post is helpful to those looking to build services using Vision AI models. Please leave any questions or feedback in the comments!&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/home.html" rel="noopener noreferrer"&gt;AWS CDK Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" rel="noopener noreferrer"&gt;Amazon SageMaker Developer Guide - Deploy Real-Time Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html" rel="noopener noreferrer"&gt;Amazon SageMaker Serverless Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_sagemaker-readme.html" rel="noopener noreferrer"&gt;AWS CDK API Reference - SageMaker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aws-builders/developing-service-with-aws-cdk-3p8l"&gt;Developing service with AWS CDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework - Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>awscdk</category>
      <category>visionai</category>
      <category>devops</category>
      <category>cdk</category>
    </item>
    <item>
      <title>AWS Amplify vs Netlify comparison for hosting static websites</title>
      <dc:creator>Yeonggyoo Jeon</dc:creator>
      <pubDate>Sun, 31 Mar 2024 11:55:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-amplify-vs-netlify-comparison-for-hosting-static-websites-3hlg</link>
      <guid>https://dev.to/aws-builders/aws-amplify-vs-netlify-comparison-for-hosting-static-websites-3hlg</guid>
      <description>&lt;h1&gt;
  
  
  Before we get into
&lt;/h1&gt;

&lt;p&gt;When you're working on a software development project, you often need a simple demo webpage or demo app to demonstrate your technology or run a POC. Static web pages organized as a simple Single Page App (SPA) are useful here. Many web development frameworks help you implement an SPA (Next.js, Gatsby, Hugo, Nuxt, Jekyll, etc.). You can build a demo app as an SPA and host it on the web as a simple static page to make it accessible. Many services support these frameworks for hosting (Netlify, StackBlitz, AWS Amplify, etc.). There are also services (including frameworks) such as Streamlit and Gradio that let you create an SPA-style demo web app in Python and host it on the web; these specialize in rapidly building web apps around machine learning models (I'll cover them in another article when I get a chance).&lt;/p&gt;

&lt;h1&gt;
  
  
  Configuring the demo app with Next.js
&lt;/h1&gt;

&lt;p&gt;Next.js is an open source framework for developing React-based web applications. It offers a wide range of features such as server-side rendering (SSR), static site generation (SSG), automatic code splitting, image optimization, and more to enable fast and user-friendly web app development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of Next.js:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Server-side rendering: Render HTML, CSS, and JavaScript on the server before sending the page to the client to speed up page loading.&lt;/li&gt;
&lt;li&gt;Static site generation: Deploy your web app as a static site by generating HTML, CSS, and JavaScript as static files during the build phase.&lt;/li&gt;
&lt;li&gt;Automatic code splitting: Load only the code you need to speed up page loading.&lt;/li&gt;
&lt;li&gt;Image optimization: Automatically optimize images to improve the performance of your web app.&lt;/li&gt;
&lt;li&gt;SEO-friendly: Server-side rendering makes it easy for search engine crawlers to index your pages.&lt;/li&gt;
&lt;li&gt;Multiple routing options: Along with default routing, Next.js supports file-based routing, dynamic routing, catch-all routing, and more.&lt;/li&gt;
&lt;li&gt;API routing: Easily create and manage server-side APIs.&lt;/li&gt;
&lt;li&gt;Community support: Next.js has an active community and a wealth of documentation and tutorials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next.js is a good fit if you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Develop fast, user-friendly web apps&lt;/li&gt;
&lt;li&gt;Create SEO-optimized web apps&lt;/li&gt;
&lt;li&gt;Develop React-based web apps&lt;/li&gt;
&lt;li&gt;Use a framework that offers a wide range of features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next.js is a powerful framework that is easy to learn and use, and its feature set can greatly simplify web app development.&lt;/p&gt;
&lt;h2&gt;
  
  
  Learn more about Next.js:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Official documentation: &lt;a href="https://nextjs.org/docs/" rel="noopener noreferrer"&gt;https://nextjs.org/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tutorials: &lt;a href="https://nextjs.org/learn/" rel="noopener noreferrer"&gt;https://nextjs.org/learn/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Community: &lt;a href="https://discord.gg/nextjs" rel="noopener noreferrer"&gt;https://discord.gg/nextjs&lt;/a&gt;

&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Hume AI's demo app
&lt;/h2&gt;

&lt;p&gt;Hume AI is a company that provides an API that recognizes human emotions from text, voice, and vision using AI technology. They provide a Next.js demo app that exercises their API, which I used for testing. In this article, I will build and run this web app locally and finally deploy it to Netlify and AWS Amplify.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App Code Repository : &lt;a href="https://github.com/HumeAI/hume-api-examples" rel="noopener noreferrer"&gt;https://github.com/HumeAI/hume-api-examples&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can try it out here: &lt;a href="https://www.hume.ai" rel="noopener noreferrer"&gt;https://www.hume.ai&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/" rel="noopener noreferrer"&gt;Node&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Download Demo app code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/HumeAI/hume-api-examples
&lt;span class="nb"&gt;cd &lt;/span&gt;hume-api-examples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Development
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Build
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Development mode will start serving on &lt;code&gt;localhost:3001&lt;/code&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  The choice for hosting static websites
&lt;/h1&gt;

&lt;p&gt;Two popular options that web developers often utilize to host static websites or single page apps (SPAs) are Netlify and AWS Amplify. Both platforms support automatic builds and hosting from Git repositories, making static site deployment very easy. In this article, we'll compare the pros and cons of each based on a real-world Next.js app deployment.&lt;br&gt;
We've deployed Hume AI's example demo app written in Next.js to &lt;strong&gt;Netlify&lt;/strong&gt; and &lt;strong&gt;AWS Amplify&lt;/strong&gt; respectively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netlify: &lt;a href="https://6575c5e5fc89902adec12a2c--delicate-dodol-3779df.netlify.app" rel="noopener noreferrer"&gt;https://6575c5e5fc89902adec12a2c--delicate-dodol-3779df.netlify.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Amplify: &lt;a href="https://dev.d1c4gaqa0opsnq.mynextapp.com/" rel="noopener noreferrer"&gt;https://dev.d1c4gaqa0opsnq.mynextapp.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Netlify
&lt;/h2&gt;

&lt;p&gt;Netlify brands itself as an "All-in-one platform for automating modern web projects." It offers continuous deployment from Git, serverless functions, a global CDN, domain management, SSL/TLS certificates, and more.&lt;/p&gt;
&lt;h3&gt;
  
  
  Deploy and Hosting
&lt;/h3&gt;

&lt;p&gt;The simplest way to host with Netlify is 'Deploy manually': drag and drop (or browse to upload) the built output files, and your web app is served at an automatically generated URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrdvzd1aksvbt5qh5akp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrdvzd1aksvbt5qh5akp.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frggxwak1tapwnklfmmsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frggxwak1tapwnklfmmsx.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After it finishes running, you'll have a URL hosted by Netlify.&lt;br&gt;
Netlify URL: &lt;a href="https://6575c5e5fc89902adec12a2c--delicate-dodol-3779df.netlify.app" rel="noopener noreferrer"&gt;https://6575c5e5fc89902adec12a2c--delicate-dodol-3779df.netlify.app&lt;/a&gt; &lt;br&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.Excellent developer experience with instant cache invalidation and atomic deploys
2. Simple HTTPS setup with free Let's Encrypt SSL/TLS certificates
3. Form handling without needing a backend
4. Serverless functions with generous free tier
5. Plugin ecosystem for extending builds
6. GitHub integration with deploy previews
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Paid plans required for features like split testing, analytics, etc.
2. Limited control over CDN caching and behavior
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  AWS Amplify
&lt;/h2&gt;

&lt;p&gt;AWS Amplify is a service that provides an all-in-one solution to build and deploy full-stack serverless web apps, including hosting for single page apps and static sites. It layers on top of other AWS services.&lt;/p&gt;
&lt;h3&gt;
  
  
  Deploy and Hosting
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Requirements
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Node](https://nodejs.org/)
- AWS Account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Install Amplify CLI
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws-amplify/cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Configure the Amplify CLI
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;amplify configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Adding Amplify hosting
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;amplify init

? Enter a name &lt;span class="k"&gt;for &lt;/span&gt;the project: mynextapp
? Enter a name &lt;span class="k"&gt;for &lt;/span&gt;the environment: dev
? Choose your default editor: Visual Studio Code &lt;span class="o"&gt;(&lt;/span&gt;or your preferred editor&lt;span class="o"&gt;)&lt;/span&gt;
? Choose the &lt;span class="nb"&gt;type &lt;/span&gt;of app that youre building: javascript
? What javascript framework are you using: react
? Source Directory Path: src
? Distribution Directory Path: out
? Build Command: npm run-script build
? Start Command: npm run-script start
? Do you want to use an AWS profile? Y
? Please choose the profile you want to use: &amp;lt;your profile&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Add hosting with the Amplify add command:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;amplify add hosting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Deploy the app with the Amplify publish command:
&lt;/h4&gt;

&lt;p&gt;Without touching the web UI, you can wire up Amplify through the configuration in your source folder, then go from build to deployment in a single step with the 'amplify publish' command in the CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;amplify publish

✔ Successfully pulled backend environment dev from the cloud.

    Current Environment: dev

┌──────────┬────────────────┬───────────┬───────────────────┐
│ Category │ Resource name  │ Operation │ Provider plugin   │
├──────────┼────────────────┼───────────┼───────────────────┤
│ Hosting  │ amplifyhosting │ Create    │ awscloudformation │
└──────────┴────────────────┴───────────┴───────────────────┘

? Are you sure you want to &lt;span class="k"&gt;continue&lt;/span&gt;? Yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After it finishes running, you'll have a URL hosted by AWS Amplify.&lt;br&gt;
Amplify URL: &lt;a href="https://dev.d1c4gaqa0opsnq.mynextapp.com/" rel="noopener noreferrer"&gt;https://dev.d1c4gaqa0opsnq.mynextapp.com/&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Tightly integrated with the AWS ecosystem

&lt;ol&gt;
&lt;li&gt;Custom CDN behaviors and cache policies&lt;/li&gt;
&lt;li&gt;Feature branches and pull request previews&lt;/li&gt;
&lt;li&gt;CI/CD capabilities beyond just hosting&lt;/li&gt;
&lt;li&gt;Built-in monitoring and logging
&lt;/li&gt;
&lt;/ol&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;


Cons:
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. More complex configuration than Netlify
&lt;li&gt;No free tier, pay-per-use pricing&lt;/li&gt;
&lt;li&gt;SSL certificates require provisioning
&lt;/li&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;


Conclusion
&lt;/h1&gt;


&lt;p&gt;Both &lt;strong&gt;Netlify and AWS Amplify offer powerful solutions for hosting static web pages&lt;/strong&gt;, each with unique features and considerations. &lt;strong&gt;Netlify&lt;/strong&gt;, with its &lt;strong&gt;simplicity and ease of use&lt;/strong&gt;, is ideal for small to medium-sized projects with simple deployment requirements. On the other hand, &lt;strong&gt;AWS Amplify&lt;/strong&gt; offers &lt;strong&gt;advanced customization options and seamless integration with the broader AWS ecosystem&lt;/strong&gt;, making it ideal for larger projects that require scalability and flexibility. Ultimately, the choice between Netlify and AWS Amplify depends on your project requirements, budget, and familiarity with the respective platforms.&lt;/p&gt;

&lt;p&gt;In the end, the best platform for you depends on your specific needs and skill level. Both Netlify and AWS Amplify offer free trials, so we recommend trying both to see which platform is best for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional Information
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Allow insecure connection on web browser setting
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Setting up for insecure content &lt;br&gt;
&lt;/h4&gt;

&lt;p&gt;This addresses the mixed-content problem that occurs when the hosted page makes insecure (non-SSL) connections to another app's endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chrome : Settings &amp;gt; Privacy and security &amp;gt; Site settings &amp;gt; Additional content settings &amp;gt; Insecure content &lt;/li&gt;
&lt;li&gt;Add the sites : 

&lt;ul&gt;
&lt;li&gt;[*.]netlify.app&lt;/li&gt;
&lt;li&gt;[*.]amplifyapp.com&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.netlify.com/jamstack/" rel="noopener noreferrer"&gt;https://www.netlify.com/jamstack/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.netlify.com/get-started/" rel="noopener noreferrer"&gt;https://docs.netlify.com/get-started/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.amplify.aws/javascript/deploy-and-host/frameworks/deploy-nextjs-app/" rel="noopener noreferrer"&gt;https://docs.amplify.aws/javascript/deploy-and-host/frameworks/deploy-nextjs-app/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/HumeAI/hume-api-examples" rel="noopener noreferrer"&gt;https://github.com/HumeAI/hume-api-examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.damirscorner.com/blog/posts/20210528-AllowingInsecureWebsocketConnections.html" rel="noopener noreferrer"&gt;https://www.damirscorner.com/blog/posts/20210528-AllowingInsecureWebsocketConnections.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>netlify</category>
      <category>webdev</category>
      <category>singlepageapp</category>
    </item>
    <item>
      <title>Developing service with AWS CDK</title>
      <dc:creator>Yeonggyoo Jeon</dc:creator>
      <pubDate>Fri, 28 Oct 2022 02:31:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/developing-service-with-aws-cdk-3p8l</link>
      <guid>https://dev.to/aws-builders/developing-service-with-aws-cdk-3p8l</guid>
      <description>&lt;h1&gt;
  
  
  1. What is AWS CDK(Cloud Development Kit)?
&lt;/h1&gt;

&lt;p&gt;AWS CDK is an open-source software development framework for designing cloud infrastructure in code. AWS released it to the public in July 2019. The &lt;a href="https://aws.amazon.com/cdk/" rel="noopener noreferrer"&gt;official website&lt;/a&gt; describes it as a framework that "lets you define your cloud infrastructure as code in one of its supported programming languages". Unlike earlier IaC tools, the CDK lets you configure cloud resources with verified default values in a familiar programming language. This allows non-specialists to start developing cloud applications by configuring service infrastructure quickly, easily, and in a secure, reusable way.&lt;/p&gt;
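&lt;p&gt;To give a flavor of it, here is a minimal CDK v2 app in TypeScript; this is a generic illustration (not tied to any project in this post) that declares a versioned S3 bucket, which &lt;code&gt;cdk deploy&lt;/code&gt; turns into a CloudFormation stack:&lt;/p&gt;

```typescript
// A minimal CDK v2 app: infrastructure declared as an ordinary TypeScript
// program. Deploying this stack creates one versioned S3 bucket.
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';

class HelloCdkStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    // The Bucket construct fills in verified default values for us.
    new s3.Bucket(this, 'DemoBucket', { versioned: true });
  }
}

const app = new cdk.App();
new HelloCdkStack(app, 'HelloCdkStack');
app.synth();  // emits the CloudFormation template
```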

&lt;h3&gt;
  
  
  Which IaC would I use while I develop with AWS?
&lt;/h3&gt;

&lt;p&gt;In my case, infrastructure management for service development on the public cloud went through the following stages.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  a. Deploy/Manage service infra using the UI console
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If you use the AWS Console to configure infrastructure for service development or hands-on work, designing and deploying the entire service architecture by mouse click, you soon wonder, 'Is it really possible to configure a complex system through this process?'. It also becomes inconvenient when you have to build the same infrastructure elements over and over, or rebuild the entire infrastructure across multiple clusters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  b. Use the CLI command
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Feeling limited by the point-and-click approach, people usually move on to deploying services and configuring cloud infrastructure with AWS CLI scripts.  However, it is hard to handle retries in error situations or to cope with race conditions across concurrent tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  c. Interest in IaC (Infrastructure as Code)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;You can learn AWS CloudFormation, a tool that lets you manage service infrastructure as scripts, or choose HashiCorp's Terraform to describe and manage infrastructure components in its own script format.  However, I was skeptical when I looked at those scripts as IaC and watched colleagues work with them.  From a programmer's point of view, IaC scripts in JSON or YAML seem to repeat the same text over and over.  It was clear that even a small increase in the size of the system would make the number and size of the files too large to manage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  d. CDK, the truly programmable IaC
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;I found the AWS CDK when I felt that scripting IaC was not for me.  The CDK appealed to me, as someone who has been developing software for a long time, because it makes it possible to write infrastructure in a programming language and design it with software development skills honed over decades.  I therefore expected it to make designing, composing, and managing service infrastructure more efficient, and this conviction has only grown stronger now that I have completed an API service development project using the CDK.&lt;br&gt;&lt;br&gt;
 A comparison with Terraform, the representative of existing IaC tools, is given in the table below.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;



&lt;h4&gt;
  
  
  Terraform vs. AWS CDK
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;-&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Terraform&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CDK&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Programming feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implemented in YAML or HCL; requires learning a new language.  Auto-completion only via assist tools (imperfect).  Hard to implement safely without compile errors, etc.&lt;/td&gt;
&lt;td&gt;Five existing programming languages can be used.  Various extensions are possible using OOP; flexible structure and reusability can be improved with patterns.  Expandability limited only by the author's programming ability.  Automation through existing IDEs (VSCode).  Safe implementation through compile errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workload for composing infra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Many references approach infrastructure configuration from an IaaS perspective&lt;/td&gt;
&lt;td&gt;Combines the latest container/serverless technologies rather than focusing on IaaS.  Fully supports IaaS too.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support for Public Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports various public clouds&lt;/td&gt;
&lt;td&gt;Specialized for AWS.  High growth potential as ecosystems such as CDK for Terraform and CDK8s are being created.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License, Maturity and Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires an enterprise contract for more than the basic features.  Deployment stability is slightly lower, as it works directly with the SDK&lt;/td&gt;
&lt;td&gt;Free (uses CloudFormation and Parameter Store).  Mature deployment stability, using CloudFormation as the backend.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Who is it for?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You want more flexible, convenient, and robust infrastructure management than the AWS Console offers.&lt;/li&gt;
&lt;li&gt;You want to configure and manage resources with IaC, but cannot afford to maintain monotonous, huge YAML files.&lt;/li&gt;
&lt;li&gt;You want to actively apply your programming skills to writing IaC.&lt;/li&gt;
&lt;li&gt;You want the optimal IaC for building serverless architectures on AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers with the above concerns can consider the CDK as a tool for designing and managing service infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it operate?
&lt;/h3&gt;

&lt;p&gt;Cloud infrastructure is modeled as reusable components written in a software development language.  Five languages are currently supported (TypeScript, JavaScript, Python, Java, C#).  As for how the CDK operates: the CDK application you write is executed by the CDK CLI, synthesized into a CloudFormation template, and deployed through AWS CloudFormation.  It may help to picture it running through those stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93xinpbohft90zacpzyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93xinpbohft90zacpzyy.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
[Deploying infra by writing code]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyicr6cwv3zokgcq3ml5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyicr6cwv3zokgcq3ml5.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
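&lt;p&gt;To make the flow concrete, here is a minimal sketch of a CDK application in TypeScript; the stack and bucket names are illustrative, and it assumes aws-cdk-lib v2 is installed in the project:&lt;/p&gt;

```typescript
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// A stack is the unit of deployment; it synthesizes to one CloudFormation template.
class HelloCdkStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // A high-level construct: one line yields a bucket with sensible defaults.
    new Bucket(this, 'MyFirstBucket', { versioned: true });
  }
}

const app = new App();
new HelloCdkStack(app, 'HelloCdkStack');
// `cdk synth` turns this app into a CloudFormation template; `cdk deploy` ships it.
```

&lt;p&gt;Running 'cdk synth' against this app emits the CloudFormation template, and 'cdk deploy' hands it to CloudFormation, which is exactly the pipeline pictured above.&lt;/p&gt;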

&lt;h1&gt;
  
  
  2. How to start?
&lt;/h1&gt;

&lt;p&gt;Since the official Amazon documentation is well written, you can easily get started by referring to it.  Use the Developer Guide to learn, and the API Reference for the detailed specifications of APIs during actual development.  Because the CDK API abstracts AWS services well, reading the CDK API Reference is also a good way to learn the AWS services themselves.  In addition, a hands-on lab for first-time users is provided on the CDK Workshop page.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer Guide (&lt;a href="https://docs.aws.amazon.com/cdk/latest/guide/home.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/latest/guide/home.html&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;API Reference (&lt;a href="https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;CDK Workshop (&lt;a href="https://cdkworkshop.com/" rel="noopener noreferrer"&gt;https://cdkworkshop.com/&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  Development using AWS CDK
&lt;/h2&gt;

&lt;p&gt;The CDK is implemented with Node.js, so whichever programming language you develop in, Node must be installed first.  Install aws-cdk using npm in an environment where Node is available.  After initializing the project with 'cdk init', install the required node libraries with 'npm install'.  With the preparations complete, you develop your stacks in code and check them for errors with 'cdk list'.  During development and debugging you use 'list', 'diff', and 'synth', and run 'cdk deploy' repeatedly to ship the confirmed code.  When the service is finally retired or the infrastructure needs to be deleted, 'cdk destroy' cleanly removes the deployed applications and infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install CDK CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install -g aws-cdk
cdk version
mkdir hello-cdk
cd hello-cdk
cdk init --language [typescript/javascript/python/java/csharp]
npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Commands of CDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cdk bootstrap       # Deploy the stack for the CDK Toolkit in your AWS environment
cdk init        # Initializes a new default application in the language of the user's choice
cdk diff        # Identify differences between local AWS CDK code and applications running on AWS 
cdk synth       # Compiling AWS CDK Applications to AWS CloudFormation Templates
cdk deploy      # Deploy the AWS CDK application to a set up AWS account 
cdk destory     # Delete the deployed CDK application 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;&lt;br&gt;
In this article I will stop at this introduction; I am planning a follow-up that covers CDK development methods in more detail.&lt;br&gt;
If you want to experiment a little further, try the hands-on lab on the CDK Workshop page via the reference links above; it should help your understanding.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Features of development with the CDK
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Advantages of AWS development using CDK
&lt;/h3&gt;

&lt;p&gt;The CDK is developed using Constructs, which abstract AWS resources into high-level components. Simply by using these Constructs you get the default settings from AWS best practices, and by reading the API documents you can come to understand each service in depth.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
[AWS CDK Construct]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7911pt9sveciuplehbl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7911pt9sveciuplehbl7.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;On the official AWS CDK page, the advantages of developing with the CDK are grouped into the following four categories.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Easier cloud onboarding&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even if you're new to AWS, you can speed up your onboarding to the cloud. The CDK's API abstracts AWS resources into high-level components initialized with sensible defaults, so you can configure an appropriate system without being an expert.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Faster development process&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since the infrastructure is defined using the features of a programming language, such as OOP, loops, and conditional statements, development can be efficient and fast depending on how the logic is structured.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Customizable and shareable&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is possible to design reusable components tailored to each requirement, and sharing those components as libraries makes it easy to quickly extend security, regulatory, and compliance coverage.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;No context switching&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since code can be developed and deployed from the IDE in your existing development environment, developers can build applications and manage infrastructure without switching to or setting up a separate environment.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/li&gt;

&lt;/ul&gt;
&lt;br&gt;
&lt;/blockquote&gt;
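&lt;p&gt;The "faster development" point above can be sketched with ordinary language features. In this illustrative example (queue names and retention values are my own, assuming aws-cdk-lib v2), a plain loop and a conditional replace what would be three copy-pasted YAML blocks:&lt;/p&gt;

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import { Queue } from 'aws-cdk-lib/aws-sqs';

const app = new App();
const stack = new Stack(app, 'QueuesStack');

// A plain loop stamps out one queue per environment.
for (const env of ['dev', 'staging', 'prod']) {
  new Queue(stack, `JobsQueue-${env}`, {
    // In-line conditional logic: longer retention only in prod.
    retentionPeriod: env === 'prod' ? Duration.days(14) : Duration.days(4),
  });
}
```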

&lt;h3&gt;
  
  
  Reference: AWS Solutions Constructs
&lt;/h3&gt;

&lt;p&gt;Recurring arrangements of components have been captured as patterns and published as CDK Constructs on the &lt;a href="https://docs.aws.amazon.com/solutions/latest/constructs/welcome.html" rel="noopener noreferrer"&gt;AWS Solutions Constructs&lt;/a&gt; page. You can take them, use them as-is, or adapt and reuse them in the systems you develop to construct your desired service design more quickly.&lt;/p&gt;
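&lt;p&gt;As a taste of what these patterns look like, here is a sketch using the aws-apigateway-lambda pattern; it assumes the @aws-solutions-constructs/aws-apigateway-lambda package is installed, and the handler asset path is illustrative:&lt;/p&gt;

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { ApiGatewayToLambda } from '@aws-solutions-constructs/aws-apigateway-lambda';

const app = new App();
const stack = new Stack(app, 'PatternStack');

// One pattern wires an API Gateway REST API in front of a Lambda function,
// with logging and IAM permissions following AWS well-architected defaults.
new ApiGatewayToLambda(stack, 'ApiToLambda', {
  lambdaFunctionProps: {
    runtime: lambda.Runtime.NODEJS_18_X,
    handler: 'index.handler',
    code: lambda.Code.fromAsset('lambda'), // illustrative asset directory
  },
});
```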

&lt;h3&gt;
  
  
  Ecosystem Expansion
&lt;/h3&gt;

&lt;p&gt;The CDK is a great choice for development on AWS, with particular strengths for designing serverless architectures. However, you might think it cannot target other public clouds, or directly build an on-premises or IaaS-focused Kubernetes cluster. To address this, there are projects, run together with other platforms and the CNCF, that bring the advantages of the CDK to other platforms and let you design Kubernetes workloads with CDK code.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/hashicorp/terraform-cdk" rel="noopener noreferrer"&gt;Terraform-cdk&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;HashiCorp has started supporting the CDK as a tool for defining and provisioning infrastructure with Terraform.&lt;br&gt;
[Terraform Providers]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufzf8866af5eb9k7w417.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufzf8866af5eb9k7w417.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
(&lt;a href="https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support/" rel="noopener noreferrer"&gt;https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support/&lt;/a&gt;)&lt;br&gt;
(&lt;a href="https://www.hashicorp.com/blog/announcing-cdk-for-terraform-0-1" rel="noopener noreferrer"&gt;https://www.hashicorp.com/blog/announcing-cdk-for-terraform-0-1&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://cdk8s.io/" rel="noopener noreferrer"&gt;CDK8s&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can use CDK for Kubernetes (CDK8s) to define and manage Kubernetes applications. CDK8s is currently registered as &lt;a href="https://www.cncf.io/sandbox-projects/" rel="noopener noreferrer"&gt;CNCF's Sandbox Project&lt;/a&gt;.&lt;br&gt;
[Workflow of CDK8s]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jzk5sm3h1nta22s9jqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jzk5sm3h1nta22s9jqh.png" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;br&gt;
(&lt;a href="https://aws.amazon.com/ko/blogs/korea/using-cdk8s-for-kubernetes-applications/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ko/blogs/korea/using-cdk8s-for-kubernetes-applications/&lt;/a&gt;)&lt;br&gt;
(&lt;a href="https://aws.amazon.com/blogs/containers/introducing-cdk-for-kubernetes/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/introducing-cdk-for-kubernetes/&lt;/a&gt;)&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Ending
&lt;/h1&gt;

&lt;p&gt;In this article, I introduced the CDK and briefly explained how development proceeds. In the next article, I plan to explain in more detail how to carry out CDK development.  My team is developing a system that provides Vision AI-related APIs using the AWS CDK, so I will be able to share related tips on this blog as development progresses.  &lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>iac</category>
    </item>
    <item>
      <title>Applying for the AWS Community Builder</title>
      <dc:creator>Yeonggyoo Jeon</dc:creator>
      <pubDate>Tue, 25 Oct 2022 07:31:57 +0000</pubDate>
      <link>https://dev.to/aws-builders/applying-for-the-aws-community-builder-48ha</link>
      <guid>https://dev.to/aws-builders/applying-for-the-aws-community-builder-48ha</guid>
      <description>&lt;p&gt;AWS Community Builder is a DevRel(Developer Relation) program operated by AWS.  AWS operates those two DevRel programs each the AWS community builder and the AWS hero.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. What is the AWS Community Builders program?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce2rqdnwpa52pnbgwknm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce2rqdnwpa52pnbgwknm.png" alt=" " width="800" height="153"&gt;&lt;/a&gt;&lt;br&gt;
Official webpage : &lt;a href="https://aws.amazon.com/developer/community/community-builders/" rel="noopener noreferrer"&gt;https://aws.amazon.com/developer/community/community-builders/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One of the DevRel programs officially operated by AWS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides technical resources, support, and opportunities for networking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shares technical knowledge about AWS and connects global tech communities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The program is operated in the following technical categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containers&lt;/li&gt;
&lt;li&gt;Data (Databases, Analytics and BI)&lt;/li&gt;
&lt;li&gt;Developer Tools&lt;/li&gt;
&lt;li&gt;Front-End Web and Mobile&lt;/li&gt;
&lt;li&gt;Game Tech&lt;/li&gt;
&lt;li&gt;Graviton/Arm Development&lt;/li&gt;
&lt;li&gt;Cloud Ops&lt;/li&gt;
&lt;li&gt;Machine Learning&lt;/li&gt;
&lt;li&gt;Network Content &amp;amp; Delivery&lt;/li&gt;
&lt;li&gt;Security &amp;amp; Identity&lt;/li&gt;
&lt;li&gt;Serverless&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although you apply under a single category, there are no restrictions on activities across categories.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. What benefits are provided?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje8hacboos66jneyqi0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje8hacboos66jneyqi0z.png" alt=" " width="800" height="216"&gt;&lt;/a&gt;&lt;br&gt;
If you participate as a Community Builder, you will be invited to a Slack channel that brings together the program operators and builders.  The benefits introduced there are as follows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting with others&lt;/li&gt;
&lt;li&gt;Webinars &amp;amp; Briefings&lt;/li&gt;
&lt;li&gt;Certification Exam Vouchers&lt;/li&gt;
&lt;li&gt;Swag Welcome Kit&lt;/li&gt;
&lt;li&gt;AWS Credits&lt;/li&gt;
&lt;li&gt;A Free One-Year Subscription to Cloud Academy&lt;/li&gt;
&lt;li&gt;Third-Party ISV Offers&lt;/li&gt;
&lt;li&gt;Publishing Content on Dev.to&lt;/li&gt;
&lt;li&gt;Super sweet logos, Wallpaper &amp;amp; Assets&lt;/li&gt;
&lt;li&gt;re:Invent discount passes&lt;/li&gt;
&lt;li&gt;Service beta opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Welcome Kit&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4csurpsuzsrkjze0c051.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4csurpsuzsrkjze0c051.jpeg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  3. Ending
&lt;/h1&gt;

&lt;p&gt;It is a great opportunity for anyone who is interested in AWS technology and enjoys sharing knowledge through a community or blog.  Applications are accepted twice a year; the term of activity is one year and can be extended.  I hope that many people around the world will share their know-how and knowledge of AWS-related technologies through this community.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
