DEV Community

Cover image for Serverless Inference, Cost Optimization, CI/CD Pipelines, and Multi-Region Architecture for FSx for ONTAP S3 Access Points — Phase 5

Serverless Inference, Cost Optimization, CI/CD Pipelines, and Multi-Region Architecture for FSx for ONTAP S3 Access Points — Phase 5

TL;DR

This is Phase 5 of the FSx for ONTAP S3 Access Points serverless patterns collection. Building on the Phase 1 foundation, the 14 industry patterns from Phase 2, the near-real-time + ML + observability stack from Phase 3, and the production SageMaker + Multi-Account + Event-Driven from Phase 4, Phase 5 delivers:

  • SageMaker Serverless Inference: 3rd routing option completing the Batch/Real-time/Serverless trifecta with cold start handling and automatic fallback
  • Cost Optimization Suite: Scheduled Scaling, CloudWatch Billing Alarms (3-tier), Auto-Stop Lambda for idle endpoint detection, and a comprehensive cross-phase cost guide
  • CI/CD Pipeline: GitHub Actions with OIDC authentication, 4-stage gating (cfn-lint → pytest → cfn-guard → Bandit), staging/production deployment with manual approval
  • Multi-Region Architecture: DynamoDB Global Tables for Task Token Store replication, CrossRegionClient failover, DR Tier 1/2/3 definitions with failover runbooks

All AWS runtime features remain opt-in via CloudFormation Conditions (default disabled, zero additional cost). The CI/CD pipeline is provided as an optional GitHub Actions workflow. 15 property-based tests (Hypothesis) validate correctness invariants across all themes.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


Introduction

Phase 4 delivered production SageMaker integration, Multi-Account deployment, and an Event-Driven prototype. The "What's Next" section outlined four remaining gaps:

  1. Serverless Inference: Sporadic workloads need a pay-per-request option without always-on instances
  2. Cost optimization: Real-time Endpoints cost ~$215/month — operators need automated cost controls
  3. CI/CD automation: Manual deployment doesn't scale for teams with multiple contributors
  4. Multi-Region resilience: Single-region architectures have a blast radius equal to the entire system

Phase 5 addresses all four while maintaining the project's core principle: every feature is opt-in with zero cost when disabled.


Summary Table

Feature Component AWS Services Key Metric
Serverless Inference 3-way routing SageMaker ServerlessConfig, Step Functions Cold start: 6–45s
Scheduled Scaling Business hours scaling Application Auto Scaling Cost reduction: up to 70%
Billing Alarms 3-tier alerts CloudWatch, SNS Warning/Critical/Emergency
Auto-Stop Idle detection Lambda, CloudWatch Metrics 60-min idle threshold
CI/CD Pipeline 4-stage gating GitHub Actions, OIDC All stages must pass
Multi-Region Global Tables + failover DynamoDB Global Tables, Route 53 RPO near-zero (Tier 1)
Disaster Recovery Tier 1/2/3 Route 53, EventBridge, Step Functions RTO: 5min–4h

Theme A: SageMaker Serverless Inference

The Three Inference Paths

Phase 5 completes the inference routing trifecta that was previewed in Phase 4:

Path Trigger Latency Cost Model Best For
Batch Transform file_count >= threshold OR InferenceType == "none" Minutes Per-job Large batch processing
Real-time Endpoint file_count < threshold AND InferenceType == "provisioned" Milliseconds Per-instance-hour Consistent traffic
Serverless Inference InferenceType == "serverless" Seconds (warm) / 6–45s (cold) Per-request Sporadic, unpredictable traffic

Deterministic 3-Way Routing

The routing logic in shared/routing.py is deterministic — the same inputs always produce the same output:

def determine_inference_path(file_count: int, batch_threshold: int, inference_type: str) -> InferencePath:
    if inference_type == "none":
        return InferencePath.BATCH_TRANSFORM
    if inference_type == "serverless":
        return InferencePath.SERVERLESS_INFERENCE
    if file_count >= batch_threshold:
        return InferencePath.BATCH_TRANSFORM
    return InferencePath.REALTIME_ENDPOINT
Enter fullscreen mode Exit fullscreen mode

This is validated by Property Test #1 (Three-Way Routing Determinism): for any combination of file_count, batch_threshold, and inference_type, exactly one path is selected, and calling the function twice with the same inputs produces the same result.

Cold Start Handling

Serverless Inference introduces a unique challenge: ModelNotReadyException during cold starts. Our implementation handles this with:

  1. Extended initial timeout: 60 seconds (vs. standard 30s for Real-time)
  2. Retry with backoff: 3-second delay, maximum 2 retries
  3. Total timeout guard: initial_timeout + (retry_delay × max_retries) <= step_functions_task_timeout (120s)
  4. Cold start detection: Latency > 5000ms triggers ColdStartDetected EMF metric
  5. Automatic fallback: On timeout, the Step Functions Catch block routes to Batch Transform
# Step Functions definition (simplified)
ServerlessInferencePath:
  Type: Task
  TimeoutSeconds: 120
  Catch:
    - ErrorEquals: ["States.TaskFailed", "States.Timeout"]
      Next: BatchTransformFallback
Enter fullscreen mode Exit fullscreen mode

ServerlessConfig Validation

The validate_serverless_config() function enforces SageMaker constraints:

  • MemorySizeInMB: Must be one of {1024, 2048, 3072, 4096, 5120, 6144}
  • MaxConcurrency: Must be in range [1, 200]

Property Test #2 validates these constraints hold for all possible inputs.


Theme B: Cost Optimization + Scheduled Scaling

Scheduled Scaling

The most impactful cost optimization for SageMaker Endpoints is time-based scaling. A Real-time Endpoint running 24/7 costs ~$215/month, but most workloads only need it during business hours:

# shared/cfn/scheduled-scaling.yaml (nested stack)
ScaleUpAction:
  Type: AWS::ApplicationAutoScaling::ScheduledAction
  Properties:
    Schedule: "cron(0 9 ? * MON-FRI *)"
    Timezone: "Asia/Tokyo"
    ScalableTargetAction:
      MinCapacity: !Ref BusinessMinCapacity
      MaxCapacity: !Ref BusinessMaxCapacity

ScaleDownAction:
  Type: AWS::ApplicationAutoScaling::ScheduledAction
  Properties:
    Schedule: "cron(0 18 ? * MON-FRI *)"
    Timezone: "Asia/Tokyo"
    ScalableTargetAction:
      MinCapacity: !Ref OffHoursMinCapacity
      MaxCapacity: !Ref OffHoursMaxCapacity
Enter fullscreen mode Exit fullscreen mode

Cost impact: Up to ~70% reduction when off-hours capacity can be reduced to zero (Inference Components) or endpoints are deleted in non-production. For standard Real-time Endpoints with MinCapacity=1, savings depend on the business-hours vs. off-hours capacity delta (e.g., scaling from 4 instances to 1 during off-hours = 75% off-hours savings).

Property Test #5 validates that business_hours_start < business_hours_end is enforced, and Property Test #6 validates that off_hours_max_capacity <= business_min_capacity (guaranteeing cost reduction).

Billing Alarms (3-Tier)

Three escalation levels with strict ordering:

# shared/cfn/billing-alarm.yaml
WarningAlarm:    # Monthly spend > $100 → email notification
CriticalAlarm:   # Monthly spend > $200 → email + escalation
EmergencyAlarm:  # Monthly spend > $500 → immediate action required
Enter fullscreen mode Exit fullscreen mode

Region requirement: AWS billing metrics (AWS/Billing namespace) are published only in US East (N. Virginia) / us-east-1. Deploy billing alarm stacks in us-east-1 regardless of where your workloads run. See AWS documentation.

Property Test #7 validates the invariant: warning < critical < emergency.

Auto-Stop Lambda

Runs hourly via EventBridge Schedule, checking all SageMaker Endpoints for idle status:

# Simplified logic
for endpoint in list_endpoints(project_prefix):
    if has_tag(endpoint, "DoNotAutoStop", "true"):
        continue  # Protected endpoint
    if get_invocations_last_n_minutes(endpoint, idle_threshold) == 0:
        apply_cost_saving_action(endpoint)
        emit_metric("EndpointsStoppedCount", 1)
Enter fullscreen mode Exit fullscreen mode

The Auto-Stop Lambda detects idle endpoints and applies the configured cost-saving action. Depending on the endpoint type and organizational policy, the action can be:

  • Scale down to minimum supported capacity (for endpoints with Inference Components that support MinInstanceCount=0)
  • Disable scheduled capacity (revert to off-hours minimum)
  • Delete the endpoint in non-production environments (configurable via AUTO_STOP_ACTION environment variable)

Important: SageMaker Real-time Endpoints (standard ProductionVariant-based) do not support scaling to zero instances. Only endpoints hosting Inference Components with ManagedInstanceScaling.MinInstanceCount=0 can scale in to zero. For standard endpoints, the Auto-Stop action defaults to deleting the endpoint in non-production environments or reducing to MinCapacity=1.

Key design decisions:

  • Configurable action: Scale down, disable scheduling, or delete (based on environment)
  • Tag protection: DoNotAutoStop=true prevents any action (Property Test #8)
  • Non-destructive default: In production, never deletes — only adjusts capacity (Property Test #9)
  • DRY_RUN mode: Log-only mode for safe testing
  • EMF metrics: EstimatedSavingsPerHour for cost visibility

Theme C: CI/CD Pipeline

Pipeline Architecture

PR → main branch
  ├── Stage 1: cfn-lint (all CloudFormation templates)
  ├── Stage 2: pytest + Hypothesis (coverage ≥ 80%)
  ├── Stage 3: cfn-guard (security compliance)
  └── Stage 4: Bandit + pip-audit (code security)
      ↓ All stages pass
  Deploy to Staging (auto)
      ↓ Smoke test passes
  Manual Approval (Environment Protection Rules)
      ↓ Approved
  Deploy to Production
Enter fullscreen mode Exit fullscreen mode

Strict Gating

Property Test #10 validates the gating invariant: if any stage reports "fail", the final pipeline status is "failure". Only when all stages report "pass" does the pipeline succeed.

OIDC Authentication

No long-lived AWS credentials stored in GitHub:

permissions:
  id-token: write
  contents: read

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-actions-deploy
    aws-region: ap-northeast-1
Enter fullscreen mode Exit fullscreen mode

Deployment Stage Ordering

Property Test #11 validates: production deployment is only permitted when staging succeeds AND smoke tests pass. Staging failure or smoke test failure blocks production.

Security Rules (cfn-guard)

Five rule files enforce security compliance:

Rule Enforcement
iam-least-privilege.guard No Action: "*" + Resource: "*"
encryption-required.guard KMS encryption on DynamoDB, S3, SNS
lambda-limits.guard Timeout ≤ 900s, Memory ≤ 10240MB
no-public-access.guard No public S3, no 0.0.0.0/0 SG ingress
sagemaker-security.guard VPC config, encryption required

Property Test #12 validates: any IAM policy with Action: "*" AND Resource: "*" is flagged as a violation.


Theme D: Multi-Region Architecture

DynamoDB Global Tables

The Task Token Store (introduced in Phase 4) is extended to a Global Table for cross-region replication:

# shared/cfn/global-task-token-store.yaml
Type: AWS::DynamoDB::GlobalTable
Properties:
  TableName: !Sub "${AWS::StackName}-task-token-store"
  BillingMode: PAY_PER_REQUEST
  StreamSpecification:
    StreamViewType: NEW_AND_OLD_IMAGES
  Replicas:
    - Region: ap-northeast-1
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
    - Region: us-east-1
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
  TimeToLiveSpecification:
    AttributeName: ttl
    Enabled: true
Enter fullscreen mode Exit fullscreen mode

Key properties:

  • Version 2019.11.21: Latest Global Tables version
  • PAY_PER_REQUEST: No capacity planning needed
  • TTL propagation: Automatic across all replicas
  • Consistency: Multi-Region Eventual Consistency (MREC) by default — writes replicate typically within one second, but cross-region reads may return stale data

Important: Step Functions task tokens are tied to the regional execution that generated them. Global Tables replicate the token metadata for availability, but failover logic must still call back to the correct execution region, or start a secondary workflow execution during regional failover. The token record includes execution_region and state_machine_arn attributes to avoid cross-region callback ambiguity.

CrossRegionClient Failover

The shared/cross_region_client.py is extended with automatic failover:

def access_with_failover(self, operation, **kwargs):
    """Primary → Secondary automatic failover."""
    try:
        return self._execute(self.primary_region, operation, **kwargs)
    except (TimeoutError, ConnectionError, ServerError) as e:
        self._emit_metric("CrossRegionFailoverCount", 1)
        return self._execute(self.secondary_region, operation, **kwargs)
Enter fullscreen mode Exit fullscreen mode

Property Test #13 validates: Primary region is always tried first, and Secondary is only attempted when Primary fails with timeout, 5xx, or connection error.

Disaster Recovery Tiers

Tier RPO RTO Strategy Monthly Cost Premium
Tier 1 Near-zero (typically < 1s) < 5 min Active-Active (Global Tables + dual SF) +100%
Tier 2 < 1 hour < 30 min Warm Standby (Global Tables + standby Lambda) +30–50%
Tier 3 < 24 hours < 4 hours Backup & Restore (S3 cross-region + manual) +5–10%

Note: Actual RPO for Tier 1 depends on DynamoDB Global Tables cross-region replication lag (typically under 1 second with MREC) and workload-specific write ordering requirements. For strict RPO=0, consider Multi-Region Strong Consistency (MRSC) mode, which synchronously replicates writes to at least one other region before returning success.

Route 53 health checks and failover routing direct event producers and API callers to the active region. For Tier 1, failover is automatic (health check failure triggers routing change within 10–30 seconds). For Tier 2/3, failover can be manual or semi-automated via CloudWatch Composite Alarms. See the Disaster Recovery Guide for detailed runbooks.

Active-Passive Guarantee

Property Test #15 validates: when the Primary region health check is "healthy", the Secondary region does NOT process events. Only when Primary is "unhealthy" does Secondary activate.

Resource Isolation

Property Test #14 validates: Primary and Secondary region resource names never collide, preventing accidental cross-region interference.


Design Principles

Opt-in Everything (Continued)

Phase 5 adds new Conditions:

Conditions:
  IsServerlessInference: !Equals [!Ref InferenceType, "serverless"]
  HasProvisionedConcurrency: !Not [!Equals [!Ref ServerlessProvisionedConcurrency, "0"]]
  EnableScheduledScalingCondition: !Equals [!Ref EnableScheduledScaling, "true"]
  EnableBillingAlarmsCondition: !Equals [!Ref EnableBillingAlarms, "true"]
  EnableAutoStopCondition: !Equals [!Ref EnableAutoStop, "true"]
  EnableMultiRegionCondition: !Equals [!Ref EnableMultiRegion, "true"]
Enter fullscreen mode Exit fullscreen mode

Non-Breaking Guarantee

Phase 5 additions do not modify any existing Phase 1/2/3/4 code:

  • shared/routing.py is a new module (not modifying existing files)
  • shared/cost_validation.py is a new module
  • Existing Lambda functions continue to work unchanged
  • All existing tests pass without modification

Property-Based Testing (Hypothesis)

Phase 5 introduces 15 correctness properties:

# Property Validates
1 Three-Way Routing Determinism Same inputs → same path, always exactly one path
2 ServerlessConfig Validation MemorySizeInMB ∈ {1024..6144}, MaxConcurrency ∈ [1,200]
3 Serverless Invocation Timeout Bound Total retry time ≤ Step Functions timeout
4 Inference Type Response Transparency Both modes return identical key sets
5 Scheduled Scaling Time Ordering start < end enforced
6 Cost Reduction Guarantee off_hours_max ≤ business_min
7 Billing Alarm Threshold Ordering warning < critical < emergency
8 Auto-Stop Tag Protection DoNotAutoStop=true → never stopped
9 Non-Destructive Cost Action Guarantee In production, action is "scale to minimum" or "disable scheduling", never "delete"
10 CI Strict Gating Any "fail" → pipeline "failure"
11 Deployment Stage Ordering staging success required for production
12 No Admin Access in IAM Policies Action:* + Resource:* → violation
13 Cross-Region Failover Ordering Primary first, Secondary only on failure
14 Multi-Region Resource Isolation No resource name collisions
15 Active-Passive Guarantee Secondary inactive when Primary healthy

Property Test #12 focuses on the highest-risk wildcard pattern (Action: "*" combined with Resource: "*"). Additional organization-specific cfn-guard rules can extend detection to privileged service-level wildcards (e.g., iam:*) or managed policy attachments like AdministratorAccess.


Cost Impact

Feature Default Monthly Cost (when enabled)
Serverless Inference (no PC) Disabled $1–300/month (depends on memory size, invocation count, and processing duration per request)
Serverless Inference (PC=1) Disabled ~$50–160 (PC fixed cost)
Scheduled Scaling Disabled $0 (Auto Scaling feature)
Billing Alarms Disabled ~$0.30 (3 alarms)
Auto-Stop Lambda Disabled ~$0 (hourly invocation)
CI/CD Pipeline N/A $0 (GitHub Actions free tier for public repos; varies for private repos or larger runners)
DynamoDB Global Tables Disabled ~$0–5 (PAY_PER_REQUEST)
Multi-Region (full) Disabled +30–100% of base cost

Actual multi-region cost depends on replica count, replicated write volume, standby compute, and observability retention.


Lessons Learned

1. Serverless Inference Cold Start is Highly Variable

Cold start latency varies significantly based on model size, container image size, and framework initialization. Our testing showed 6–45 second range for the same endpoint configuration. The retry strategy (3s delay × 2 retries) handles most cases, but the Batch Transform fallback is essential for reliability.

2. Scheduled Scaling Requires Timezone Awareness

AWS Application Auto Scaling scheduled actions use UTC by default. For JST-based business hours, the Timezone parameter must be explicitly set to Asia/Tokyo. Without this, scaling actions fire at wrong times.

3. Global Tables Require Stream Specification

DynamoDB Global Tables (Version 2019.11.21) require StreamSpecification: NEW_AND_OLD_IMAGES. Attempting to create a Global Table without streams results in a validation error. This is a hard requirement for cross-region replication.

4. GitHub Actions OIDC Requires Careful IAM Trust Policy

The OIDC trust policy must match the exact GitHub repository and branch pattern. A common mistake is using * for the subject claim, which allows any repository to assume the role. Always scope to the specific repository and branch.

5. cfn-guard Rules Need Careful Scoping

Overly broad cfn-guard rules can block legitimate configurations. For example, a blanket "no wildcard actions" rule blocks logs:CreateLogGroup which is commonly needed. Rules should target specific high-risk patterns (Action:* + Resource:*) rather than any individual wildcard.


Screenshots

All screenshots are from the ap-northeast-1 (Tokyo) verification environment. Account IDs and environment-specific information have been masked.

SageMaker Serverless Inference Endpoint

SageMaker Serverless Endpoint Settings

Serverless Inference Endpoint settings: Memory 4096 MB, Max Concurrency 5. No provisioned instances — compute allocated on-demand per request.

SageMaker Serverless Endpoint Config

Endpoint Configuration detail showing the ServerlessConfig parameters. This is the third routing option alongside Batch Transform and Real-time Endpoint.

SageMaker Serverless Endpoint Creating

Endpoint creation in progress. No provisioned instances are maintained for standard Serverless Inference — compute is allocated on demand per request, which is why cold starts (6–45 seconds observed) can occur after idle periods.

CloudWatch Billing Alarms (3-Tier)

CloudWatch Billing Alarms

Three-tier billing alarms: Warning ($100), Critical ($200), Emergency ($500). Each tier triggers SNS notification with escalating urgency. All alarms in OK state during verification.

DynamoDB Global Table (Multi-Region)

DynamoDB Global Table

DynamoDB Global Table configuration for the Task Token Store. Multi-Region replication enabled between ap-northeast-1 and us-east-1.

DynamoDB Global Replicas

Global Table replica status showing active replication across regions. TTL propagation and PITR enabled on all replicas.


What's Next

  • FSx ONTAP S3 AP native events: When available, migrate from polling to event-driven with the Phase 4 prototype as the blueprint
  • SageMaker Inference Components: Explore hosting multiple models on a single endpoint for further cost optimization, including scale-to-zero patterns where ManagedInstanceScaling.MinInstanceCount=0 is supported
  • Lambda SnapStart for Python: Reduce Discovery/Processing Lambda cold starts toward sub-second startup times with SnapStart for supported Python runtimes and regions
  • Local testing + deploy-time policy enforcement: Expand validation with sam local invoke / Finch support, and enforce cfn-guard rules at deployment using CloudFormation Guard Hooks

Impact Assessment

All Phase 5 features are opt-in and disabled by default. For a comprehensive evaluation of the impact on existing environments when enabling features across all phases (1–5), including safe enablement order, rollback procedures, and cost impact summary, see the Existing Environment Impact Assessment Guide.


Conclusion

Phase 5 transforms the FSxN S3AP Serverless Patterns into a production-ready, cost-optimized, multi-region reference architecture:

  • Serverless Inference completes the inference routing trifecta, giving operators the right tool for every traffic pattern
  • Cost Optimization Suite provides automated controls that can reduce SageMaker costs by up to 70%
  • CI/CD Pipeline enables team collaboration with automated quality gates and safe deployments
  • Multi-Region Architecture provides resilience patterns from simple backup (Tier 3) to near-zero-RPO active-active targets (Tier 1)

All AWS runtime features remain opt-in, and the CI/CD pipeline remains optional, maintaining the project's core principle: learn from the design decisions without paying for resources you don't need.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


This article is part of the "FSx for ONTAP S3 Access Points" series. See Phase 1, Phase 2, Phase 3, and Phase 4 for the foundation.

Top comments (0)