
Production SageMaker Patterns, Multi-Account Deployment, and Event-Driven Architecture for FSx for ONTAP S3 Access Points — Phase 4

TL;DR

This is Phase 4 of the FSx for ONTAP S3 Access Points serverless patterns collection. Building on the Phase 1 foundation, the 14 industry patterns from Phase 2, and the near-real-time + ML + observability stack from Phase 3, Phase 4 delivers:

  • DynamoDB Task Token Store: Correlation ID pattern solving the 256-char SageMaker tag limit for production Step Functions Callback workflows
  • Real-time Inference + A/B Testing: SageMaker Multi-Variant Endpoints with intelligent routing and automated comparison metrics
  • Multi-Account Deployment Patterns: StackSets templates, cross-account IAM roles with S3 Access Point policies, and CloudWatch Cross-Account Observability for enterprise-scale rollout
  • Event-Driven Architecture Prototype: Standard S3 bucket events → EventBridge → Step Functions achieving 3.5-second E2E latency, as a future-compatible prototype for native FSx ONTAP S3 AP notifications

All features are opt-in via CloudFormation Conditions (default disabled, zero additional cost). 681 total tests pass, including 11 property-based tests (Hypothesis).

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


Introduction

Phase 3 delivered near-real-time streaming, SageMaker Batch Transform with the Callback Pattern, and a full observability stack. But it left several production gaps:

  1. Task Token storage: The Phase 3 mock mode passed tokens directly, but SageMaker tag values are limited to 256 characters while Task Tokens are ~1,000 characters
  2. Inference latency: Batch Transform takes minutes — some use cases need millisecond-level responses
  3. Enterprise deployment: Single-account patterns don't scale to organizations with multiple workload accounts
  4. Event latency: Even with Kinesis, the polling-based change detection adds a 1-minute floor to end-to-end latency

Phase 4 addresses all four gaps while maintaining the project's core principle: every feature is opt-in with zero cost when disabled.


Summary Table

| Feature | Component | AWS Services | Verification |
|---|---|---|---|
| Task Token Store | Correlation ID + DynamoDB | DynamoDB, Step Functions, Lambda | ✅ E2E (8-char hex round-trip, TTL cleanup) |
| Real-time Inference | Multi-Variant Endpoint | SageMaker, Auto Scaling | ✅ Deployed (ml.m5.large, 1-4 instances) |
| A/B Testing | Traffic Splitting + Comparison | SageMaker, CloudWatch EMF | ✅ Variant metrics aggregation verified |
| Multi-Account | StackSets + Cross-Account IAM | CloudFormation StackSets, IAM, CloudWatch | ✅ Template validation (cross-account role chain) |
| Event-Driven Prototype | S3 → EventBridge → Step Functions | S3, EventBridge, Step Functions, Lambda | E2E: 3.5 seconds (PutObject → processing complete) |
| Model Registry | Approval Workflow | SageMaker Model Registry | ✅ Governance flow validated |

Cost Impact

| Feature | Default | Monthly Cost (when enabled) |
|---|---|---|
| DynamoDB Task Token Store | Disabled | ~$0 (PAY_PER_REQUEST, minimal reads/writes) |
| Real-time Endpoint | Disabled | ~$215/month per ml.m5.large instance/variant (ap-northeast-1) |
| A/B Testing (Multi-Variant) | Disabled | No separate A/B fee, but each variant with provisioned instances adds instance-hour cost (e.g., 2-variant = ~$430/month) |
| Auto Scaling (1-4 instances) | Disabled | Scales per variant; total cost depends on variant count × instance count |
| Model Registry | Disabled | $0 (metadata only) |
| Event-Driven Prototype | Disabled | ~$0 (pay-per-event, negligible at test scale) |

Cost warning: The Real-time Endpoint is the only feature with significant ongoing cost (~$7/day per ml.m5.large instance/variant). A two-variant A/B test roughly doubles the baseline endpoint cost. The cleanup script (scripts/cleanup_phase4.sh --endpoint-only) stops billing within minutes. Always delete endpoints when not actively testing.


Theme A: DynamoDB Task Token Store

The Problem

In Phase 3, we implemented the SageMaker Callback Pattern by passing the Step Functions Task Token directly in SageMaker job tags. This works for prototyping but has a critical limitation: SageMaker tag values are limited to 256 characters, while Task Tokens are approximately 1,000 characters.

The Solution: Correlation ID Pattern

Step Functions (.waitForTaskToken)
  → SageMaker Invoke Lambda
    → Generate 8-char hex Correlation ID (UUID4 prefix)
    → Store {correlation_id → task_token} in DynamoDB (TTL: 24h)
    → Create Transform Job with tag: CorrelationId=abc12345

SageMaker Job Completes
  → EventBridge Rule triggers Callback Lambda
    → Extract CorrelationId from job tags
    → Retrieve task_token from DynamoDB
    → SendTaskSuccess/Failure to Step Functions
    → Delete DynamoDB record (cleanup)

Key Design Decisions

Why 8-character hex IDs? With 32 bits of entropy, the collision probability is low for the expected concurrency in this reference implementation. Conditional writes detect collisions, and the Lambda retries with a new ID up to 3 times. For very high-concurrency production workloads (thousands of concurrent Batch Transform jobs), consider using a longer correlation ID such as 12 or 16 hex characters.

Why DynamoDB? Single-digit millisecond latency, automatic TTL cleanup (24-hour expiry), conditional writes for collision prevention, and pay-per-request pricing that costs effectively $0 at our scale.

Security: Task Tokens are never logged in plaintext. Only the correlation ID appears in CloudWatch Logs, providing auditability without exposing sensitive tokens.

Backward Compatibility: The TOKEN_STORAGE_MODE environment variable allows switching between dynamodb (new) and direct (Phase 3 compatible) modes. The Callback Lambda auto-detects the mode by checking which tag is present on the completed job.
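The auto-detection described above can be sketched as a small helper. This is illustrative: the exact tag keys and return shape are assumptions, not the repository's actual code.

```python
def detect_callback_mode(tags):
    """Given a completed job's tags (a list of {'Key','Value'} dicts, as
    returned by the SageMaker API), decide how to recover the Step Functions
    task token.

    Returns ('dynamodb', correlation_id) when a CorrelationId tag is present,
    or ('direct', task_token) for the Phase 3 style TaskToken tag.
    """
    tag_map = {t["Key"]: t["Value"] for t in tags}
    if "CorrelationId" in tag_map:
        return ("dynamodb", tag_map["CorrelationId"])
    if "TaskToken" in tag_map:
        return ("direct", tag_map["TaskToken"])
    raise ValueError("No callback tag found on job")
```

Because the mode is detected per job, in-flight Phase 3 jobs complete normally even after the DynamoDB mode is enabled.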

DynamoDB Table Design

TaskTokenStore:
  Type: AWS::DynamoDB::Table
  Properties:
    TableName: !Sub "${AWS::StackName}-task-token-store"
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: correlation_id
        AttributeType: S
    KeySchema:
      - AttributeName: correlation_id
        KeyType: HASH
    TimeToLiveSpecification:
      AttributeName: ttl
      Enabled: true

Conditional Write for Collision Prevention

dynamodb_table.put_item(
    Item={
        'correlation_id': correlation_id,
        'task_token': task_token,
        'job_name': transform_job_name,
        'created_at': int(time.time()),
        'ttl': int(time.time()) + 86400  # 24-hour TTL
    },
    ConditionExpression='attribute_not_exists(correlation_id)'
)

If a collision occurs, the conditional write fails and the Lambda retries with a new ID up to 3 times.
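The generate-store-retry loop can be sketched as follows, with an in-memory dict standing in for the DynamoDB table (names and the retry count of 3 follow the text; everything else is illustrative):

```python
import time
import uuid

MAX_RETRIES = 3

def store_task_token(table, task_token, job_name, now=None):
    """Store a task token under a fresh 8-char hex correlation ID.

    `table` is a plain dict here, standing in for DynamoDB; a real
    put_item(ConditionExpression='attribute_not_exists(correlation_id)')
    raises ConditionalCheckFailedException on collision instead of the
    membership check below.
    """
    now = int(now if now is not None else time.time())
    for _ in range(MAX_RETRIES):
        correlation_id = uuid.uuid4().hex[:8]  # 32 bits of entropy
        if correlation_id in table:            # conditional-write collision
            continue
        table[correlation_id] = {
            "task_token": task_token,
            "job_name": job_name,
            "created_at": now,
            "ttl": now + 86400,  # DynamoDB TTL: expire after 24 hours
        }
        return correlation_id
    raise RuntimeError("Could not allocate a collision-free correlation ID")
```

The `ttl` attribute is what lets DynamoDB garbage-collect records for jobs whose callback never fires.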


Theme B: Real-time Inference + A/B Testing

Three Inference Patterns

Phase 4 implements two SageMaker inference patterns with routing logic, and frames Serverless Inference as a future third option for sporadic workloads:

| Pattern | Latency | Cost Model | Best For |
|---|---|---|---|
| Batch Transform | Minutes | Per-job | Large batch processing (≥10 files) |
| Real-time Endpoint | Milliseconds | Per-instance-hour | Consistent low-latency needs |
| Serverless Inference | Seconds | Per-request | Sporadic, unpredictable traffic (planned) |

Intelligent Routing

The Step Functions workflow uses a Choice state to route requests based on file count:

file_count < threshold (default: 10)
  → Real-time Endpoint (low latency, immediate response)

file_count >= threshold
  → Batch Transform (cost-efficient for large batches)

This threshold is configurable via CloudFormation parameters, allowing operators to tune the routing based on their specific latency and cost requirements.
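The routing decision itself is trivial; a Python mirror of the Choice state (function name and return values are illustrative, not the actual state machine definition):

```python
def choose_inference_path(file_count, threshold=10):
    """Mirror of the Step Functions Choice state: small requests go to the
    real-time endpoint, larger batches to Batch Transform."""
    return "realtime" if file_count < threshold else "batch"
```

Keeping the comparison strict (`<`) means a batch exactly at the threshold goes to Batch Transform, matching the `file_count >= threshold` branch above.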

A/B Testing with Multi-Variant Endpoints

SageMaker's native traffic splitting enables A/B testing without additional infrastructure:

ProductionVariants:
  - VariantName: "model-v1"
    ModelName: !Ref ModelV1
    InitialInstanceCount: 1
    InstanceType: ml.m5.large
    InitialVariantWeight: 0.7  # 70% traffic
  - VariantName: "model-v2"
    ModelName: !Ref ModelV2
    InitialInstanceCount: 1
    InstanceType: ml.m5.large
    InitialVariantWeight: 0.3  # 30% traffic

Auto Scaling Configuration

ScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MinCapacity: 1
    MaxCapacity: 4
    ResourceId: !Sub "endpoint/${EndpointName}/variant/model-v1"
    ScalableDimension: sagemaker:variant:DesiredInstanceCount
    ServiceNamespace: sagemaker

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 70  # Scale when invocations per instance > 70
      PredefinedMetricSpecification:
        PredefinedMetricType: SageMakerVariantInvocationsPerInstance

Inference Comparison Lambda

The Inference Comparison Lambda runs every 5 minutes, aggregating per-variant metrics and emitting CloudWatch EMF metrics. The following is simplified pseudo-code. Invocation and error counts are collected separately from CloudWatch metrics such as Invocations, Invocation4XXErrors, and Invocation5XXErrors:

# Simplified pseudo-code — in production, handle pagination and missing
# datapoints. Note: GetMetricStatistics does not accept Statistics and
# ExtendedStatistics in the same call, so averages and latency
# percentiles are fetched separately.
for variant_name in endpoint_variant_names:
    common = dict(
        Namespace='AWS/SageMaker',
        MetricName='ModelLatency',
        Dimensions=[
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=300,
    )
    averages = cloudwatch.get_metric_statistics(
        **common, Statistics=['Average', 'SampleCount'])
    percentiles = cloudwatch.get_metric_statistics(
        **common, ExtendedStatistics=['p50', 'p90', 'p99'])

    # Select the latest datapoint; handle empty Datapoints lists in production
    avg_dp = select_latest_datapoint(averages.get('Datapoints', []))
    pct_dp = select_latest_datapoint(percentiles.get('Datapoints', []))
    emit_emf_metric(
        namespace='FSxN-S3AP-Patterns/Inference',
        dimensions={'Variant': variant_name},
        metrics={
            'AverageLatency': avg_dp.get('Average'),
            'P99Latency': pct_dp.get('ExtendedStatistics', {}).get('p99'),
            'InvocationCount': invocation_count,
            'ErrorRate': error_count / max(invocation_count, 1)
        }
    )
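The `emit_emf_metric` helper referenced above is not shown in the repository excerpt; a minimal sketch of what it could look like, emitting a CloudWatch Embedded Metric Format record to stdout (which a Lambda's log pipeline converts into metrics automatically):

```python
import json
import time

def emit_emf_metric(namespace, dimensions, metrics):
    """Print one CloudWatch EMF record and return it for inspection.

    Metrics with a None value (e.g., no datapoint in the window) are
    dropped rather than emitted, since EMF requires a value for every
    declared metric name.
    """
    present = {k: v for k, v in metrics.items() if v is not None}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": name} for name in present],
            }],
        },
        **dimensions,   # dimension values as top-level fields
        **present,      # metric values as top-level fields
    }
    print(json.dumps(record))
    return record
```

EMF avoids `PutMetricData` API calls entirely: the structured log line is the metric, which keeps the comparison Lambda fast and cheap.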

Model Registry Integration

The SageMaker Model Registry provides a governance layer:

Training → Registration (PendingManualApproval) → Approval → Deployment

Only approved models can be deployed to production endpoints, preventing accidental deployment of untested models.
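The governance gate amounts to a pre-deployment check on `ModelApprovalStatus` (the field SageMaker's DescribeModelPackage returns); the function below is an illustrative sketch, not the repository's actual code:

```python
def assert_deployable(model_package):
    """Refuse to deploy any model package that is not explicitly Approved.

    `model_package` is a dict shaped like a DescribeModelPackage response;
    valid statuses are Approved, Rejected, and PendingManualApproval.
    """
    status = model_package.get("ModelApprovalStatus")
    if status != "Approved":
        raise PermissionError(
            f"Model {model_package.get('ModelPackageArn', '?')} is "
            f"{status!r}; only Approved models may be deployed")
    return True
```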


Theme C: Multi-Account Deployment

Architecture

Management Account
  └── CloudFormation StackSets (deploy UC templates to workload accounts)

Storage Account
  ├── FSx ONTAP File System
  ├── S3 Access Points
  └── S3 Access Point policies + cross-account IAM roles

Shared Services Account
  ├── CloudWatch Observability Sink (cross-account metrics/logs/traces)
  ├── X-Ray Cross-Account Tracing
  └── SNS Aggregated Alerts

Workload Account(s)
  ├── UC Deployments (Lambda + Step Functions)
  ├── Cross-Account IAM Roles (with External ID)
  └── CloudWatch Sharing Links

Security Controls

External ID: All cross-account role assumptions require an External ID, preventing confused deputy attacks.

Permission Boundaries: Cross-account roles have permission boundaries that cap the maximum permissions, preventing privilege escalation even if the role policy is misconfigured.

Least Privilege: Each role has only the permissions required for its specific function (e.g., S3 AP read-only, DynamoDB write-only for token store).

StackSets for Consistent Deployment

CloudFormation StackSets enable deploying the same UC template across multiple accounts with per-account parameter overrides:

# Account-specific overrides
ParameterOverrides:
  - ParameterKey: VpcId
    ParameterValue: "vpc-xxx"  # Account-specific VPC
  - ParameterKey: StorageAccountId
    ParameterValue: "111111111111"
  - ParameterKey: ExternalId
    ParameterValue: "unique-per-account-id"
  - ParameterKey: S3AccessPointAlias
    ParameterValue: "shared-fsxn-s3ap-alias"

Cross-Account Access to S3 Access Points

For FSx for ONTAP S3 Access Points, cross-account access is modeled with S3 Access Point policies and workload-account IAM roles — not AWS RAM. S3 Access Points (including FSx ONTAP S3 AP) are not listed as RAM-shareable resource types. The cross-account pattern uses an AssumeRole chain:

  1. A storage-account IAM role (fsxn-s3ap-storage-access-role) trusted by workload accounts with an External ID condition
  2. S3 Access Point policy allowing the storage-account access role (not the workload account role directly)
  3. Workload Lambda execution role with sts:AssumeRole permission to assume the storage-account role

S3 Access Point policy (storage account) — the Principal is the storage-account role that workload accounts assume:

{
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::STORAGE_ACCOUNT:role/fsxn-s3ap-storage-access-role"},
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:ap-northeast-1:STORAGE_ACCOUNT:accesspoint/my-fsxn-ap",
      "arn:aws:s3:ap-northeast-1:STORAGE_ACCOUNT:accesspoint/my-fsxn-ap/object/*"
    ]
  }]
}

Trust policy on fsxn-s3ap-storage-access-role (storage account) — allows the workload account's specific Lambda role to assume it with External ID:

{
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::WORKLOAD_ACCOUNT:role/workload-lambda-execution-role"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "unique-external-id"}}
  }]
}

For stricter least privilege, trust a specific workload role ARN (as shown above) instead of the workload account root principal (arn:aws:iam::WORKLOAD_ACCOUNT:root). The External ID condition provides an additional layer of confused deputy protection regardless of which principal format is used.

Note on AWS RAM: RAM can share VPC subnets, Transit Gateway, and Route 53 Resolver rules for network connectivity between accounts. S3 Access Points are not RAM-shareable. Verify resource type support in the AWS RAM supported resource types documentation.
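From the workload side, the role chain boils down to one `sts.assume_role` call whose `ExternalId` must match the trust policy's condition. A sketch of the request construction (the role name follows the policies above; session duration and helper name are assumptions):

```python
def build_assume_role_request(storage_account_id, external_id, session_name):
    """Build the kwargs for sts.assume_role(**kwargs) targeting the
    storage-account access role. The ExternalId must match the
    sts:ExternalId condition in that role's trust policy, or the
    call is denied."""
    return {
        "RoleArn": (f"arn:aws:iam::{storage_account_id}:"
                    "role/fsxn-s3ap-storage-access-role"),
        "RoleSessionName": session_name,
        "ExternalId": external_id,
        "DurationSeconds": 900,  # STS minimum; short-lived by design
    }
```

In a real Lambda, the returned credentials would back a boto3 S3 client that addresses objects via the Access Point alias, completing the chain described in steps 1-3 above.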

Cross-Account Observability

CloudWatch Cross-Account Observability aggregates metrics, logs, and traces from all workload accounts into a single shared services dashboard:

  • Unified view of all UC executions across accounts
  • Cross-account X-Ray service maps
  • Centralized alerting via SNS with per-account routing

Theme D: Event-Driven Architecture Prototype

The Current State

FSx for ONTAP S3 Access Points do not currently support GetBucketNotificationConfiguration, which means we cannot receive native event notifications when files are created or modified. This is why Phases 1–3 use a polling-based architecture (EventBridge Scheduler → Step Functions → Discovery Lambda).

The Prototype

Phase 4 includes an event-driven prototype using a standard S3 bucket to demonstrate what the architecture will look like when FSx ONTAP S3 AP adds native notification support:

S3 Bucket (PutObject)
  → S3 Event Notification
    → EventBridge Rule (pattern filter: suffix=.bin, prefix=sensor-data/)
      → Step Functions (StartExecution)
        → Processing Lambdas (reused from UC11)
          → Output written to results bucket

E2E Verification: 3.5 Seconds

In our ap-northeast-1 test environment, the event-driven prototype was verified end-to-end:

| Stage | Latency |
|---|---|
| S3 PutObject → EventBridge delivery | ~500ms |
| EventBridge → Step Functions start | ~200ms |
| Step Functions → Lambda cold start | ~800ms |
| Lambda processing (UC11 pipeline) | ~2,000ms |
| Total E2E | ~3.5 seconds |

Compare this to the polling approach: minimum 1-minute detection interval + processing time = 60+ seconds.

Processing Equivalence

The same Lambda functions produce identical output regardless of whether they're triggered by polling or events. This is validated by Property Test #11 (processing equivalence), ensuring a safe migration path.
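The core of that equivalence is a normalization step: both trigger shapes collapse into one canonical payload before the shared pipeline runs. A sketch with illustrative field names (the EventBridge shape follows S3's "Object Created" event; the polling-record shape is an assumption):

```python
def normalize_trigger(event):
    """Collapse a polling-discovery record and an EventBridge S3 event
    into one canonical payload, so downstream Lambdas are
    trigger-agnostic."""
    if "detail" in event:  # EventBridge S3 "Object Created" event shape
        detail = event["detail"]
        return {"bucket": detail["bucket"]["name"],
                "key": detail["object"]["key"]}
    # polling Discovery Lambda record shape
    return {"bucket": event["bucket"], "key": event["key"]}
```

Property Test #11 then asserts that for any object, both paths normalize to the same payload, which is what makes the Phase A "compare outputs" stage of the migration meaningful.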

Migration Strategy

A three-phase migration allows gradual transition when FSx ONTAP S3 AP adds native event support:

  1. Phase A: Parallel operation (both polling and event-driven active, compare outputs)
  2. Phase B: Event-driven primary (polling as consistency reconciliation fallback)
  3. Phase C: Event-driven only (polling disabled, cost savings)

Future: Native S3 AP Events

When FSx ONTAP S3 AP adds native event notification support, the expected migration path is:

  1. Replace the prototype S3 bucket event source with the S3 AP event source
  2. Update the EventBridge rule pattern or input transformation to match the new event schema
  3. Reuse the existing processing Lambdas and Step Functions where the event payload contract remains compatible

The processing Lambdas and Step Functions are designed to be reusable, but the EventBridge rule pattern and event payload mapping may need adjustment depending on the future S3 AP event schema.


Design Principles

Opt-in Everything

Every Phase 4 feature is controlled by a CloudFormation Condition:

Conditions:
  EnableDynamoDBTokenStoreCondition: !Equals [!Ref EnableDynamoDBTokenStore, "true"]
  EnableRealtimeEndpointCondition: !Equals [!Ref EnableRealtimeEndpoint, "true"]
  EnableABTestingCondition: !Equals [!Ref EnableABTesting, "true"]
  EnableModelRegistryCondition: !Equals [!Ref EnableModelRegistry, "true"]

Default is always false. This means:

  • Zero additional cost when features are disabled
  • No breaking changes to existing Phase 1/2/3 deployments
  • Gradual feature adoption at the operator's pace

Non-Breaking Guarantee

Phase 4 additions do not modify any existing Phase 1/2/3 code:

  • New shared modules are only imported by new Phase 4 Lambdas
  • Existing Lambda functions continue to work unchanged
  • All existing tests pass without modification (573 Phase 1–3 tests + 108 new Phase 4 tests = 681 total)
  • CloudFormation templates remain backward compatible

Property-Based Testing (Hypothesis)

Phase 4 introduces 11 correctness properties:

| # | Property | Examples | Status |
|---|---|---|---|
| 1 | Correlation ID format invariant (8-char hex) | 100 | |
| 2 | Task Token round-trip integrity (store → retrieve → match) | 100 | |
| 3 | TTL calculation correctness (created_at + 86400) | 100 | |
| 4 | Conditional write collision prevention | 100 | |
| 5 | Mode detection by tag presence (CorrelationId vs TaskToken) | 100 | |
| 6 | Token never logged in plaintext | 100 | |
| 7 | Cleanup after callback (DynamoDB record deleted) | 100 | |
| 8 | File count threshold routing (< threshold → realtime) | 100 | |
| 9 | Variant weight normalization (sum = 1.0) | 100 | |
| 10 | Aggregation correctness (per-variant metric math) | 100 | |
| 11 | Processing equivalence (polling vs event-driven output) | 100 | |
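As one concrete example, Property #9 (variant weight normalization) can be sketched as a pure function, which is what makes it easy to check with Hypothesis-generated inputs. The function mirrors how SageMaker turns `InitialVariantWeight` values into traffic fractions (weight divided by the sum of weights); names are illustrative:

```python
def normalize_weights(weights):
    """Scale raw variant weights so they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}
```

The property asserts that, for any positive weight dict, the normalized values sum to 1.0 and preserve each variant's relative share.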

Verification Results

AWS Environment Verification

| Test | Result | Details |
|---|---|---|
| DynamoDB Task Token Store CRUD | ✅ PASS | Conditional write, TTL, round-trip |
| SageMaker Real-time Endpoint deploy | ✅ PASS | ml.m5.large, InService |
| Multi-Variant traffic split | ✅ PASS | 70/30 split verified via CloudWatch |
| Auto Scaling policy | ✅ PASS | Scale-out triggered at 70 invocations/instance |
| StackSets template validation | ✅ PASS | Cross-account parameter overrides |
| Cross-account S3 AP access | ✅ PASS | IAM role chain + S3 AP policy verified |
| Event-Driven E2E | PASS | 3.5 seconds (PutObject → processing complete) |
| CloudFormation validate-template | ✅ PASS | All templates |
| cfn-lint | ✅ PASS | 0 errors |

Local Test Results

| Suite | Tests | Status |
|---|---|---|
| shared/token_store/ | 24 (20 unit + 4 property) | ✅ All pass |
| shared/inference/ | 18 (14 unit + 4 property) | ✅ All pass |
| autonomous-driving (Phase 4) | 38 (35 unit + 3 property) | ✅ All pass |
| event-driven-prototype/ | 28 (28 unit) | ✅ All pass |
| Total (all phases) | 681 | All pass |

Lessons Learned

1. SageMaker Model Requires Pre-existing Model Artifact in S3

SageMaker CreateModel expects the model artifact (e.g., model.tar.gz) to already exist in S3. Unlike training jobs, which produce artifacts as output, real-time endpoints need the artifact uploaded before stack deployment. Our deploy script includes an upload_model_artifact() step that runs before aws cloudformation deploy.

2. CloudFormation Template > 51KB Requires S3 Bucket Deployment

As Phase 4 templates grew beyond 51KB (the inline template body limit), aws cloudformation create-stack --template-body fails. Use an S3-backed deployment path such as aws cloudformation package followed by aws cloudformation deploy, or explicitly provide --s3-bucket to aws cloudformation deploy, or use --template-url with a pre-uploaded template. Our deploy script uses --s3-bucket with a dedicated DeployBucket parameter for this.

3. VPC Lambda ENI Cleanup Takes 5–20 Minutes

When deleting CloudFormation stacks with VPC-attached Lambda functions, ENI (Elastic Network Interface) cleanup can take 5–20 minutes. CloudFormation times out waiting. Our enhanced cleanup script (scripts/cleanup_phase4.sh) proactively finds and deletes orphaned ENIs (status=available) before retrying stack deletion.

# Find orphaned Lambda ENIs
aws ec2 describe-network-interfaces \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=status,Values=available" \
  --query 'NetworkInterfaces[].NetworkInterfaceId'

4. Security Group Cross-References Block Deletion

When SG-A references SG-B in its ingress rules, deleting SG-B fails with "has a dependent object". The cleanup script revokes cross-SG references before attempting deletion:

# Find SGs that reference the target SG
aws ec2 describe-security-groups \
  --filters "Name=ip-permission.group-id,Values=$TARGET_SG"

# Revoke the reference, then delete
aws ec2 revoke-security-group-ingress --group-id $REFERENCING_SG --ip-permissions ...
aws ec2 delete-security-group --group-id $TARGET_SG

5. S3 Versioned Buckets Require Version-Aware Deletion

Stack deletion fails when S3 buckets have versioning enabled because aws s3 rm --recursive only deletes current versions. You must also delete all object versions and delete markers:

# Delete all object versions, then all delete markers.
# delete-objects accepts at most 1,000 keys per call; loop with
# list-object-versions pagination for larger buckets, and skip a call
# when the corresponding list is empty (delete-objects rejects a null
# Objects array).
aws s3api delete-objects --bucket "$BUCKET" --delete \
  "$(aws s3api list-object-versions --bucket "$BUCKET" \
     --query '{Objects: Versions[].{Key:Key,VersionId:VersionId}}' \
     --output json)"

aws s3api delete-objects --bucket "$BUCKET" --delete \
  "$(aws s3api list-object-versions --bucket "$BUCKET" \
     --query '{Objects: DeleteMarkers[].{Key:Key,VersionId:VersionId}}' \
     --output json)"

6. Event-Driven Latency: Sub-4-Second E2E

The event-driven prototype achieved 3.5-second end-to-end latency from S3 PutObject to processing complete. This is a 17x improvement over the 1-minute polling floor. The breakdown: S3 notification (~500ms) + EventBridge routing (~200ms) + Lambda cold start (~800ms) + processing (~2s). With provisioned concurrency, the cold start component can be eliminated.


What's Next

  • FSx ONTAP S3 AP native events: When available, migrate from the standard S3 prototype to production event-driven architecture with minimal changes to the processing workflow
  • Serverless Inference: Add SageMaker Serverless Inference as a third routing option for sporadic workloads
  • Cost optimization: Implement SageMaker Savings Plans and scheduled scaling for predictable workloads
  • Multi-region active-active: Extend multi-account patterns to multi-region with DynamoDB Global Tables for token store

Conclusion

Phase 4 transforms the FSxN S3AP Serverless Patterns from a demonstration project into a production-oriented reference architecture:

  • The DynamoDB Task Token Store solves a real production limitation (256-char tag limit) with an elegant 8-char correlation ID pattern
  • Multi-Variant Endpoints enable safe model iteration with automated A/B comparison metrics
  • Multi-Account templates support enterprise deployment patterns with proper security controls
  • The Event-Driven prototype charts the path forward with proven 3.5-second E2E latency

All features remain opt-in, maintaining the project's core principle: learn from the design decisions without paying for resources you don't need.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


This article is part of the "FSx for ONTAP S3 Access Points" series. See Phase 1, Phase 2, and Phase 3 for the foundation.
