Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on Jun 21

42 Patterns, Category Architecture, and HA LifeKeeper Monitoring — FSx for ONTAP S3 Access Points, Phase 18

#aws #lifekeeper #amazonfsxfornetappontap #s3accesspoints

TL;DR

Phase 18 restructures the entire repository from 41 flat directories into a categorized solutions/ hierarchy, adds HA LifeKeeper Monitoring as a new pattern category, introduces 5 category-specific architecture diagrams, and establishes modern Python project infrastructure. The repository now contains 42 deployable patterns organized for discoverability at scale.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Why Restructure?

With 41 pattern directories at the repository root, navigating the project had become unwieldy. New contributors could not quickly find patterns by domain, the README required extensive scrolling, and adding new categories (HA, GenAI) had no clear placement convention.

Before (flat):

legal-compliance/  financial-idp/  semiconductor-eda/  sap-erp-adjacent/
flexcache-anycast-dr/  genai-kb-selfservice-curation/  event-driven-fpolicy/
... (41 directories at root)

After (categorized):

solutions/
├── industry/           # 28 UC patterns (UC1-UC28)
├── flexcache/          # 7 FlexCache/FlexClone patterns
├── genai/              # 2 GenAI patterns (UC29-UC30)
├── sap/                # SAP/ERP pattern
├── ha/                 # HA monitoring (new)
├── event-driven/       # 2 FPolicy event-driven patterns
└── edge/               # CDN/edge delivery

The Move: git mv with History Preservation

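All 41 directories were moved with git mv. Git does not record renames explicitly, but because the files moved without content changes, its rename detection can still trace each file's history across the move. Key renames include: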

Old Path	New Path	Reason
`sap-erp-adjacent/`	`solutions/sap/erp-adjacent/`	Category grouping
`dynamic-flexcache-render-workflow/`	`solutions/flexcache/dynamic-render-workflow/`	Shorter name
`genai-kb-selfservice-curation/`	`solutions/genai/kb-selfservice-curation/`	Strip prefix
`devops-flexclone-cicd/`	`solutions/flexcache/devops-cicd/`	Strip prefix
`content-edge-delivery/`	`solutions/edge/content-delivery/`	Category grouping

Use git log --follow <file> to trace a file's history across the move. Note that --follow works on a single file path at a time, not on a directory.
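As a rough illustration of the move, the renames in the table can be turned into an ordered list of commands. This is a minimal sketch, not the script used in the repository: `RENAMES` and `plan_moves` are hypothetical names, only the five renames listed above are included, and the `mkdir -p` step assumes the category directories do not exist yet.

```python
from pathlib import PurePosixPath

# Old root-level directory -> new categorized path (from the rename table).
RENAMES = {
    "sap-erp-adjacent": "solutions/sap/erp-adjacent",
    "dynamic-flexcache-render-workflow": "solutions/flexcache/dynamic-render-workflow",
    "genai-kb-selfservice-curation": "solutions/genai/kb-selfservice-curation",
    "devops-flexclone-cicd": "solutions/flexcache/devops-cicd",
    "content-edge-delivery": "solutions/edge/content-delivery",
}


def plan_moves(renames: dict[str, str]) -> list[list[str]]:
    """Return the commands needed to apply the renames, in a stable order."""
    commands: list[list[str]] = []
    created: set[str] = set()
    for old, new in sorted(renames.items()):
        parent = str(PurePosixPath(new).parent)
        if parent not in created:
            # The destination's parent directory has to exist before git mv runs.
            commands.append(["mkdir", "-p", parent])
            created.add(parent)
        commands.append(["git", "mv", old, new])
    return commands


if __name__ == "__main__":
    # Print the plan for review instead of executing it.
    for cmd in plan_moves(RENAMES):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to review the target paths before anything is moved.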

HA LifeKeeper Monitoring — New Pattern

SIOS LifeKeeper is a Linux/Windows HA clustering solution that can be used on Amazon EC2 for application-aware failover scenarios. With FSx for ONTAP Multi-AZ as shared storage (NFS/iSCSI, depending on OS and configuration), this pattern focuses on observing LifeKeeper logs without putting monitoring agents on the HA nodes.

The new HA pattern (solutions/ha/lifekeeper-monitoring/) provides non-intrusive log analysis:

graph TB
    subgraph "HA Cluster"
        LK1[LifeKeeper Node 1<br/>Active]
        LK2[LifeKeeper Node 2<br/>Standby]
    end

    subgraph "Shared Storage"
        FSXN[FSx for ONTAP Multi-AZ]
        S3AP[S3 Access Point<br/>Read-only log access]
    end

    subgraph "Analysis Pipeline"
        SFN[Step Functions]
        DISC[Discovery Lambda<br/>Log classification]
        PROC[Processing Lambda<br/>Bedrock Root Cause Analysis]
        RPT[Report Lambda<br/>Health score + alerts]
    end

    LK1 -->|Log write| FSXN
    FSXN --> S3AP -->|Non-intrusive read| DISC
    SFN --> DISC --> PROC --> RPT
    PROC -->|Nova Pro| BEDROCK[Amazon Bedrock]
    RPT --> SNS[SNS Alert]

Key Design Decisions

Non-intrusive to HA nodes: No monitoring agent is installed on HA nodes. The S3 AP read path avoids host-level changes, although reads through it still consume FSx for ONTAP throughput and S3 API requests like any other read workload.
Human-in-the-loop: AI analysis is advisory only. LifeKeeper's own health checks handle failover decisions.
Health Scoring: 0-100 score with deductions for failover events, comm path latency, and resource state anomalies.
Root Cause Analysis: Bedrock Nova Pro analyzes state transitions (ISP→OSF, ISS→ISP) to identify likely causes.

Implementation note: This pattern observes LifeKeeper logs and produces advisory analysis. It does not replace LifeKeeper cluster design, quorum/witness configuration, split-brain prevention, protocol-specific recovery kit setup, or application-level failover testing.
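The health scoring idea (start at 100, subtract for each kind of finding, never go below 0) can be sketched as follows. The deduction weights and the `LogFindings` and `health_score` names are assumptions made for illustration; the pattern's actual weights and field names may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFindings:
    """Counts extracted from the analyzed logs (hypothetical structure)."""
    failover_events: int = 0
    slow_comm_path_events: int = 0
    resource_state_anomalies: int = 0


# Hypothetical per-event deductions; tune these against real-cluster data.
DEDUCTIONS = {
    "failover_events": 20,
    "slow_comm_path_events": 5,
    "resource_state_anomalies": 10,
}


def health_score(findings: LogFindings) -> int:
    """Return a 0-100 score: 100 minus weighted deductions, floored at 0."""
    penalty = sum(
        getattr(findings, name) * weight for name, weight in DEDUCTIONS.items()
    )
    return max(0, 100 - penalty)
```

Keeping the weights in one table makes the score easy to explain in a report, which fits the advisory, human-in-the-loop role described above.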

Example Operational Metrics

These are suggested evaluation metrics for future validation. Phase 18 verifies the DemoMode pipeline, not real-cluster failover triage time.

Category	Metric	Demo Target
Operations	Time from workflow start to structured triage report	< 10 min
Technical	Log discovery completeness	All configured LifeKeeper log paths in scope
Quality	False positive alert rate	< 5% (requires real-cluster validation)
Cost	Monthly monitoring cost	< $15 (5-min polling, 1 cluster)
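To put the cost target in perspective, the polling interval can be converted into a monthly run count and an average per-run budget. This is simple arithmetic under stated assumptions (a 30-day month, one workflow run per polling interval); it does not use any AWS pricing data, and the function names are illustrative.

```python
def polling_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of scheduled runs in a month of `days` days."""
    minutes_per_day = 24 * 60
    return (minutes_per_day // interval_minutes) * days


def max_cost_per_run(monthly_budget_usd: float, runs: int) -> float:
    """Average spend per run that keeps the month within budget."""
    return monthly_budget_usd / runs


if __name__ == "__main__":
    runs = polling_runs_per_month(5)  # 288 runs per day * 30 days = 8640
    print(runs, round(max_cost_per_run(15.0, runs), 5))
```

With 5-minute polling, the $15 target leaves well under one cent per run on average, so per-run Bedrock usage is likely the figure to watch during validation.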

Deploy (DemoMode)

DemoMode=true deploys without FSx for ONTAP and instead uses a regular S3 bucket with sample LifeKeeper logs:

cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided \
  --parameter-overrides DemoMode=true \
    S3AccessPointAlias=your-demo-bucket \
    OutputBucketName=your-output-bucket

DemoMode verification confirms: SAM template deploys successfully, Step Functions workflow executes all states, Discovery Lambda classifies sample logs correctly, and Processing Lambda generates a health score report. Bedrock analysis produces structured, advisory observations based on log patterns; failover decisions remain outside the AI workflow.
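The discovery step can be pictured as a paginated S3 listing followed by a simple classification. The sketch below is an assumption about the shape of that step, not the repository's Discovery Lambda: the suffix-based `CATEGORIES` table and the function names are hypothetical. The client is passed in, so the same code would take a regular bucket name in DemoMode or an S3 Access Point alias otherwise, assuming the access point supports the list operation.

```python
from typing import Any, Iterator

# Hypothetical suffix-to-category rules; the real classifier may differ.
CATEGORIES = {".log": "lifekeeper-log", ".gz": "rotated-log"}


def classify_key(key: str) -> str:
    """Map an object key to a category based on its suffix."""
    for suffix, category in CATEGORIES.items():
        if key.endswith(suffix):
            return category
    return "unknown"


def discover_logs(
    s3_client: Any, bucket_or_alias: str, prefix: str = ""
) -> Iterator[tuple[str, str]]:
    """Yield (key, category) for every object under the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_or_alias, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"], classify_key(obj["Key"])
```

In a Lambda handler this would typically be called with `boto3.client("s3")`; injecting the client also keeps the function testable with a stub.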

Verified in ap-northeast-1 on 2026-06-21. The sample log set intentionally contains failure-like events, resulting in a demo health score of 40/100. The score is not a benchmark for LifeKeeper or FSx for ONTAP; it demonstrates that the scoring pipeline detects and reports anomalies from sample logs.

Coming in Phase 19: Full E2E verification with a real SIOS LifeKeeper HA cluster (AWS Marketplace + FSx for ONTAP Multi-AZ) including live failover testing, actual state transition detection, and Bedrock RCA quality assessment against real-world failure scenarios.

Category Architecture Diagrams

Phase 18 adds 5 Mermaid architecture diagrams to README.md (both JA and EN), each in a collapsible <details> block:

Category	Key Components
🏭 FlexCache	ONTAP REST API → HealthCheck → RouteDecision → DynamoDB routing → Create/Cleanup lifecycle
🤖 GenAI	FPolicy → SQS → EventBridge → Bedrock KB → RetrieveAndGenerate / Agentic tools
🛡️ HA	S3 AP non-intrusive read → Bedrock RCA → Health score → SNS alerts
⚡ Event-Driven	FPolicy Engine → ECS Fargate TCP → SQS → EventBridge rule routing
🌐 Edge/CDN	3 delivery modes (ORIGIN_PULL, OAC, PUBLISH_PUSH) → vendor-neutral CDN

All diagrams include accTitle and accDescr for screen reader accessibility.
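A small check like the following could guard the accessibility requirement in CI. It is a sketch under the assumption that diagrams live in fenced Mermaid blocks inside Markdown files; the function name is hypothetical and the substring test is deliberately simple.

```python
import re

# Build the fence marker programmatically so this example can itself be fenced.
FENCE = "`" * 3
MERMAID_BLOCK = re.compile(
    re.escape(FENCE) + r"mermaid\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def diagrams_missing_accessibility(markdown: str) -> list[int]:
    """Return 1-based indexes of Mermaid blocks lacking accTitle or accDescr."""
    missing = []
    for index, match in enumerate(MERMAID_BLOCK.finditer(markdown), start=1):
        body = match.group(1)
        if "accTitle" not in body or "accDescr" not in body:
            missing.append(index)
    return missing
```

An empty result means every diagram in the file carries both fields.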

Project Infrastructure Improvements

pyproject.toml (PEP 621)

Modern Python project metadata with unified tool configuration:

[tool.ruff]
target-version = "py312"
line-length = 120

[tool.pytest.ini_options]
addopts = "-v --tb=short --import-mode=importlib"

[tool.coverage.report]
fail_under = 80

Dependency Pinning

# requirements.txt — exact versions for reproducibility
boto3==1.43.29
urllib3==2.7.0
jsonschema==4.17.3

# requirements-dev.txt
pytest==9.1.0
hypothesis==6.155.2
moto==5.2.2
ruff==0.15.17
cfn-lint==1.51.4
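Exact pinning is easy to erode over time, so a check that flags any requirement not written as name==version can help. The following is a minimal sketch, assuming plain requirement lines; it skips option lines such as -r includes and does not handle URLs or environment markers.

```python
import re

# "name==version", with optional extras such as "package[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not line.startswith("-") and not PINNED.match(line):
            offenders.append(line)
    return offenders
```

Running it over requirements.txt and requirements-dev.txt should return an empty list when every dependency is pinned.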

.cfnlintrc

Project-wide cfn-lint configuration that discovers all templates under solutions/:

templates:
  - "solutions/**/template.yaml"
  - "solutions/**/template-deploy.yaml"
ignore_checks:
  - W3002  # Local CodeUri (sam build handles upload)
  - W1031  # Fn::Sub false positive with Secrets Manager ARNs
regions:
  - ap-northeast-1
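To preview which files those two template globs would pick up, the same patterns can be evaluated with pathlib. This is only an approximation: cfn-lint's own glob handling may differ in edge cases, and `discover_templates` is a hypothetical helper.

```python
from pathlib import Path

# Same glob patterns as the templates section of .cfnlintrc.
TEMPLATE_PATTERNS = (
    "solutions/**/template.yaml",
    "solutions/**/template-deploy.yaml",
)


def discover_templates(repo_root: Path) -> list[str]:
    """Return repo-relative template paths matched by the patterns, sorted."""
    found = set()
    for pattern in TEMPLATE_PATTERNS:
        for path in repo_root.glob(pattern):
            found.add(path.relative_to(repo_root).as_posix())
    return sorted(found)


if __name__ == "__main__":
    for template in discover_templates(Path(".")):
        print(template)
```

Comparing the output before and after adding a pattern is a quick way to confirm the new template will be linted.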

Additional Tooling

File	Purpose
`.gitattributes`	Consistent line endings, language-specific diff drivers
`.github/PULL_REQUEST_TEMPLATE.md`	Project-specific PR checklist
`solutions/README.md`	Category navigation index
`CHANGELOG.md`	Keep a Changelog format, all releases
`CONTRIBUTING.md`	"Adding a New Pattern" section with category guide

CI/CD Changes

The CI pipeline was split to avoid module-name collisions in pytest's importlib import mode, which can occur when multiple patterns with identically named handler.py files are collected in the same run:

# Before: single pytest invocation (collision risk)
pytest shared/tests/ solutions/**/tests/ --cov=shared

# After: isolated invocations
- name: Run shared tests with coverage
  run: pytest shared/tests/ --cov=shared
- name: Run pattern tests
  run: pytest solutions/industry/*/tests/ solutions/flexcache/*/tests/ ...

Also added persist-credentials: false to all actions/checkout steps (zizmor security hardening).
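If collisions between patterns ever resurface, one stricter variant of the pytest split is to run each pattern's tests in its own process, so every run starts with a fresh module cache. This is not what the CI described here does; it is a hypothetical sketch that only builds the commands, assuming tests live at solutions/<category>/<pattern>/tests.

```python
from pathlib import Path


def pytest_commands(repo_root: Path, python: str = "python") -> list[list[str]]:
    """Build one pytest command per pattern test directory."""
    test_dirs = sorted(
        d.relative_to(repo_root).as_posix()
        for d in (repo_root / "solutions").glob("*/*/tests")
        if d.is_dir()
    )
    return [[python, "-m", "pytest", d] for d in test_dirs]


if __name__ == "__main__":
    for cmd in pytest_commands(Path(".")):
        print(" ".join(cmd))
```

Each command could then be executed with `subprocess.run(cmd, check=True)`. The trade-off is slower CI, since interpreter startup and collection are repeated for every pattern.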

Multi-Perspective Review

The restructuring and HA monitoring pattern were reviewed from partner delivery, storage architecture, HA operations, security, CI/CD, accessibility, and contributor onboarding perspectives. The review resulted in wording changes around neutrality, operational caveats, connector validation, HA safety boundaries, and repository discoverability.

What This Means for You

Role	Impact
Partner / SI / delivery team	Pattern selection is now intuitive by category — pick `solutions/industry/` for industry PoCs, `solutions/flexcache/` for distributed workloads
New contributor	`CONTRIBUTING.md` now includes "Adding a New Pattern" with required checklist and category selection guide
Existing repository users	`git log --follow <file>` still works. `sam build` and `sam deploy` are unchanged (run from the pattern directory)
CI/CD	No changes needed in your `samconfig.toml`, because CodeUri is resolved relative to template.yaml. Deployment scripts only need updating where they reference the old root-level directory paths

Infrastructure cost: Zero. This is a repository organization change; it does not deploy AWS resources by itself.

What's Next

Phase 16-17 blog articles (GenAI patterns) — ready to publish
dev.to series updated for directory restructuring (Phase 13 link fixed)
Next pattern candidates under evaluation

Getting Started

git clone https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns.git
cd FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

# Browse patterns by category
ls solutions/

# Deploy HA LifeKeeper monitoring (DemoMode)
cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided --parameter-overrides DemoMode=true

# Run tests
make test-quick PYTHON=.venv/bin/python
