42 Patterns, Category Architecture, and HA LifeKeeper Monitoring — FSx for ONTAP S3 Access Points, Phase 18

TL;DR

Phase 18 restructures the entire repository from 41 flat directories into a categorized solutions/ hierarchy, adds HA LifeKeeper Monitoring as a new pattern category, introduces 5 category-specific architecture diagrams, and establishes modern Python project infrastructure. The repository now contains 42 deployable patterns organized for discoverability at scale.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns


Why Restructure?

With 41 pattern directories at the repository root, navigating the project had become unwieldy: new contributors could not quickly find patterns by domain, the README required extensive scrolling, and new categories (HA, GenAI) had no clear placement convention.

Before (flat):

legal-compliance/  financial-idp/  semiconductor-eda/  sap-erp-adjacent/
flexcache-anycast-dr/  genai-kb-selfservice-curation/  event-driven-fpolicy/
... (41 directories at root)

After (categorized):

solutions/
├── industry/           # 28 UC patterns (UC1-UC28)
├── flexcache/          # 7 FlexCache/FlexClone patterns
├── genai/              # 2 GenAI patterns (UC29-UC30)
├── sap/                # SAP/ERP pattern
├── ha/                 # HA monitoring (new)
├── event-driven/       # 2 FPolicy event-driven patterns
└── edge/               # CDN/edge delivery

The Move: git mv with History Preservation

All 41 directories were moved with git mv, so file history stays traceable across the move through Git's rename detection. Key renames include:

| Old Path | New Path | Reason |
|---|---|---|
| sap-erp-adjacent/ | solutions/sap/erp-adjacent/ | Category grouping |
| dynamic-flexcache-render-workflow/ | solutions/flexcache/dynamic-render-workflow/ | Shorter name |
| genai-kb-selfservice-curation/ | solutions/genai/kb-selfservice-curation/ | Strip prefix |
| devops-flexclone-cicd/ | solutions/flexcache/devops-cicd/ | Strip prefix |
| content-edge-delivery/ | solutions/edge/content-delivery/ | Category grouping |

Use git log --follow <file> to trace history across the move.
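
If scripts or documentation still reference pre-Phase 18 paths, a small lookup can translate them. The sketch below is only an illustration: it covers just the five renames listed above, and the name `remap_path` is hypothetical rather than something shipped in the repository.

```python
# Hypothetical helper: maps pre-Phase 18 root paths to their new locations.
# Only the five renames listed in the table above are included.
RENAMES = {
    "sap-erp-adjacent": "solutions/sap/erp-adjacent",
    "dynamic-flexcache-render-workflow": "solutions/flexcache/dynamic-render-workflow",
    "genai-kb-selfservice-curation": "solutions/genai/kb-selfservice-curation",
    "devops-flexclone-cicd": "solutions/flexcache/devops-cicd",
    "content-edge-delivery": "solutions/edge/content-delivery",
}


def remap_path(old_path: str) -> str:
    """Return the post-restructure path, or the input unchanged if unknown."""
    # Split off the first path component, which is the old root directory.
    head, sep, tail = old_path.partition("/")
    new_head = RENAMES.get(head)
    if new_head is None:
        return old_path
    return new_head + sep + tail


print(remap_path("sap-erp-adjacent/template.yaml"))
```

Paths whose first component is not in the lookup are returned untouched, so the helper is safe to run over mixed lists of old and new paths.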


HA LifeKeeper Monitoring — New Pattern

SIOS LifeKeeper is a Linux/Windows HA clustering solution that can be used on Amazon EC2 for application-aware failover scenarios. With FSx for ONTAP Multi-AZ as shared storage (NFS/iSCSI, depending on OS and configuration), this pattern focuses on observing LifeKeeper logs without putting monitoring agents on the HA nodes.

The new HA pattern (solutions/ha/lifekeeper-monitoring/) provides non-intrusive log analysis:

graph TB
    subgraph "HA Cluster"
        LK1[LifeKeeper Node 1<br/>Active]
        LK2[LifeKeeper Node 2<br/>Standby]
    end

    subgraph "Shared Storage"
        FSXN[FSx for ONTAP Multi-AZ]
        S3AP[S3 Access Point<br/>Read-only log access]
    end

    subgraph "Analysis Pipeline"
        SFN[Step Functions]
        DISC[Discovery Lambda<br/>Log classification]
        PROC[Processing Lambda<br/>Bedrock Root Cause Analysis]
        RPT[Report Lambda<br/>Health score + alerts]
    end

    LK1 -->|Log write| FSXN
    FSXN --> S3AP -->|Non-intrusive read| DISC
    SFN --> DISC --> PROC --> RPT
    PROC -->|Nova Pro| BEDROCK[Amazon Bedrock]
    RPT --> SNS[SNS Alert]
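
The read path in the diagram (S3 Access Point to Discovery Lambda) might look roughly like the sketch below. It is not the repository's Discovery Lambda. It assumes a client that behaves like `boto3.client("s3")` and that the access point alias is passed where a bucket name would normally go. The function name `list_log_keys` and the example prefix are hypothetical.

```python
from typing import Any, Iterator


def list_log_keys(s3_client: Any, access_point_alias: str, prefix: str = "") -> Iterator[str]:
    """Yield object keys under `prefix`, read through the access point alias.

    `s3_client` is expected to behave like boto3.client("s3"). Only list
    calls are issued, so nothing is written to the shared storage.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=access_point_alias, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Example usage (requires boto3 and AWS credentials; alias and prefix are placeholders):
# import boto3
# for key in list_log_keys(boto3.client("s3"), "your-access-point-alias", "logs/"):
#     print(key)
```

Passing the client in as an argument keeps the function testable with a stub and leaves credential handling to the caller.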

Key Design Decisions

  1. Non-intrusive to HA nodes: No monitoring agent is installed on HA nodes. The S3 AP read path avoids host-level changes, while still consuming FSx/S3 API throughput like any other read workload.
  2. Human-in-the-loop: AI analysis is advisory only. LifeKeeper's own health checks handle failover decisions.
  3. Health Scoring: 0-100 score with deductions for failover events, comm path latency, and resource state anomalies.
  4. Root Cause Analysis: Bedrock Nova Pro analyzes state transitions (ISP→OSF, ISS→ISP) to identify likely causes.
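
Before any model is involved, state transitions have to be extracted from an ordered sequence of resource states. The sketch below shows one simple way to do that, assuming the states have already been parsed out of the logs. Log parsing is deliberately left out because the article does not show the log format. `find_transitions` and the `WATCHED` set are hypothetical names, and the watch list contains only the two transitions mentioned above.

```python
from typing import Iterable

# Transitions named in the design decisions above; treating them as a
# watch list is an assumption made for this sketch.
WATCHED = {("ISP", "OSF"), ("ISS", "ISP")}


def find_transitions(states: Iterable[str]) -> list[tuple[str, str]]:
    """Return (previous, current) pairs wherever consecutive states differ."""
    transitions = []
    previous = None
    for state in states:
        if previous is not None and state != previous:
            transitions.append((previous, state))
        previous = state
    return transitions


# Illustrative state sequence for a single resource (not real log data).
observed = find_transitions(["ISP", "ISP", "OSF", "ISS", "ISP"])
flagged = [t for t in observed if t in WATCHED]
print(flagged)
```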

Implementation note: This pattern observes LifeKeeper logs and produces advisory analysis. It does not replace LifeKeeper cluster design, quorum/witness configuration, split-brain prevention, protocol-specific recovery kit setup, or application-level failover testing.
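
A deduction-based 0-100 score, as described in the design decisions, can be sketched as follows. The three penalty weights are invented for illustration and are not the repository's values. Only the overall shape (start at 100, subtract per finding, clamp to the valid range) follows the description.

```python
# Hypothetical penalty weights; the pattern's real weights are not given here.
FAILOVER_PENALTY = 20
SLOW_COMM_PATH_PENALTY = 5
STATE_ANOMALY_PENALTY = 10


def health_score(failovers: int, slow_comm_paths: int, state_anomalies: int) -> int:
    """Start at 100 and subtract a fixed penalty per finding, clamped to 0-100."""
    if min(failovers, slow_comm_paths, state_anomalies) < 0:
        raise ValueError("finding counts must be non-negative")
    deductions = (
        failovers * FAILOVER_PENALTY
        + slow_comm_paths * SLOW_COMM_PATH_PENALTY
        + state_anomalies * STATE_ANOMALY_PENALTY
    )
    return max(0, 100 - deductions)


print(health_score(failovers=1, slow_comm_paths=2, state_anomalies=1))
```

Clamping at zero keeps the score meaningful when a noisy log set produces more deductions than the scale allows.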

Example Operational Metrics

These are suggested evaluation metrics for future validation. Phase 18 verifies the DemoMode pipeline, not real-cluster failover triage time.

| Category | Metric | Demo Target |
|---|---|---|
| Operations | Time from workflow start to structured triage report | < 10 min |
| Technical | Log discovery completeness | All configured LifeKeeper log paths in scope |
| Quality | False positive alert rate | < 5% (requires real-cluster validation) |
| Cost | Monthly monitoring cost | < $15 (5-min polling, 1 cluster) |
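
To make the cost target concrete, the arithmetic behind 5-minute polling can be worked through in a few lines. This is a back-of-the-envelope sketch that assumes a 30-day month and treats the whole budget as per-run cost, ignoring any fixed charges. The function names are hypothetical.

```python
def executions_per_month(poll_minutes: int, days: int = 30) -> int:
    """Number of scheduled runs in a month at a fixed polling interval."""
    runs_per_day = 24 * 60 // poll_minutes
    return runs_per_day * days


def budget_per_execution(monthly_budget_usd: float, poll_minutes: int, days: int = 30) -> float:
    """Average spend allowed per run to stay within the monthly budget."""
    return monthly_budget_usd / executions_per_month(poll_minutes, days)


runs = executions_per_month(5)
per_run = budget_per_execution(15.0, 5)
print(runs, round(per_run, 5))
```

Under these assumptions, 5-minute polling means 8,640 runs per month, so a $15 target leaves roughly $0.0017 per run across Lambda, Step Functions, Bedrock, and S3 requests combined.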

Deploy (DemoMode)

With DemoMode=true, the stack deploys without FSx for ONTAP and reads sample LifeKeeper logs from a regular S3 bucket:

cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided \
  --parameter-overrides DemoMode=true \
    S3AccessPointAlias=your-demo-bucket \
    OutputBucketName=your-output-bucket

DemoMode verification confirms that the SAM template deploys successfully, the Step Functions workflow executes all states, the Discovery Lambda classifies the sample logs correctly, and the Report Lambda generates a health score report. Bedrock analysis produces structured, advisory observations based on log patterns; failover decisions remain outside the AI workflow.
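
One way to check from a script that the workflow ran to completion is to poll the execution status. The sketch below assumes a client that behaves like `boto3.client("stepfunctions")` and uses its `describe_execution` call. The helper name `wait_for_execution` is hypothetical, and the 600-second default simply mirrors the 10-minute demo target.

```python
import time
from typing import Any

# Statuses after which a Step Functions execution no longer changes.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(
    sfn_client: Any,
    execution_arn: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll until the execution reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = sfn_client.describe_execution(executionArn=execution_arn)
        status = response["status"]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"execution still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# print(wait_for_execution(boto3.client("stepfunctions"), "arn:aws:states:..."))
```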

Figure: Step Functions graph view of the HA LifeKeeper Monitoring workflow, completed successfully.

Verified in ap-northeast-1 on 2026-06-21. The sample log set intentionally contains failure-like events, resulting in a demo health score of 40/100. The score is not a benchmark for LifeKeeper or FSx for ONTAP; it demonstrates that the scoring pipeline detects and reports anomalies from sample logs.

Coming in Phase 19: Full E2E verification with a real SIOS LifeKeeper HA cluster (AWS Marketplace + FSx for ONTAP Multi-AZ) including live failover testing, actual state transition detection, and Bedrock RCA quality assessment against real-world failure scenarios.


Category Architecture Diagrams

Phase 18 adds 5 Mermaid architecture diagrams to README.md (both JA and EN), each in a collapsible <details> block:

| Category | Key Components |
|---|---|
| 🏭 FlexCache | ONTAP REST API → HealthCheck → RouteDecision → DynamoDB routing → Create/Cleanup lifecycle |
| 🤖 GenAI | FPolicy → SQS → EventBridge → Bedrock KB → RetrieveAndGenerate / Agentic tools |
| 🛡️ HA | S3 AP non-intrusive read → Bedrock RCA → Health score → SNS alerts |
| ⚡ Event-Driven | FPolicy Engine → ECS Fargate TCP → SQS → EventBridge rule routing |
| 🌐 Edge/CDN | 3 delivery modes (ORIGIN_PULL, OAC, PUBLISH_PUSH) → vendor-neutral CDN |

All diagrams include accTitle and accDescr for screen reader accessibility.


Project Infrastructure Improvements

pyproject.toml (PEP 621)

Modern Python project metadata with unified tool configuration:

[tool.ruff]
target-version = "py312"
line-length = 120

[tool.pytest.ini_options]
addopts = "-v --tb=short --import-mode=importlib"

[tool.coverage.report]
fail_under = 80

Dependency Pinning

# requirements.txt — exact versions for reproducibility
boto3==1.43.29
urllib3==2.7.0
jsonschema==4.17.3

# requirements-dev.txt
pytest==9.1.0
hypothesis==6.155.2
moto==5.2.2
ruff==0.15.17
cfn-lint==1.51.4

.cfnlintrc

Project-wide cfn-lint configuration that discovers all templates under solutions/:

templates:
  - "solutions/**/template.yaml"
  - "solutions/**/template-deploy.yaml"
ignore_checks:
  - W3002  # Local CodeUri (sam build handles upload)
  - W1031  # Fn::Sub false positive with Secrets Manager ARNs
regions:
  - ap-northeast-1
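
To preview which files the two template globs would pick up, the same patterns can be evaluated with `pathlib`. This is an approximation for local inspection only: cfn-lint's own glob handling may differ in edge cases, and `discover_templates` is a hypothetical helper, not part of cfn-lint or the repository.

```python
from pathlib import Path

# The two patterns from the .cfnlintrc "templates" list.
PATTERNS = ["solutions/**/template.yaml", "solutions/**/template-deploy.yaml"]


def discover_templates(repo_root: Path) -> list[str]:
    """Return repo-relative paths of files matching the template patterns."""
    found = set()
    for pattern in PATTERNS:
        for path in repo_root.glob(pattern):
            if path.is_file():
                found.add(path.relative_to(repo_root).as_posix())
    return sorted(found)


# Example usage, run from the repository root:
# for template in discover_templates(Path(".")):
#     print(template)
```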

Additional Tooling

| File | Purpose |
|---|---|
| .gitattributes | Consistent line endings, language-specific diff drivers |
| .github/PULL_REQUEST_TEMPLATE.md | Project-specific PR checklist |
| solutions/README.md | Category navigation index |
| CHANGELOG.md | Keep a Changelog format, all releases |
| CONTRIBUTING.md | "Adding a New Pattern" section with category guide |

CI/CD Changes

The CI pipeline was split to avoid module-name collisions under pytest's importlib import mode, which can occur when multiple patterns with identically named handler.py files are collected in a single run:

# Before: single pytest invocation (collision risk)
pytest shared/tests/ solutions/**/tests/ --cov=shared

# After: isolated invocations
- name: Run shared tests with coverage
  run: pytest shared/tests/ --cov=shared
- name: Run pattern tests
  run: pytest solutions/industry/*/tests/ solutions/flexcache/*/tests/ ...
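
The general mechanism behind this kind of collision can be shown without pytest: Python caches modules by name, so two different handler.py files imported under the same name resolve to whichever was loaded first. The sketch below demonstrates that mechanism and one way around it (loading each file under a unique name). It is not a reproduction of the exact CI failure, and the directory and module names are made up for the demo.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_isolated(path: Path, unique_name: str):
    """Load a module from a file under an explicit, unique module name."""
    spec = importlib.util.spec_from_file_location(unique_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> tuple[str, ...]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for pattern in ("pattern_a", "pattern_b"):
            (root / pattern).mkdir()
            (root / pattern / "handler.py").write_text(f"NAME = {pattern!r}\n")

        # Naive approach: both directories on sys.path, both imported as "handler".
        sys.modules.pop("handler", None)
        sys.path[:0] = [str(root / "pattern_a"), str(root / "pattern_b")]
        try:
            import handler as first
            import handler as second  # served from the sys.modules cache
            naive = (first.NAME, second.NAME)
        finally:
            del sys.path[:2]
            sys.modules.pop("handler", None)

        # Isolated approach: each file gets its own module name.
        a = load_isolated(root / "pattern_a" / "handler.py", "pattern_a_handler")
        b = load_isolated(root / "pattern_b" / "handler.py", "pattern_b_handler")
        return naive + (a.NAME, b.NAME)


print(demo())
```

In the naive case both imports report `pattern_a`, which is why running each group of pattern tests in its own pytest process sidesteps the problem.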

Also added persist-credentials: false to all actions/checkout steps (zizmor security hardening).


Multi-Perspective Review

The restructuring and HA monitoring pattern were reviewed from partner delivery, storage architecture, HA operations, security, CI/CD, accessibility, and contributor onboarding perspectives. The review resulted in wording changes around neutrality, operational caveats, connector validation, HA safety boundaries, and repository discoverability.


What This Means for You

| Role | Impact |
|---|---|
| Partner / SI / delivery team | Pattern selection is now intuitive by category: pick solutions/industry/ for industry PoCs, solutions/flexcache/ for distributed workloads |
| New contributor | CONTRIBUTING.md now includes "Adding a New Pattern" with required checklist and category selection guide |
| Existing repository users | git log --follow <file> still works. sam build and sam deploy are unchanged (run from the pattern directory) |
| CI/CD | Zero changes needed in your samconfig.toml or deployment scripts, because CodeUri is relative to template.yaml |

Infrastructure cost: Zero. This is a repository organization change; it does not deploy AWS resources by itself.


What's Next

  • Phase 16-17 blog articles (GenAI patterns) — ready to publish
  • dev.to series updated for directory restructuring (Phase 13 link fixed)
  • Next pattern candidates under evaluation

Getting Started

git clone https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns.git
cd FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

# Browse patterns by category
ls solutions/

# Deploy HA LifeKeeper monitoring (DemoMode)
cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided --parameter-overrides DemoMode=true

# Run tests
make test-quick PYTHON=.venv/bin/python

Yoshiki Fujiwara
