TL;DR
Phase 18 restructures the entire repository from 41 flat directories into a categorized solutions/ hierarchy, adds HA LifeKeeper Monitoring as a new pattern category, introduces 5 category-specific architecture diagrams, and establishes modern Python project infrastructure. The repository now contains 42 deployable patterns organized for discoverability at scale.
Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Why Restructure?
With 41 pattern directories at the repository root, navigating the project had become unwieldy. New contributors could not quickly find patterns by domain, the README required extensive scrolling, and adding new categories (HA, GenAI) had no clear placement convention.
Before (flat):
legal-compliance/ financial-idp/ semiconductor-eda/ sap-erp-adjacent/
flexcache-anycast-dr/ genai-kb-selfservice-curation/ event-driven-fpolicy/
... (41 directories at root)
After (categorized):
solutions/
├── industry/ # 28 UC patterns (UC1-UC28)
├── flexcache/ # 7 FlexCache/FlexClone patterns
├── genai/ # 2 GenAI patterns (UC29-UC30)
├── sap/ # SAP/ERP pattern
├── ha/ # HA monitoring (new)
├── event-driven/ # 2 FPolicy event-driven patterns
└── edge/ # CDN/edge delivery
The Move: git mv with History Preservation
All 41 directories were moved using git mv, preserving full commit history. Key renames include:
| Old Path | New Path | Reason |
|---|---|---|
| sap-erp-adjacent/ | solutions/sap/erp-adjacent/ | Category grouping |
| dynamic-flexcache-render-workflow/ | solutions/flexcache/dynamic-render-workflow/ | Shorter name |
| genai-kb-selfservice-curation/ | solutions/genai/kb-selfservice-curation/ | Strip prefix |
| devops-flexclone-cicd/ | solutions/flexcache/devops-cicd/ | Strip prefix |
| content-edge-delivery/ | solutions/edge/content-delivery/ | Category grouping |
Use git log --follow <file> to trace history across the move.
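The renames in the table can be planned programmatically. The following is a minimal sketch, not the script used for Phase 18: `RENAMES` and `plan_moves` are hypothetical names, and the mapping only covers the five rows listed above. It emits a `mkdir -p` before the first move into each category because `git mv` does not create missing parent directories.

```python
from pathlib import PurePosixPath

# Old -> new paths copied from the rename table (five examples, not all 41 moves).
RENAMES = {
    "sap-erp-adjacent": "solutions/sap/erp-adjacent",
    "dynamic-flexcache-render-workflow": "solutions/flexcache/dynamic-render-workflow",
    "genai-kb-selfservice-curation": "solutions/genai/kb-selfservice-curation",
    "devops-flexclone-cicd": "solutions/flexcache/devops-cicd",
    "content-edge-delivery": "solutions/edge/content-delivery",
}


def plan_moves(renames: dict[str, str]) -> list[list[str]]:
    """Return the commands (as argv lists) needed to apply the renames."""
    commands: list[list[str]] = []
    created: set[str] = set()
    for old, new in renames.items():
        parent = str(PurePosixPath(new).parent)
        if parent not in created:
            # The category directory must exist before git mv can move into it.
            commands.append(["mkdir", "-p", parent])
            created.add(parent)
        commands.append(["git", "mv", old, new])
    return commands


if __name__ == "__main__":
    # Print the plan for review instead of executing it.
    for argv in plan_moves(RENAMES):
        print(" ".join(argv))
```

Printing the plan first keeps the move reviewable; each line can then be run from the repository root.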
HA LifeKeeper Monitoring — New Pattern
SIOS LifeKeeper is a Linux/Windows HA clustering solution that can be used on Amazon EC2 for application-aware failover scenarios. With FSx for ONTAP Multi-AZ as shared storage (NFS/iSCSI, depending on OS and configuration), this pattern focuses on observing LifeKeeper logs without putting monitoring agents on the HA nodes.
The new HA pattern (solutions/ha/lifekeeper-monitoring/) provides non-intrusive log analysis:
graph TB
subgraph "HA Cluster"
LK1[LifeKeeper Node 1<br/>Active]
LK2[LifeKeeper Node 2<br/>Standby]
end
subgraph "Shared Storage"
FSXN[FSx for ONTAP Multi-AZ]
S3AP[S3 Access Point<br/>Read-only log access]
end
subgraph "Analysis Pipeline"
SFN[Step Functions]
DISC[Discovery Lambda<br/>Log classification]
PROC[Processing Lambda<br/>Bedrock Root Cause Analysis]
RPT[Report Lambda<br/>Health score + alerts]
end
LK1 -->|Log write| FSXN
FSXN --> S3AP -->|Non-intrusive read| DISC
SFN --> DISC --> PROC --> RPT
PROC -->|Nova Pro| BEDROCK[Amazon Bedrock]
RPT --> SNS[SNS Alert]
Key Design Decisions
- Non-intrusive to HA nodes: No monitoring agent is installed on HA nodes. The S3 AP read path avoids host-level changes, while still consuming FSx/S3 API throughput like any other read workload.
- Human-in-the-loop: AI analysis is advisory only. LifeKeeper's own health checks handle failover decisions.
- Health Scoring: 0-100 score with deductions for failover events, comm path latency, and resource state anomalies.
- Root Cause Analysis: Bedrock Nova Pro analyzes state transitions (ISP→OSF, ISS→ISP) to identify likely causes.
Implementation note: This pattern observes LifeKeeper logs and produces advisory analysis. It does not replace LifeKeeper cluster design, quorum/witness configuration, split-brain prevention, protocol-specific recovery kit setup, or application-level failover testing.
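The health scoring decision above can be sketched as a simple deduction model. This is an illustration under stated assumptions: the penalty weights, the `LogFindings` fields, and the `health_score` function are hypothetical, since this article does not list the pattern's actual deduction values.

```python
from dataclasses import dataclass

# Hypothetical deduction weights; the real values are not stated in this article.
FAILOVER_PENALTY = 20
COMM_PATH_PENALTY = 10
RESOURCE_ANOMALY_PENALTY = 15


@dataclass(frozen=True)
class LogFindings:
    """Counts of anomalies found in one batch of LifeKeeper logs."""

    failover_events: int = 0
    comm_path_latency_warnings: int = 0
    resource_state_anomalies: int = 0


def health_score(findings: LogFindings) -> int:
    """Start at 100 and subtract a penalty per finding, never going below 0."""
    deductions = (
        findings.failover_events * FAILOVER_PENALTY
        + findings.comm_path_latency_warnings * COMM_PATH_PENALTY
        + findings.resource_state_anomalies * RESOURCE_ANOMALY_PENALTY
    )
    return max(0, 100 - deductions)


if __name__ == "__main__":
    print(health_score(LogFindings(failover_events=1, resource_state_anomalies=1)))
```

Clamping at 0 keeps the score inside the documented 0-100 range even when a log batch contains many events. The score remains advisory and does not feed into failover decisions.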
Example Operational Metrics
These are suggested evaluation metrics for future validation. Phase 18 verifies the DemoMode pipeline, not real-cluster failover triage time.
| Category | Metric | Demo Target |
|---|---|---|
| Operations | Time from workflow start to structured triage report | < 10 min |
| Technical | Log discovery completeness | All configured LifeKeeper log paths in scope |
| Quality | False positive alert rate | < 5% (requires real-cluster validation) |
| Cost | Monthly monitoring cost | < $15 (5-min polling, 1 cluster) |
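To put the cost target in context, the arithmetic behind 5-minute polling can be worked out directly. The sketch below assumes a 30-day month and uses hypothetical helper names. It derives only the execution count and the per-execution budget implied by the $15 target, and says nothing about actual AWS pricing.

```python
POLL_INTERVAL_MINUTES = 5   # from the table: 5-min polling
MONTHLY_BUDGET_USD = 15.0   # from the table: < $15 per month
DAYS_PER_MONTH = 30         # assumption: a 30-day month


def executions_per_month(interval_minutes: int, days: int = DAYS_PER_MONTH) -> int:
    """Number of workflow runs in a month at a fixed polling interval."""
    return days * 24 * 60 // interval_minutes


def budget_per_execution(budget_usd: float, executions: int) -> float:
    """Upper bound on the average cost of one run that still meets the budget."""
    return budget_usd / executions


if __name__ == "__main__":
    runs = executions_per_month(POLL_INTERVAL_MINUTES)
    print(f"{runs} runs/month, budget {budget_per_execution(MONTHLY_BUDGET_USD, runs):.6f} USD per run")
```

Under these assumptions, every run (Lambda, Step Functions, and Bedrock calls combined) has to average below roughly a fifth of a cent to stay within the target.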
Deploy (DemoMode)
DemoMode=true deploys without FSx for ONTAP and uses a regular S3 bucket with sample LifeKeeper logs:
cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided \
--parameter-overrides DemoMode=true \
S3AccessPointAlias=your-demo-bucket \
OutputBucketName=your-output-bucket
DemoMode verification confirms: SAM template deploys successfully, Step Functions workflow executes all states, Discovery Lambda classifies sample logs correctly, and Processing Lambda generates a health score report. Bedrock analysis produces structured, advisory observations based on log patterns; failover decisions remain outside the AI workflow.
Verified in ap-northeast-1 on 2026-06-21. The sample log set intentionally contains failure-like events, resulting in a demo health score of 40/100. The score is not a benchmark for LifeKeeper or FSx for ONTAP; it demonstrates that the scoring pipeline detects and reports anomalies from sample logs.
Coming in Phase 19: Full E2E verification with a real SIOS LifeKeeper HA cluster (AWS Marketplace + FSx for ONTAP Multi-AZ) including live failover testing, actual state transition detection, and Bedrock RCA quality assessment against real-world failure scenarios.
Category Architecture Diagrams
Phase 18 adds five Mermaid architecture diagrams to README.md (both JA and EN), each in a collapsible <details> block:
| Category | Key Components |
|---|---|
| 🏭 FlexCache | ONTAP REST API → HealthCheck → RouteDecision → DynamoDB routing → Create/Cleanup lifecycle |
| 🤖 GenAI | FPolicy → SQS → EventBridge → Bedrock KB → RetrieveAndGenerate / Agentic tools |
| 🛡️ HA | S3 AP non-intrusive read → Bedrock RCA → Health score → SNS alerts |
| ⚡ Event-Driven | FPolicy Engine → ECS Fargate TCP → SQS → EventBridge rule routing |
| 🌐 Edge/CDN | 3 delivery modes (ORIGIN_PULL, OAC, PUBLISH_PUSH) → vendor-neutral CDN |
All diagrams include accTitle and accDescr for screen reader accessibility.
Project Infrastructure Improvements
pyproject.toml (PEP 621)
Modern Python project metadata with unified tool configuration:
[tool.ruff]
target-version = "py312"
line-length = 120
[tool.pytest.ini_options]
addopts = "-v --tb=short --import-mode=importlib"
[tool.coverage.report]
fail_under = 80
Dependency Pinning
# requirements.txt — exact versions for reproducibility
boto3==1.43.29
urllib3==2.7.0
jsonschema==4.17.3
# requirements-dev.txt
pytest==9.1.0
hypothesis==6.155.2
moto==5.2.2
ruff==0.15.17
cfn-lint==1.51.4
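Exact pins only help if every line stays pinned. A small check like the following could guard that in CI. It is a sketch, not part of the repository: `unpinned_requirements` is a hypothetical helper, and the parsing is deliberately simple (it strips `#` comments and skips pip options such as `-r`, but does not handle URLs or environment markers).

```python
import re

# A requirement counts as pinned when it has the form name==version (extras allowed).
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    offenders: list[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r other.txt
        if not PINNED.match(line):
            offenders.append(line)
    return offenders


if __name__ == "__main__":
    sample = "# requirements.txt\nboto3==1.43.29\nurllib3==2.7.0\njsonschema==4.17.3\n"
    print(unpinned_requirements(sample))
```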
.cfnlintrc
Project-wide cfn-lint configuration that discovers all templates under solutions/:
templates:
- "solutions/**/template.yaml"
- "solutions/**/template-deploy.yaml"
ignore_checks:
- W3002 # Local CodeUri (sam build handles upload)
- W1031 # Fn::Sub false positive with Secrets Manager ARNs
regions:
- ap-northeast-1
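To preview which files the two template globs would pick up, the same patterns can be evaluated with `pathlib`. This is a rough sketch: `discover_templates` is a hypothetical helper, and cfn-lint's own glob expansion may differ in edge cases, so treat the output as an approximation of what the linter sees.

```python
from pathlib import Path

# Same patterns as the templates section of .cfnlintrc above.
TEMPLATE_GLOBS = ("solutions/**/template.yaml", "solutions/**/template-deploy.yaml")


def discover_templates(repo_root: Path) -> list[Path]:
    """Return template paths under repo_root that match the configured globs."""
    found: set[Path] = set()
    for pattern in TEMPLATE_GLOBS:
        found.update(repo_root.glob(pattern))
    # Report paths relative to the repository root, in a stable order.
    return sorted(path.relative_to(repo_root) for path in found)


if __name__ == "__main__":
    for template in discover_templates(Path(".")):
        print(template)
```

Run from the repository root, this lists only templates below solutions/, so a template left behind at an old flat path would be missing from the output.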
Additional Tooling
| File | Purpose |
|---|---|
| .gitattributes | Consistent line endings, language-specific diff drivers |
| .github/PULL_REQUEST_TEMPLATE.md | Project-specific PR checklist |
| solutions/README.md | Category navigation index |
| CHANGELOG.md | Keep a Changelog format, all releases |
| CONTRIBUTING.md | "Adding a New Pattern" section with category guide |
CI/CD Changes
The CI pipeline was split to avoid namespace collisions under pytest's importlib import mode, which occur when multiple patterns with identically named handler.py files are collected together:
# Before: single pytest invocation (collision risk)
pytest shared/tests/ solutions/**/tests/ --cov=shared
# After: isolated invocations
- name: Run shared tests with coverage
run: pytest shared/tests/ --cov=shared
- name: Run pattern tests
run: pytest solutions/industry/*/tests/ solutions/flexcache/*/tests/ ...
The pipeline also sets persist-credentials: false on all actions/checkout steps (zizmor security hardening).
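The split can be taken one step further by giving each pattern its own pytest process. The sketch below is a stricter variant of the CI change described above, not what the pipeline does: `plan_pytest_runs` is a hypothetical helper, and it assumes test directories follow the solutions/<category>/<pattern>/tests layout.

```python
from pathlib import Path


def plan_pytest_runs(repo_root: Path) -> list[list[str]]:
    """Build one pytest command for shared tests and one per pattern test directory."""
    runs: list[list[str]] = [["pytest", "shared/tests/", "--cov=shared"]]
    for tests_dir in sorted(repo_root.glob("solutions/*/*/tests")):
        if tests_dir.is_dir():
            # A separate process per pattern means same-named modules never share sys.modules.
            runs.append(["pytest", tests_dir.relative_to(repo_root).as_posix() + "/"])
    return runs


if __name__ == "__main__":
    for argv in plan_pytest_runs(Path(".")):
        print(" ".join(argv))
```

The trade-off is more process startups in exchange for full isolation, which may or may not be worth it depending on how many patterns reuse the same module names.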
Multi-Perspective Review
The restructuring and HA monitoring pattern were reviewed from partner delivery, storage architecture, HA operations, security, CI/CD, accessibility, and contributor onboarding perspectives. The review resulted in wording changes around neutrality, operational caveats, connector validation, HA safety boundaries, and repository discoverability.
What This Means for You
| Role | Impact |
|---|---|
| Partner / SI / delivery team | Pattern selection is now intuitive by category — pick solutions/industry/ for industry PoCs, solutions/flexcache/ for distributed workloads |
| New contributor | CONTRIBUTING.md now includes "Adding a New Pattern" with required checklist and category selection guide |
| Existing repository users | git log --follow <file> still works. sam build and sam deploy are unchanged (run from the pattern directory) |
| CI/CD | Zero changes needed in your samconfig.toml or deployment scripts — CodeUri is relative to template.yaml |
Infrastructure cost: Zero. This is a repository organization change; it does not deploy AWS resources by itself.
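The CI/CD row above rests on CodeUri being resolved relative to template.yaml. The sketch below illustrates that with plain path arithmetic. `resolve_code_uri` is a hypothetical helper and `functions/discovery/` is an assumed CodeUri value, not one taken from the repository.

```python
from pathlib import PurePosixPath


def resolve_code_uri(template_path: str, code_uri: str) -> PurePosixPath:
    """Resolve a relative CodeUri against the directory that holds the template."""
    return PurePosixPath(template_path).parent / code_uri


if __name__ == "__main__":
    code_uri = "functions/discovery/"  # assumed value, for illustration only
    # The same CodeUri string resolves correctly before and after the move,
    # because the template and its code directory moved together.
    print(resolve_code_uri("sap-erp-adjacent/template.yaml", code_uri))
    print(resolve_code_uri("solutions/sap/erp-adjacent/template.yaml", code_uri))
```

Since the template and the code it references moved as one unit, the CodeUri text inside the template does not need editing.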
What's Next
- Phase 16-17 blog articles (GenAI patterns) — ready to publish
- dev.to series updated for directory restructuring (Phase 13 link fixed)
- Next pattern candidates under evaluation
Getting Started
git clone https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns.git
cd FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
# Browse patterns by category
ls solutions/
# Deploy HA LifeKeeper monitoring (DemoMode)
cd solutions/ha/lifekeeper-monitoring
sam build && sam deploy --guided --parameter-overrides DemoMode=true
# Run tests
make test-quick PYTHON=.venv/bin/python
Yoshiki Fujiwara
