1. Solution Overview
The proposed solution is a cloud-native, multi-tenant business directory platform built on AWS using a hybrid microservices and serverless architecture. The platform lets businesses list their services, lets users search for and discover local businesses, and generates revenue through premium listings, advertisements, and subscription tiers.
Key Business Objectives:
- Deliver a highly available search and discovery experience with 99.95% uptime
- Support millions of business listings with real-time updates
- Enable geospatial search with sub-second response times
- Scale elastically based on traffic patterns (peak/off-peak)
- Minimize operational overhead through managed services
- Support multi-region deployment for global reach
Architectural Approach: Event-driven microservices with serverless components for cost optimization, leveraging managed services for search (OpenSearch), caching (ElastiCache), and databases (Aurora PostgreSQL + DynamoDB).
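To make the geospatial search objective concrete, here is a minimal sketch of the kind of request body a search service might send to OpenSearch. The `bool` query, `geo_distance` filter, and `_geo_distance` sort are standard OpenSearch query DSL. The field names (`name`, `location`), the helper `build_nearby_query`, and the 5 km default radius are assumptions for illustration and are not part of the design.

```python
import json


def build_nearby_query(text: str, lat: float, lon: float,
                       radius_km: float = 5.0, size: int = 20) -> dict:
    """Build an OpenSearch query body: full-text match filtered by distance.

    Assumes a hypothetical index mapping with a text field `name` and a
    `geo_point` field `location`.
    """
    if not -90.0 <= lat <= 90.0 or not -180.0 <= lon <= 180.0:
        raise ValueError("latitude/longitude out of range")
    origin = {"lat": lat, "lon": lon}
    return {
        "size": size,
        "query": {
            "bool": {
                # Relevance scoring comes from the text match.
                "must": [{"match": {"name": text}}],
                # The distance filter does not affect scoring and can be cached.
                "filter": [
                    {"geo_distance": {"distance": f"{radius_km}km", "location": origin}}
                ],
            }
        },
        # Sort nearest-first.
        "sort": [
            {"_geo_distance": {"location": origin, "order": "asc", "unit": "km"}}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_nearby_query("coffee", 40.7128, -74.0060), indent=2))
```

Keeping the distance constraint in `filter` rather than `must` is one common way to keep latency low, since filters skip scoring.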
2. Architecture Components
AWS Services & Resources
Compute Layer
- Amazon ECS on Fargate (serverless containers)
  - API Gateway Service: 2 vCPU, 4GB RAM, auto-scale 2-20 tasks
  - Business Management Service: 2 vCPU, 4GB RAM, auto-scale 2-15 tasks
  - User Service: 1 vCPU, 2GB RAM, auto-scale 2-10 tasks
  - Review & Rating Service: 1 vCPU, 2GB RAM, auto-scale 2-10 tasks
- AWS Lambda (event-driven functions)
  - Image processing: 1024MB, 60s timeout
  - Search indexing: 512MB, 30s timeout
  - Email notifications: 256MB, 15s timeout
  - Analytics aggregation: 1024MB, 120s timeout
Storage Layer
- Amazon S3
  - Business images/logos: S3 Standard (with lifecycle to Glacier after 1 year)
  - Static website assets: S3 Standard with CloudFront CDN
  - Backups: S3 Intelligent-Tiering
  - Bucket policies: Versioning enabled, encryption at rest (SSE-S3)
- Amazon EBS
  - gp3 volumes for OpenSearch nodes (200GB per node)
Database Layer
- Amazon Aurora PostgreSQL (version 15.x)
  - Primary DB: db.r6g.xlarge (4 vCPU, 32GB RAM) - Multi-AZ
  - Read replicas: 2x db.r6g.large (2 vCPU, 16GB RAM)
  - Database: Business listings, user accounts, subscriptions, transactions
  - Aurora I/O-Optimized configuration for predictable costs
- Amazon DynamoDB
  - User sessions (on-demand capacity)
  - Real-time analytics counters (provisioned: 50 RCU, 25 WCU)
  - Business activity logs (on-demand capacity)
- Amazon OpenSearch Service
  - Domain: business-directory-search
  - Master nodes: 3x c6g.large.search (2 vCPU, 4GB RAM)
  - Data nodes: 6x r6g.xlarge.search (4 vCPU, 32GB RAM, 200GB gp3 each)
  - Multi-AZ with 1 replica per index
- Amazon ElastiCache for Redis
  - cache.r6g.large (2 nodes, cluster mode enabled)
  - Cache: API responses, session data, frequently accessed listings
Networking Layer
- Amazon VPC
  - CIDR: 10.0.0.0/16
  - Public Subnets: 10.0.1.0/24 (AZ-a), 10.0.2.0/24 (AZ-b), 10.0.3.0/24 (AZ-c)
  - Private Subnets (App): 10.0.11.0/24 (AZ-a), 10.0.12.0/24 (AZ-b), 10.0.13.0/24 (AZ-c)
  - Private Subnets (Data): 10.0.21.0/24 (AZ-a), 10.0.22.0/24 (AZ-b), 10.0.23.0/24 (AZ-c)
  - NAT Gateways: 3 (one per AZ for high availability)
- Application Load Balancer (ALB)
  - Internet-facing ALB for web traffic
  - Internal ALB for microservices communication
  - SSL/TLS termination with ACM certificates
- Amazon CloudFront
  - Global CDN for static assets, images, and API caching
  - Custom domain with Route 53 integration
- Amazon Route 53
  - Hosted zone for domain management
  - Geolocation routing for multi-region setup
  - Health checks for failover
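The subnet plan above can be sanity-checked before it is handed to Terraform. The following is a minimal sketch using only Python's standard `ipaddress` module; the CIDR values come from the VPC layout above, while the function names and tier labels are illustrative.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet CIDRs as listed in the networking layer, one per AZ (a, b, c).
SUBNETS = {
    "public": ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private-app": ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"],
    "private-data": ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"],
}


def validate_subnet_plan(vpc, tiers):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = [(tier, ipaddress.ip_network(cidr))
            for tier, cidrs in tiers.items() for cidr in cidrs]
    # Every subnet must sit inside the VPC range.
    for tier, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{tier} {net} is outside {vpc}")
    # No two subnets may overlap.
    for (tier_a, net_a), (tier_b, net_b) in combinations(nets, 2):
        if net_a.overlaps(net_b):
            problems.append(f"{tier_a} {net_a} overlaps {tier_b} {net_b}")
    return problems


def usable_hosts(cidr: str) -> int:
    """Usable addresses in an AWS subnet (AWS reserves 5 addresses per subnet)."""
    return ipaddress.ip_network(cidr).num_addresses - 5


if __name__ == "__main__":
    print(validate_subnet_plan(VPC_CIDR, SUBNETS) or "subnet plan is consistent")
    print("usable addresses per /24:", usable_hosts("10.0.1.0/24"))
```

Each /24 leaves 251 usable addresses, which bounds how many Fargate tasks (one ENI each) a single application subnet can hold.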
Security Services
- AWS IAM
  - Service roles for ECS tasks, Lambda functions
  - OIDC provider for GitHub Actions CI/CD
  - Least privilege policies for all resources
- AWS Secrets Manager
  - Database credentials rotation (every 30 days)
  - API keys for third-party integrations
  - Encryption keys management
- AWS KMS
  - Customer-managed keys for S3, RDS, DynamoDB encryption
  - Separate keys per environment (dev, staging, prod)
- AWS WAF
  - Rate limiting: 2000 requests per 5 minutes per IP
  - SQL injection and XSS protection rules
  - Geo-blocking for specific countries (if needed)
- AWS Shield Standard (included by default)
- Amazon GuardDuty (threat detection)
- AWS Security Hub (compliance monitoring)
Monitoring & Logging
- Amazon CloudWatch
  - Logs: Centralized logging for all services (retention: 30 days)
  - Metrics: Custom metrics for business KPIs
  - Alarms: CPU, memory, disk, latency, error rates
  - Dashboards: Real-time operational visibility
- AWS X-Ray (distributed tracing)
- AWS CloudTrail (API audit logging, 90-day retention)
CI/CD Pipeline
- AWS CodePipeline (orchestration)
- AWS CodeBuild (build and test)
- AWS CodeDeploy (deployment to ECS)
- Amazon ECR (container registry)
Other Managed Services
- Amazon SES (transactional emails)
- Amazon SNS (notifications, alerts)
- Amazon SQS (message queuing for async processing)
- Amazon EventBridge (event routing)
- AWS Backup (centralized backup management)
Infrastructure-as-Code Tools
Primary IaC: Terraform (recommended for multi-cloud portability and mature ecosystem)
- Terraform v1.6+ with AWS Provider v5.x
- State management: S3 backend with DynamoDB state locking
- Modular structure: VPC, ECS, RDS, OpenSearch, monitoring modules
- Environment management: Workspaces for dev/staging/prod
- Secret management: Terraform Cloud or SOPS for sensitive variables
Alternative: AWS CDK (TypeScript) for teams preferring programmatic infrastructure
Configuration Management:
- AWS Systems Manager Parameter Store for application configuration
- AWS AppConfig for feature flags and dynamic configuration
Third-Party Tools/Platforms
Container Orchestration:
- ECS Fargate (managed, no Kubernetes overhead needed for this use case)
- Docker Engine 24.x for local development
- Docker Compose for local multi-service testing
CI/CD Platform:
- GitHub Actions (primary - free for public repos, integrated with AWS OIDC)
- Alternative: GitLab CI or Jenkins for on-premise integration
Monitoring & Observability:
- Datadog or New Relic (optional, for enhanced APM)
- Grafana (self-hosted or Grafana Cloud) for custom dashboards
- Prometheus (for Kubernetes if migrating from ECS in future)
SaaS Integrations:
- Stripe for payment processing (subscription management)
- Twilio for SMS notifications (optional)
- Google Maps API or Mapbox for geocoding and maps
- Algolia (optional alternative to OpenSearch for simpler search needs)
- SendGrid (backup email provider)
Programming Languages & Frameworks
Backend Services:
- Node.js 20.x LTS with Express.js or NestJS (microservices framework)
- Python 3.11+ with FastAPI (for ML/analytics services)
- Go 1.21+ (for high-performance services like search indexing)
Frontend:
- React 18+ with Next.js 14 (SSR/SSG for SEO)
- TypeScript 5.x (type safety)
- Tailwind CSS or Material-UI for styling
Mobile (Optional Future Phase):
- React Native or Flutter for cross-platform apps
Scripting & Automation:
- Bash/Shell for deployment scripts
- Python for data migration and ETL jobs
- Node.js for Lambda functions
Libraries & Frameworks:
- Sequelize/TypeORM (ORM for PostgreSQL)
- AWS SDK (JavaScript, Python, Go)
- OpenSearch JavaScript Client
- Redis Client (ioredis)
- Jest/Mocha (unit testing)
- Cypress/Playwright (E2E testing)
Hardware/Compute Specifications
ECS Fargate Task Specifications:
- API Gateway Service: 2 vCPU, 4GB RAM (handles routing, authentication)
- Business Service: 2 vCPU, 4GB RAM (CRUD operations, complex queries)
- User Service: 1 vCPU, 2GB RAM (lightweight user operations)
- Review Service: 1 vCPU, 2GB RAM (moderate load)
Auto-scaling Configuration:
- Target CPU Utilization: 70%
- Target Memory Utilization: 80%
- Scale-out cooldown: 60 seconds
- Scale-in cooldown: 300 seconds
- Min tasks: 2 per service (high availability)
- Max tasks: 10-20 per service (based on load testing)
Lambda Configuration:
- Image Processing: 1024MB, 60s timeout (handles image resize/optimization)
- Search Indexing: 512MB, 30s timeout (bulk indexing to OpenSearch)
- Email Service: 256MB, 15s timeout (SES integration)
- Analytics: 1024MB, 120s timeout (aggregation jobs)
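As an illustration of the search indexing function, the sketch below shows how a Lambda handler might turn a batch of SQS messages into an OpenSearch `_bulk` request body. The SQS event shape (`Records`, `body`, `messageId`) and the `batchItemFailures` response are the standard Lambda contract for SQS event sources. The index name, the `id` field on each message, and the helper names are assumptions, and the HTTP call to the domain is deliberately left out.

```python
import json

INDEX_NAME = "business-listings"  # hypothetical index name, not from the design


def build_bulk_body(documents):
    """Serialize documents into the newline-delimited _bulk request format."""
    lines = []
    for doc in documents:
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline


def process_records(records):
    """Split SQS records into a bulk body and a list of failed message IDs."""
    documents, failures = [], []
    for record in records:
        try:
            doc = json.loads(record["body"])
            if not isinstance(doc, dict) or "id" not in doc:
                raise ValueError("message must be a JSON object with an 'id'")
            documents.append(doc)
        except ValueError:  # also covers json.JSONDecodeError
            failures.append({"itemIdentifier": record["messageId"]})
    bulk_body = build_bulk_body(documents) if documents else ""
    return bulk_body, failures


def handler(event, context):
    """Lambda entry point for an SQS event source mapping."""
    bulk_body, failures = process_records(event.get("Records", []))
    # Sending bulk_body to the domain's _bulk endpoint is omitted in this sketch.
    # Returning only the failed IDs lets SQS retry just those messages
    # (requires ReportBatchItemFailures on the event source mapping).
    return {"batchItemFailures": failures}
```

Batching many listings into one `_bulk` call is what makes the 30s timeout and 512MB allocation plausible for this function.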
Database Sizing:
- Aurora Primary: db.r6g.xlarge (4 vCPU, 32GB RAM) - handles 500-1000 TPS
- Aurora Replicas: 2x db.r6g.large - distributes read load
- Auto-scaling: Read replicas scale 2-5 based on CPU > 75%
OpenSearch Cluster:
- Master Nodes: 3x c6g.large.search (dedicated for cluster management)
- Data Nodes: 6x r6g.xlarge.search (search and indexing operations)
- Storage per node: 200GB gp3 (1.2TB raw across 6 data nodes)
- Replicas: 1 per index (doubles the storage requirement, leaving roughly 600GB for primary data before overhead)
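The storage figures above can be turned into a quick capacity check. This is a rough sketch: the node count, volume size, and replica count come from the cluster specification, while the 25% overhead allowance and the 30GB target shard size are assumptions chosen for illustration.

```python
import math


def primary_capacity_gb(nodes: int, gb_per_node: float, replicas: int,
                        overhead: float = 0.25) -> float:
    """Estimate how much primary (un-replicated) data the cluster can hold.

    `overhead` is an assumed fraction set aside for disk watermarks,
    segment merges, and service-reserved space.
    """
    raw = nodes * gb_per_node
    usable = raw * (1 - overhead)
    # Each replica stores a full extra copy of every primary shard.
    return usable / (1 + replicas)


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


if __name__ == "__main__":
    print("primary data, no overhead:", primary_capacity_gb(6, 200, 1, overhead=0.0), "GB")
    print("primary data, 25% overhead:", primary_capacity_gb(6, 200, 1), "GB")
    print("shards for a 100GB index:", primary_shard_count(100))
```

Under these assumptions the six data nodes hold about 450GB of primary data, so index growth should be tracked against that figure rather than the 1.2TB raw total.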
3. Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER LAYER (Global) │
│ [Web Browser] [Mobile App] [API Clients] │
└────────────────────────────┬────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ CONTENT DELIVERY NETWORK │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Amazon CloudFront (Global Edge Locations) │ │
│ │ - Static Assets Caching │ │
│ │ - API Response Caching (optional) │ │
│ │ - SSL/TLS Termination │ │
│ └────────────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DNS & ROUTING LAYER │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Amazon Route 53 │ │
│ │ - Health Checks - Geolocation Routing - Failover │ │
│ └────────────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SECURITY PERIMETER │
│ ┌────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
│ │ AWS WAF │ │ AWS Shield │ │ AWS GuardDuty │ │
│ │ - Rate Limit │ │ - DDoS │ │ - Threat Det. │ │
│ │ - SQL Inject. │ │ Protection │ │ │ │
│ └────────────────┘ └──────────────────┘ └─────────────────┘ │
└────────────────────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS REGION (us-east-1 / Primary) │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ VPC (10.0.0.0/16) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ PUBLIC SUBNETS (Multi-AZ) │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ │ 10.0.1.0/24 │ │ 10.0.2.0/24 │ │ 10.0.3.0/24 │ │ │ │
│ │ │ │ (AZ-a) │ │ (AZ-b) │ │ (AZ-c) │ │ │ │
│ │ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ [NAT GW-a] [NAT GW-b] [NAT GW-c] │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ ┌──────┴──────────────────┴──────────────────┴───────┐ │ │ │
│ │ │ │ Application Load Balancer (ALB) │ │ │ │
│ │ │ │ - SSL Termination (ACM Certificate) │ │ │ │
│ │ │ │ - Target Groups for ECS Services │ │ │ │
│ │ │ └──────────────────────┬─────────────────────────────┘ │ │ │
│ │ └─────────────────────────┼───────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────┼───────────────────────────────────┐ │ │
│ │ │ PRIVATE SUBNETS - APPLICATION TIER (Multi-AZ) │ │ │
│ │ │ ┌──────────────┐ ┌────┴─────────┐ ┌──────────────┐ │ │ │
│ │ │ │ 10.0.11.0/24 │ │ 10.0.12.0/24 │ │ 10.0.13.0/24 │ │ │ │
│ │ │ │ (AZ-a) │ │ (AZ-b) │ │ (AZ-c) │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ ECS FARGATE CLUSTER │ │ │ │
│ │ │ │ ┌─────────────────┐ ┌──────────────────┐ │ │ │ │
│ │ │ │ │ API Gateway Svc │ │ Business Mgmt │ │ │ │ │
│ │ │ │ │ (2-20 tasks) │ │ Service │ │ │ │ │
│ │ │ │ │ 2vCPU/4GB │ │ (2-15 tasks) │ │ │ │ │
│ │ │ │ └─────────────────┘ └──────────────────┘ │ │ │ │
│ │ │ │ ┌─────────────────┐ ┌──────────────────┐ │ │ │ │
│ │ │ │ │ User Service │ │ Review & Rating │ │ │ │ │
│ │ │ │ │ (2-10 tasks) │ │ Service │ │ │ │ │
│ │ │ │ │ 1vCPU/2GB │ │ (2-10 tasks) │ │ │ │ │
│ │ │ │ └─────────────────┘ └──────────────────┘ │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ AWS LAMBDA FUNCTIONS │ │ │ │
│ │ │ │ [Image Processor] [Search Indexer] [Email Service] │ │ │ │
│ │ │ │ [Analytics Aggregator] │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ ElastiCache for Redis (Cluster Mode) │ │ │ │
│ │ │ │ - 2x cache.r6g.large nodes │ │ │ │
│ │ │ │ - Session cache, API cache, Listing cache │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ PRIVATE SUBNETS - DATA TIER (Multi-AZ) │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ │ 10.0.21.0/24 │ │ 10.0.22.0/24 │ │ 10.0.23.0/24 │ │ │ │
│ │ │ │ (AZ-a) │ │ (AZ-b) │ │ (AZ-c) │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Amazon Aurora PostgreSQL (Multi-AZ) │ │ │ │
│ │ │ │ - Primary: db.r6g.xlarge (AZ-a) │ │ │ │
│ │ │ │ - Replica: db.r6g.large (AZ-b) │ │ │ │
│ │ │ │ - Replica: db.r6g.large (AZ-c) │ │ │ │
│ │ │ │ [Business, Users, Subscriptions, Transactions] │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Amazon OpenSearch Service (Multi-AZ) │ │ │ │
│ │ │ │ - 3x c6g.large.search (Master nodes) │ │ │ │
│ │ │ │ - 6x r6g.xlarge.search (Data nodes) │ │ │ │
│ │ │ │ [Full-text search, Geospatial queries, Analytics] │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ REGIONAL MANAGED SERVICES │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ DynamoDB │ │ S3 Buckets │ │ SQS Queues │ │ │
│ │ │ - Sessions │ │ - Images │ │ - Events │ │ │
│ │ │ - Analytics │ │ - Backups │ │ - Async Jobs │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ SNS Topics │ │ SES │ │ EventBridge │ │ │
│ │ │ - Alerts │ │ - Emails │ │ - Event Router │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ MONITORING & SECURITY │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ CloudWatch │ │ X-Ray │ │ CloudTrail │ │ │
│ │ │ - Logs │ │ - Tracing │ │ - Audit Logs │ │ │
│ │ │ - Metrics │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ Secrets Mgr │ │ KMS │ │ Security Hub │ │ │
│ │ │ - Creds │ │ - Encrypt │ │ - Compliance │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CI/CD PIPELINE (GitHub / AWS) │
│ [GitHub] → [GitHub Actions] → [CodeBuild] → [ECR] → [CodeDeploy] → [ECS] │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISASTER RECOVERY REGION (us-west-2 / Secondary) │
│ [Standby Aurora Replica] [S3 Cross-Region Replication] [AMI Backups] │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow:
- User requests → Route 53 (DNS resolution) → CloudFront → WAF → ALB
- ALB → ECS Services (API Gateway → Business/User/Review services)
- Services → Aurora (write), Read Replicas (read), OpenSearch (search)
- Services → ElastiCache (cache check) → DynamoDB (sessions/analytics)
- Async operations → SQS → Lambda → S3/OpenSearch/SNS
- All logs → CloudWatch, traces → X-Ray, audit → CloudTrail
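The cache check in this flow is typically implemented as a cache-aside read. The sketch below illustrates the pattern with a small in-memory stand-in for Redis so that it runs without any infrastructure; `TTLCache`, `get_listing`, the `listing:<id>` key format, and the 300-second TTL are all illustrative assumptions rather than part of the design.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SET-with-expiry semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._store: Dict[str, Tuple[float, dict]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def get_listing(listing_id, cache: TTLCache,
                load_from_db: Callable[[int], Optional[dict]],
                ttl_seconds: float = 300) -> Optional[dict]:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"listing:{listing_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    listing = load_from_db(listing_id)  # e.g. a read-replica query
    if listing is not None:
        cache.set(key, listing, ttl_seconds)
    return listing
```

In a real service the `TTLCache` calls would map to Redis `GET` and `SET` with an expiry, and writes to a listing would delete or overwrite the cached key.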
4. High Availability & Disaster Recovery
Multi-AZ Deployment Strategy
- Compute: ECS tasks distributed across 3 AZs (us-east-1a, 1b, 1c)
- Database: Aurora primary in AZ-a, replicas in AZ-b and AZ-c with automatic failover (30-120 seconds)
- Search: OpenSearch deployed across 3 AZs with 1 replica shard per index
- Cache: ElastiCache cluster mode with nodes in multiple AZs
- Load Balancer: ALB with cross-zone load balancing enabled
- NAT Gateways: 3 NAT Gateways (one per AZ) to eliminate single points of failure
Auto-Scaling Policies
ECS Service Auto-Scaling:
- Metric: Target CPU 70%, Memory 80%
- Scale-out: Add 50% capacity when threshold exceeded for 2 minutes
- Scale-in: Remove 25% capacity when below 40% for 10 minutes
- Cooldown: 60s scale-out, 300s scale-in
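The scaling rules above can be expressed as a small decision function. This is a simplified sketch of the stated policy, not how Application Auto Scaling is implemented: it ignores the sustained-duration windows and cooldowns, and it assumes that scale-in requires both CPU and memory to be below 40%, which the policy does not spell out.

```python
import math


def next_task_count(current: int, cpu_pct: float, mem_pct: float,
                    min_tasks: int = 2, max_tasks: int = 20) -> int:
    """Apply the scale-out / scale-in rules once and clamp to the service limits."""
    if cpu_pct > 70 or mem_pct > 80:
        # Scale out: add 50% capacity, rounding up so small services still grow.
        desired = math.ceil(current * 1.5)
    elif cpu_pct < 40 and mem_pct < 40:
        # Scale in: remove 25% capacity (assumption: both metrics must be low).
        desired = math.floor(current * 0.75)
    else:
        desired = current
    return max(min_tasks, min(max_tasks, desired))


if __name__ == "__main__":
    for current, cpu, mem in [(4, 85, 50), (4, 30, 30), (2, 30, 30), (16, 90, 50)]:
        print(current, cpu, mem, "->", next_task_count(current, cpu, mem))
```

The clamp to `min_tasks` is what preserves the two-task floor needed for availability across AZs.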
Aurora Read Replica Auto-Scaling:
- Trigger: CPU > 75% for 5 minutes
- Min replicas: 2, Max replicas: 5
- Scale-in: CPU < 40% for 15 minutes
OpenSearch Auto-Scaling:
- Storage: Auto-scale when 80% full (up to 3TB per node)
- Manual scaling for data nodes based on query performance
Backup & Restore
Aurora PostgreSQL:
- Automated backups: Daily, 7-day retention
- Manual snapshots: Weekly, 30-day retention
- Point-in-time recovery: Any point within the 7-day retention window, up to roughly the last 5 minutes
- Cross-region backup: Daily snapshot copy to us-west-2
OpenSearch:
- Automated snapshots: Hourly to S3 (24-hour retention)
- Manual snapshots: Daily, 14-day retention
- Restore time: ~15-30 minutes for 100GB index
DynamoDB:
- Point-in-time recovery (PITR): Enabled, 35-day retention
- On-demand backups: Weekly to S3
S3:
- Versioning: Enabled on all buckets
- Cross-region replication: Critical buckets to us-west-2
- Lifecycle policies: Transition to Glacier after 365 days
RTO/RPO Targets
| Component | RPO (Data Loss) | RTO (Downtime) | Mechanism |
|---|---|---|---|
| Aurora DB | < 5 minutes | < 2 minutes | Multi-AZ automated failover |
| OpenSearch | < 1 hour | < 30 minutes | Snapshot restore |
| DynamoDB | < 1 second | < 1 minute | Multi-AZ replication |
| ECS Services | 0 (stateless) | < 1 minute | Auto-scaling, health checks |
| S3 | 0 (versioning) | Immediate | Multi-AZ storage |
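It helps to compare these RTO targets with the downtime that the 99.95% availability objective allows. The sketch below does that arithmetic; the 30-day month is an assumption, and the RTO values are the upper bounds from the table above.

```python
import math


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    return days * 24 * 60 * (1 - availability_pct / 100)


def events_within_budget(rto_minutes: float, availability_pct: float = 99.95,
                         days: int = 30) -> int:
    """How many worst-case recoveries of this length fit in the budget."""
    return math.floor(downtime_budget_minutes(availability_pct, days) / rto_minutes)


if __name__ == "__main__":
    print(f"monthly budget at 99.95%: {downtime_budget_minutes(99.95):.1f} minutes")
    print("Aurora failovers (2 min each):", events_within_budget(2))
    print("OpenSearch restores (30 min each):", events_within_budget(30))
```

At 99.95% the monthly budget is about 21.6 minutes, so a single worst-case OpenSearch snapshot restore would exceed it if search is counted as part of availability.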
Failover Mechanisms
- DNS Failover: Route 53 health checks with automatic failover to DR region (TTL: 60s)
- Database Failover: Aurora automatic failover to read replica (30-120s)
- Application Failover: ALB health checks remove unhealthy targets in 30s
- Cache Failover: ElastiCache automatic node replacement in cluster mode
5. Security Implementation
Network Security
Security Groups:
- ALB-SG: Inbound 443 (0.0.0.0/0), Outbound 8080 (ECS-SG)
- ECS-SG: Inbound 8080 (ALB-SG), Outbound 443 (all, including OpenSearch-SG), 5432 (RDS-SG), 6379 (Cache-SG)
- RDS-SG: Inbound 5432 (ECS-SG), Outbound none
- OpenSearch-SG: Inbound 443 (ECS-SG; OpenSearch Service domains serve HTTPS on 443), Outbound none
- Cache-SG: Inbound 6379 (ECS-SG), Outbound none
NACLs:
- Public subnets: Allow 80, 443 inbound, ephemeral outbound
- Private subnets: Deny all inbound from internet, allow VPC CIDR
- Data subnets: Deny all except from application subnet CIDR
AWS WAF Rules:
- Rate limiting: 2000 requests per 5 minutes per IP
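- SQL Injection: AWS Managed Rules SQL database rule group (AWSManagedRulesSQLiRuleSet, e.g. SQLi_QUERYARGUMENTS)
- XSS: AWS Managed Rules core rule set (AWSManagedRulesCommonRuleSet: CrossSiteScripting_BODY, CrossSiteScripting_COOKIE)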
- Geographic blocking: Block traffic from high-risk countries (optional)
- IP reputation lists: AWS IP reputation managed rule group
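To illustrate what the rate limiting rule means in practice, here is a minimal sliding-window limiter keyed by client IP. AWS WAF enforces rate-based rules itself and its internal evaluation differs (it samples request rates periodically), so this sketch only models the intended behavior of 2000 requests per 5 minutes per IP; the class and method names are illustrative.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client IP."""

    def __init__(self, limit: int = 2000, window_seconds: float = 300):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=3, window_seconds=10)
    print([limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 10)])
```

The same structure can back an application-level limiter (for example per API key) where WAF's per-IP granularity is too coarse.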
IAM Roles & Policies (Least Privilege)
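ECS Task Execution Role (ecr:GetAuthorizationToken requires Resource "*"; in production, move the logs and secretsmanager actions into separate statements scoped to specific ARNs):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "*"
    }
  ]
}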
ECS Task Role (per service):
- Business Service: RDS access, S3 read/write, OpenSearch write
- User Service: RDS access, DynamoDB access, SES send
- API Gateway: No direct resource access (delegates to services)
Lambda Execution Roles:
- Image Processor: S3 read/write, CloudWatch Logs
- Search Indexer: OpenSearch write, SQS read, CloudWatch Logs
Data Encryption
At-Rest:
- Aurora: KMS encryption (customer-managed key: alias/directory-db)
- OpenSearch: KMS encryption (customer-managed key: alias/directory-search)
- DynamoDB: KMS encryption (customer-managed key: alias/directory-nosql)
- S3: SSE-S3 for non-sensitive, SSE-KMS for sensitive data
- EBS (OpenSearch): KMS encryption enabled
In-Transit:
- ALB → Clients: TLS 1.2+ (ACM certificate)
- ECS → RDS: TLS enforced (rds.force_ssl=1 in the Aurora PostgreSQL cluster parameter group)
- ECS → OpenSearch: HTTPS only
- ECS → ElastiCache: Redis AUTH + TLS enabled
- Inter-service: Internal ALB with TLS
Secrets Management
- AWS Secrets Manager: Database passwords, API keys, OAuth tokens
- Rotation: Automated 30-day rotation for RDS credentials
- Access: IAM policy enforcement, CloudTrail logging of all access
- Encryption: All secrets encrypted with KMS
Compliance Considerations
- PCI-DSS: If handling payments (Stripe integration reduces scope)
- GDPR: Data residency controls, encryption, right to deletion (S3 lifecycle)
- SOC 2: CloudTrail audit logs, encryption at rest/in transit
- HIPAA: Not applicable unless health-related businesses require it
DDoS Protection
- AWS Shield Standard: Automatic protection (included)
- AWS Shield Advanced: Optional (\$3000/month) for advanced protection
- CloudFront: Absorbs layer 3/4 attacks
- WAF Rate Limiting: Application-layer protection
- Auto-scaling: Absorbs traffic spikes
6. Well-Architected Framework Alignment
Operational Excellence
- Infrastructure as Code: 100% Terraform-managed infrastructure, version controlled in Git
- Monitoring: CloudWatch dashboards for all services, custom metrics for business KPIs (searches/min, listings created/hour)
- Alerting: SNS notifications for critical alarms (CPU > 85%, error rate > 1%, latency > 2s)
- Automation: CI/CD pipeline with automated testing, blue-green deployments, automated backups
- Runbooks: Documented incident response procedures in Confluence/Notion
- Game Days: Quarterly chaos engineering exercises (failover testing)
Security
- Identity Management: IAM roles with least privilege, MFA enforced for console access, OIDC for CI/CD
- Detective Controls: GuardDuty threat detection, CloudTrail audit logging (90-day retention), Security Hub compliance dashboards
- Data Protection: KMS encryption (at-rest), TLS 1.2+ (in-transit), Secrets Manager rotation, S3 versioning
- Incident Response: Automated alerting via SNS, CloudWatch Logs Insights for forensics, AWS Config for compliance tracking
- Network Protection: VPC isolation, security groups, NACLs, WAF rules, private subnets for data tier
Reliability
- Fault Tolerance: Multi-AZ deployment (3 AZs), ECS tasks across AZs, Aurora Multi-AZ, OpenSearch replicas
- Backup Strategy: Automated daily backups (Aurora, OpenSearch, DynamoDB PITR), cross-region replication for critical data
- Auto-Healing: ECS health checks replace failed tasks, Aurora automatic failover, ALB removes unhealthy targets
- Change Management: Blue-green deployments, canary releases, automated rollback on failure
- Monitoring: Real-time CloudWatch metrics, X-Ray distributed tracing, synthetic monitoring (CloudWatch Synthetics)
Performance Efficiency
- Right-Sizing: Graviton2 instances (r6g, c6g) for 20% better price-performance, Auto-scaling based on metrics
- Caching: CloudFront CDN (global edge), ElastiCache Redis (API responses, sessions, listings), Aurora buffer cache
- Database Optimization: Read replicas for read-heavy workloads, Aurora I/O-Optimized for predictable costs
- Search Optimization: OpenSearch with proper shard sizing (10-50GB per shard), hot/warm architecture for time-series data
- CDN Usage: CloudFront for static assets, images, and optionally API responses (reduces origin load by 60-80%)
Cost Optimization
- Resource Optimization: Fargate Spot for non-critical, interruption-tolerant tasks (up to 70% savings), S3 Intelligent-Tiering, EBS gp3 over gp2
- Reserved Capacity: 1-year RDS Reserved Instances (40% savings), ElastiCache Reserved Nodes (30% savings)
- Savings Plans: Compute Savings Plans for ECS Fargate (up to 50% savings)
- Rightsizing: CloudWatch metrics to identify underutilized resources, Lambda for event-driven tasks
- Monitoring: AWS Cost Explorer, Budget alerts at 80% threshold, Trusted Advisor cost checks
Sustainability
- Resource Efficiency: Graviton2 processors (60% better energy efficiency), auto-scaling prevents idle resources
- Minimal Idle: Shut down dev/staging environments off-hours (Lambda scheduler), DynamoDB on-demand for variable workloads
- Managed Services: Leverage AWS-managed services (reduced carbon footprint vs self-managed)
- Data Lifecycle: S3 lifecycle policies archive old data, delete unnecessary logs after 30 days
7. Deployment Flow
Step-by-Step Deployment Process
Phase 1: Infrastructure Provisioning (Terraform)
- VPC & Networking: Deploy VPC, subnets, NAT gateways, route tables, security groups
- Data Layer: Provision Aurora cluster, DynamoDB tables, OpenSearch domain, ElastiCache cluster
- Compute Layer: Create ECS cluster, task definitions, ALB, target groups
- Storage: Create S3 buckets with versioning, lifecycle policies
- Security: Configure KMS keys, Secrets Manager secrets, IAM roles/policies
- Monitoring: Set up CloudWatch log groups, dashboards, alarms, SNS topics
Phase 2: Application Deployment
- Container Build: GitHub Actions triggers on merge to main
- CodeBuild: Builds Docker images, runs unit tests (Jest/Mocha)
- Security Scan: Trivy/Snyk scans images for vulnerabilities
- ECR Push: Successful builds push to Amazon ECR
- Database Migration: Run Flyway/Liquibase migrations (automated in CodePipeline)
- ECS Deployment: CodeDeploy updates ECS services with new task definitions
CI/CD Pipeline Architecture
GitHub → GitHub Actions → CodeBuild → ECR → CodeDeploy → ECS
│ │ │ │ │ │
│ │ │ │ │ └─→ Health Checks
│ │ │ │ └─→ Blue/Green Deploy
│ │ │ └─→ Image Versioning
│ │ └─→ Unit/Integration Tests
│ └─→ Terraform Plan (on PR)
└─→ Trigger on Push/PR
GitHub Actions Workflow:
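name: Deploy to Production
on:
  push:
    branches: [main]
permissions:
  id-token: write  # needed for OIDC role assumption
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}  # placeholder secret name
          aws-region: us-east-1
      # The run: commands below are placeholder script paths for illustration
      - name: Build Docker images
        run: ./scripts/build-images.sh
      - name: Run tests
        run: ./scripts/run-tests.sh
      - name: Push to ECR
        run: ./scripts/push-to-ecr.sh
      - name: Update ECS task definition
        run: ./scripts/update-task-definition.sh
      - name: Trigger CodeDeploy (Blue/Green)
        run: ./scripts/trigger-codedeploy.sh
      - name: Run smoke tests
        run: ./scripts/smoke-tests.sh
      - name: Notify Slack/Email on status
        if: always()
        run: ./scripts/notify.sh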
Blue-Green Deployment Strategy
ECS Blue/Green with CodeDeploy:
- Blue (Current): Production traffic on task set v1.2.3
- Green (New): Deploy task set v1.2.4 to same cluster
- Test Traffic: Route 10% traffic to Green for 5 minutes
- Health Check: Monitor error rates, latency, success rate
- Full Cutover: If healthy, route 100% traffic to Green
- Terminate Blue: Keep Blue for 1 hour, then terminate
- Rollback: If issues, instant rollback to Blue (< 60s)
Canary Deployment (Alternative for Lambda):
- Deploy new Lambda version
- Route 10% traffic → wait 5 min → 25% → wait 10 min → 50% → 100%
- Automated rollback on CloudWatch alarms (error rate, duration)
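The canary progression above can be sketched as a loop over traffic weights with a health gate after each step. In practice CodeDeploy or Lambda alias routing performs the shifting; this sketch only models the control flow. The step list mirrors the schedule above, with no wait assumed after the 50% and 100% steps since none is specified, and `is_healthy` and `wait` are hypothetical callbacks.

```python
from typing import Callable, List, Sequence, Tuple

# (traffic weight in percent, minutes to wait before the health check)
CANARY_STEPS: Sequence[Tuple[int, int]] = [(10, 5), (25, 10), (50, 0), (100, 0)]


def run_canary(steps: Sequence[Tuple[int, int]],
               is_healthy: Callable[[int], bool],
               wait: Callable[[int], None] = lambda minutes: None) -> List[int]:
    """Shift traffic step by step; return the sequence of weights applied.

    A trailing 0 in the result means the rollout was rolled back.
    """
    applied: List[int] = []
    for weight, wait_minutes in steps:
        applied.append(weight)      # route `weight` percent to the new version
        wait(wait_minutes)          # let metrics accumulate
        if not is_healthy(weight):  # e.g. CloudWatch alarm on error rate or duration
            applied.append(0)       # send all traffic back to the old version
            return applied
    return applied


if __name__ == "__main__":
    print(run_canary(CANARY_STEPS, is_healthy=lambda w: True))
    print(run_canary(CANARY_STEPS, is_healthy=lambda w: w < 50))
```

Injecting `wait` as a callback keeps the loop testable without sleeping through real bake times.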
Rollback Procedures
Automated Rollback:
- CodeDeploy: Automatic rollback on CloudWatch alarm (error rate > 1%)
- Trigger alarms: HTTP 5xx > 10 requests/min, Latency > 3s P99
Manual Rollback:
- Identify previous stable task definition/image tag
- Update ECS service with previous task definition
- Force new deployment (drains old tasks, starts new)
- Verify health via CloudWatch metrics and logs
- Time: < 5 minutes for complete rollback
8. Monitoring & Operations
Key Metrics to Monitor
Application Metrics:
- Request Rate: Requests per second (RPS), searches per minute
- Latency: P50, P90, P99, P99.9 response times
- Error Rate: HTTP 4xx, 5xx errors per minute
- Availability: Uptime percentage (target: 99.95%)
- Business Metrics: New listings/hour, user registrations/day, search conversion rate
Infrastructure Metrics:
- ECS: CPU utilization, memory utilization, task count, health check failures
- Aurora: CPU, connections, read/write latency, replica lag, deadlocks
- OpenSearch: Cluster status, JVM memory, indexing rate, search latency, shard status
- ElastiCache: CPU, evictions, cache hit rate, connections, network I/O
- ALB: Target response time, healthy/unhealthy host count, request count, 5xx errors
Alerting Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| ECS CPU | > 70% | > 85% | Scale out tasks |
| ECS Memory | > 75% | > 90% | Scale out tasks |
| Aurora CPU | > 70% | > 85% | Add read replica |
| Aurora Connections | > 500 | > 700 | Investigate leaks |
| OpenSearch JVM | > 75% | > 85% | Scale data nodes |
| ElastiCache Hit Rate | < 80% | < 60% | Review cache strategy |
| API Latency P99 | > 2s | > 3s | Investigate bottleneck |
| Error Rate | > 0.5% | > 1% | Page on-call engineer |
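The threshold table can be encoded as data so that the same values drive alarms and dashboards. The sketch below is one possible encoding; the metric keys and the `classify` helper are illustrative names, while the numeric thresholds come from the table. Note that cache hit rate is the one metric where lower values are worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True


THRESHOLDS = {
    "ecs_cpu_pct": Threshold(70, 85),
    "ecs_memory_pct": Threshold(75, 90),
    "aurora_cpu_pct": Threshold(70, 85),
    "aurora_connections": Threshold(500, 700),
    "opensearch_jvm_pct": Threshold(75, 85),
    "cache_hit_rate_pct": Threshold(80, 60, higher_is_worse=False),
    "api_latency_p99_s": Threshold(2, 3),
    "error_rate_pct": Threshold(0.5, 1),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("ecs_cpu_pct", 72))         # warning
    print(classify("cache_hit_rate_pct", 55))  # critical
```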
Log Aggregation Strategy
CloudWatch Logs:
- Application Logs: /aws/ecs/directory-api, /aws/ecs/directory-business
- Access Logs: /aws/alb/directory-alb
- Lambda Logs: /aws/lambda/directory-*
- Database Logs: Aurora slow query logs (queries > 1s)
- Retention: 30 days (compliance), export to S3 for long-term storage
Log Analysis:
- CloudWatch Logs Insights: Query logs for patterns, errors, slow requests
- X-Ray Service Map: Visualize service dependencies, trace requests end-to-end
- Example Query: fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc
Dashboard Requirements
Operational Dashboard (Real-time):
- Service health status (green/yellow/red)
- Request rate, error rate, latency (last 1 hour)
- Active ECS tasks, database connections
- OpenSearch cluster health, cache hit rate
- Current auto-scaling activity
Business Dashboard (Daily/Weekly):
- Total business listings (active/inactive)
- New user registrations, daily active users
- Search queries (total, by category, by location)
- Revenue metrics (premium listings, ad clicks)
- Conversion funnel (search → view → contact)
Cost Dashboard:
- Daily spend by service (ECS Fargate, RDS, OpenSearch, data transfer)
- Month-to-date vs budget
- Forecast for month-end spending
- Top 10 cost drivers
Incident Response Workflow
- Detection: CloudWatch alarm triggers SNS notification to PagerDuty/Slack
- Acknowledgment: On-call engineer acknowledges within 5 minutes
- Investigation: Check CloudWatch dashboards, logs, X-Ray traces
- Mitigation: Execute runbook (rollback, scale up, restart service)
- Communication: Update status page, notify stakeholders
- Resolution: Verify metrics return to normal, close incident
- Post-Mortem: Document root cause, corrective actions (within 48 hours)
9. Cost Estimation
Production Environment Monthly Costs (Assumptions: 1M listings, 10M searches/month, 100K DAU)
| Service | Configuration | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|---|
| Compute (subtotal) | | | | \$1,458 |
| ECS Fargate | 2vCPU, 4GB (API Gateway) | 10 tasks avg | \$0.08468/hr | \$622 |
| ECS Fargate | 2vCPU, 4GB (Business) | 8 tasks avg | \$0.08468/hr | \$498 |
| ECS Fargate | 1vCPU, 2GB (User/Review) | 8 tasks avg | \$0.04234/hr | \$248 |
| Lambda | 512MB, 5M invocations | 10s avg | \$0.20/1M | \$90 |
| Database (subtotal) | | | | \$1,167 |
| Aurora Primary | db.r6g.xlarge | 1 instance | \$0.52/hr | \$380 |
| Aurora Replicas | db.r6g.large | 2 instances | \$0.26/hr ea. | \$380 |
| Aurora Storage | 500GB | 500GB | \$0.10/GB | \$50 |
| Aurora I/O | I/O-Optimized | Included | \$0 | \$0 |
| Aurora Backup | 500GB | 500GB | \$0.021/GB | \$11 |
| DynamoDB | On-demand | 10GB, 10M R, 2M W | Variable | \$26 |
| ElastiCache | cache.r6g.large | 2 nodes | \$0.218/hr | \$320 |
| Search (subtotal) | | | | \$1,876 |
| OpenSearch Master | c6g.large.search | 3 nodes | \$0.113/hr | \$248 |
| OpenSearch Data | r6g.xlarge.search | 6 nodes | \$0.371/hr | \$1,628 |
| OpenSearch Storage | gp3 200GB per node | 1200GB | Included | \$0 |
| Storage (subtotal) | | | | \$207 |
| S3 Standard | Images, assets | 2TB | \$0.023/GB | \$47 |
| S3 Requests | PUT/GET | 100M | \$0.005/10K | \$50 |
| S3 Data Transfer | Out to internet | 1TB | \$0.09/GB | \$90 |
| EBS Snapshots | Backups | 400GB | \$0.05/GB | \$20 |
| Networking (subtotal) | | | | \$387 |
| ALB | 2 ALBs | 730 hrs | \$0.0252/hr | \$37 |
| ALB LCU | ~2 LCU avg | 1460 hrs | \$0.008/hr | \$12 |
| NAT Gateway | 3 NAT Gateways | 2190 hrs | \$0.045/hr | \$99 |
| NAT Data Transfer | 1TB processed | 1TB | \$0.045/GB | \$46 |
| CloudFront | 2TB out, 100M req | Variable | \$0.085/GB | \$193 |
| Security & Mgmt (subtotal) | | | | \$286 |
| Secrets Manager | 10 secrets | 10 | \$0.40/secret | \$4 |
| KMS | 3 keys, 1M requests | 3 + requests | \$1 + \$0.03/10K | \$7 |
| WAF | 1 ACL, 5 rules | Variable | \$5 + \$1/rule | \$10 |
| CloudWatch Logs | 50GB ingested | 50GB | \$0.50/GB | \$25 |
| CloudWatch Metrics | 500 custom | 500 | \$0.30/metric | \$150 |
| GuardDuty | Account analysis | 1 account | ~\$3/day | \$90 |
| Others (subtotal) | | | | \$78 |
| Route 53 | 1 hosted zone | 1 | \$0.50/zone | \$1 |
| Route 53 Queries | 100M queries | 100M | \$0.40/1M | \$40 |
| SES | 100K emails | 100K | \$0.10/1K | \$10 |
| SNS | 10K notifications | 10K | \$0.50/1M | \$1 |
| SQS | 50M requests | 50M | \$0.40/1M | \$20 |
| CodePipeline | 1 pipeline | 1 | \$1/pipeline | \$1 |
| ECR Storage | 50GB | 50GB | \$0.10/GB | \$5 |
| TOTAL PRODUCTION | | | | \$5,459/month |
Development Environment Monthly Costs
| Service | Configuration | Monthly Cost |
|---|---|---|
| ECS Fargate | 50% of prod tasks | \$350 |
| Aurora | db.r6g.large (1 instance) | \$190 |
| OpenSearch | 3 nodes (smaller) | \$600 |
| ElastiCache | 1 node | \$160 |
| Other services | 30% of prod | \$400 |
| TOTAL DEV | | \$1,700/month |
Total Estimated Monthly Cost
- Production: \$5,459
- Development: \$1,700
- Total: \$7,159/month (~\$85,908/year)
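The hourly-priced rows in the production table follow the pattern rate × instance count × hours per month, using the 730-hour month shown in the ALB and NAT Gateway rows. A minimal sketch of that arithmetic, with a hypothetical helper name and rates copied from the table:

```python
HOURS_PER_MONTH = 730  # average month length used by the ALB and NAT Gateway rows


def monthly_cost(hourly_rate: float, count: int = 1, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of `count` always-on resources billed per hour."""
    return hourly_rate * count * hours


line_items = {
    "Aurora Primary (1 x db.r6g.xlarge)": monthly_cost(0.52),
    "Aurora Replicas (2 x db.r6g.large)": monthly_cost(0.26, 2),
    "ALB (2 load balancers)": monthly_cost(0.0252, 2),
    "NAT Gateway (3 gateways)": monthly_cost(0.045, 3),
}

if __name__ == "__main__":
    for name, cost in line_items.items():
        print(f"{name}: ${cost:,.2f}")
```

Rounded to whole dollars, these reproduce the corresponding table rows. Usage-priced rows (Lambda, DynamoDB, CloudFront) depend on request and transfer volumes and are not covered by this helper.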
Cost Optimization Recommendations
- Reserved Instances (1-year, No Upfront):
  - Aurora: Save \$2,736/year (40% on \$570/month)
  - ElastiCache: Save \$1,152/year (30% on \$320/month)
  - Total Savings: ~\$3,888/year
- Compute Savings Plans:
  - ECS Fargate: Save ~\$5,250/year (30% on \$1,458/month compute)
- Right-Sizing:
  - Monitor CloudWatch metrics for 30 days, downsize underutilized instances
  - Potential savings: 10-15% (\$500-750/month)
- Dev Environment Automation:
  - Auto-shutdown off-hours (nights, weekends): Save ~\$850/month (50% of dev costs)
  - Lambda scheduler to stop/start resources
- S3 Optimization:
  - Implement S3 Lifecycle policies (Standard → IA → Glacier)
  - Potential savings: 30% on old assets (\$15-20/month)
- OpenSearch Alternative:
  - For lower search volumes, consider Algolia (managed, pay-per-search)
  - Break-even: ~50K searches/month vs the provisioned OpenSearch domain
Optimized Production Cost: ~\$4,000-4,500/month with reservations and automation
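The reserved-instance and dev-shutdown figures above are straightforward percentage calculations. The sketch below reproduces them; the function names are hypothetical, and the discount percentages are the ones assumed in this section rather than quoted AWS rates.

```python
def annual_savings(monthly_cost: float, discount_pct: int) -> float:
    """Yearly savings from a percentage discount on a recurring monthly cost."""
    return monthly_cost * discount_pct / 100 * 12


def monthly_savings(monthly_cost: float, discount_pct: int) -> float:
    """Monthly savings from a percentage reduction in a monthly cost."""
    return monthly_cost * discount_pct / 100


aurora_ri = annual_savings(570, 40)        # Aurora reserved instances
elasticache_ri = annual_savings(320, 30)   # ElastiCache reserved nodes
dev_shutdown = monthly_savings(1700, 50)   # off-hours shutdown of the dev environment

if __name__ == "__main__":
    print(f"Aurora RI:       ${aurora_ri:,.0f}/year")
    print(f"ElastiCache RI:  ${elasticache_ri:,.0f}/year")
    print(f"RI total:        ${aurora_ri + elasticache_ri:,.0f}/year")
    print(f"Dev auto-stop:   ${dev_shutdown:,.0f}/month")
```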
10. Implementation Roadmap
Phase 1: Foundation (Weeks 1-4)
Week 1-2: Infrastructure Setup
- Set up AWS Organizations, multi-account structure (dev/staging/prod)
- Configure Terraform state backend (S3 + DynamoDB)
- Deploy VPC, subnets, security groups, NAT gateways
- Set up IAM roles, KMS keys, Secrets Manager
- Deliverable: Complete network infrastructure
Week 3-4: Data Layer
- Provision Aurora PostgreSQL cluster with read replicas
- Deploy OpenSearch domain with proper sizing
- Create DynamoDB tables (sessions, analytics)
- Set up ElastiCache Redis cluster
- Configure S3 buckets with lifecycle policies
- Deliverable: Functional data layer with backups
Phase 2: Application Development (Weeks 5-10)
Week 5-6: Core Services
- Develop User Service (authentication, registration, profile)
- Develop Business Service (CRUD, validation, approval workflow)
- Implement database schema and migrations
- Unit tests (80% coverage target)
- Deliverable: Core microservices with tests
Week 7-8: Search & Discovery
- Integrate OpenSearch with Business Service
- Implement geospatial search (radius, location-based)
- Build category taxonomy and filtering
- Develop Search Service API
- Deliverable: Working search functionality
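For the radius search in this phase, OpenSearch provides a `geo_distance` filter and a `_geo_distance` sort. The sketch below only builds the request body as a Python dictionary. The field names `location` (which would need a `geo_point` mapping) and `category` (a keyword field) are assumptions for illustration, as is the `build_geo_search` helper itself.

```python
import json
from typing import Optional


def build_geo_search(lat: float, lon: float, radius_km: float,
                     category: Optional[str] = None, size: int = 20) -> dict:
    """Build an OpenSearch query body for listings within a radius, nearest first."""
    origin = {"lat": lat, "lon": lon}
    filters = [
        # Keep only documents whose geo_point lies within the radius.
        {"geo_distance": {"distance": f"{radius_km}km", "location": origin}}
    ]
    if category:
        filters.append({"term": {"category": category}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        # Sort results by distance from the search origin.
        "sort": [{"_geo_distance": {"location": origin, "order": "asc", "unit": "km"}}],
    }


if __name__ == "__main__":
    body = build_geo_search(40.7128, -74.0060, 5, category="restaurants")
    print(json.dumps(body, indent=2))
```

Placing the geo and category conditions in `filter` context rather than `must` skips relevance scoring for them and lets OpenSearch cache the filters, which suits the sub-second latency target.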
Week 9-10: Supporting Services
- Review & Rating Service
- Image upload/processing Lambda
- Email notification service (SES integration)
- Admin panel backend
- Deliverable: Complete backend services
Phase 3: Frontend & Integration (Weeks 11-14)
Week 11-12: Web Application
- Next.js frontend with SSR for SEO
- Search interface with filters
- Business listing pages
- User dashboard
- Deliverable: Functional web application
Week 13-14: Integration & Testing
- Integration testing (Cypress/Playwright)
- Performance testing (JMeter/k6)
- Security testing (OWASP ZAP)
- UAT with stakeholders
- Deliverable: Tested, integrated system
Phase 4: DevOps & Production (Weeks 15-18)
Week 15-16: CI/CD Pipeline
- Set up GitHub Actions workflows
- Configure CodePipeline, CodeBuild, CodeDeploy
- Implement blue-green deployment
- Container security scanning
- Deliverable: Automated deployment pipeline
Week 17: Monitoring & Observability
- CloudWatch dashboards and alarms
- X-Ray distributed tracing
- Log aggregation and analysis setup
- PagerDuty/Slack integration
- Deliverable: Complete monitoring system
Week 18: Production Deployment
- Production infrastructure deployment
- Database migration and seed data
- DNS cutover (Route 53)
- Go-live checklist execution
- Deliverable: Live production system
Phase 5: Optimization & Scaling (Weeks 19-22)
Week 19-20: Performance Optimization
- Implement caching strategies
- Database query optimization
- OpenSearch index tuning
- CDN configuration
- Deliverable: Optimized performance
Week 21-22: Documentation & Handover
- Architecture documentation
- Runbooks and playbooks
- Team training
- Knowledge transfer
- Deliverable: Complete documentation
Timeline Estimate: 22 weeks (about 5 months)
Critical Path Items
- VPC and networking setup (blocking all else)
- Database provisioning (blocking application development)
- Core services development (blocking frontend)
- OpenSearch integration (blocking search features)
- CI/CD pipeline (blocking production deployment)
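These blocking relationships form a dependency graph, and a topological sort yields a valid execution order. The sketch below uses Python's standard `graphlib`. The task names are hypothetical labels, and two edges are assumptions not stated in the list: that OpenSearch integration waits for core services, and that production deployment waits for the frontend and search features as well as CI/CD.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it can start.
dependencies = {
    "database_provisioning": {"vpc_networking"},
    "core_services": {"database_provisioning"},
    "opensearch_integration": {"core_services"},       # assumption
    "frontend": {"core_services"},
    "search_features": {"opensearch_integration"},
    "cicd_pipeline": {"vpc_networking"},
    "production_deployment": {"cicd_pipeline", "frontend", "search_features"},  # partly assumed
}

# static_order() raises graphlib.CycleError if the dependencies are circular.
order = list(TopologicalSorter(dependencies).static_order())

if __name__ == "__main__":
    for step, task in enumerate(order, start=1):
        print(f"{step}. {task}")
```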
Team Skill Requirements
| Role | Count | Skills Required |
|---|---|---|
| Solutions Architect | 1 | AWS, System Design, Terraform |
| Backend Engineers | 3 | Node.js/Python, Microservices, Databases |
| Frontend Engineers | 2 | React, Next.js, TypeScript |
| DevOps Engineer | 1 | Terraform, CI/CD, AWS, Docker |
| QA Engineer | 1 | Testing frameworks, Automation |
| Product Manager | 1 | Requirements, Stakeholder management |
Total Team: 9 people
11. Assumptions & Prerequisites
Traffic/User Load Assumptions
- Daily Active Users (DAU): 100,000
- Monthly Active Users (MAU): 500,000
- Peak Concurrent Users: 10,000
- Average Requests per User: 20/session
- Search Queries: 10 million/month
- New Listings: 10,000/month
- Total Business Listings: 1 million (initial), growing 1% monthly
- Peak Traffic: 3x average (during business hours, marketing campaigns)
- Geographic Distribution: 70% US, 20% EU, 10% APAC
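These figures translate into request rates that are useful for sizing the auto-scaling ranges. The sketch below is a rough back-of-the-envelope calculation that assumes one session per active user per day and a 30-day month, neither of which is stated above.

```python
SECONDS_PER_DAY = 86_400

dau = 100_000
requests_per_session = 20
sessions_per_user_per_day = 1      # assumption: not given in the list above
peak_multiplier = 3
searches_per_month = 10_000_000
days_per_month = 30                # assumption: simplified month length

# Average and peak API request rates across a full day.
avg_rps = dau * requests_per_session * sessions_per_user_per_day / SECONDS_PER_DAY
peak_rps = avg_rps * peak_multiplier

# Average and peak search query rates.
avg_search_qps = searches_per_month / (days_per_month * SECONDS_PER_DAY)
peak_search_qps = avg_search_qps * peak_multiplier

if __name__ == "__main__":
    print(f"Average requests/sec: {avg_rps:.1f}")
    print(f"Peak requests/sec:    {peak_rps:.1f}")
    print(f"Average searches/sec: {avg_search_qps:.2f}")
    print(f"Peak searches/sec:    {peak_search_qps:.2f}")
```

Because traffic concentrates in business hours, the real peak rate is likely higher than a 24-hour average multiplied by three, so load tests should not rely on these numbers alone.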
Data Volume Assumptions
- Database Size: 500GB initially, growing 50GB/month
- Images/Assets: 2TB initially, growing 100GB/month
- Log Data: 50GB/month
- Backup Storage: 1TB total
- OpenSearch Index: 100GB initially, growing 10GB/month
- Average Business Listing: 5KB (text + metadata)
- Average Image: 500KB (after compression)
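A short projection shows where these volumes land after a year of linear growth. The helper name is hypothetical, and 2TB is taken as 2048GB for the calculation.

```python
def project_linear(initial_gb: float, growth_gb_per_month: float, months: int) -> float:
    """Size in GB after `months` months of constant absolute growth."""
    return initial_gb + growth_gb_per_month * months


# (initial size in GB, growth in GB per month), taken from the assumptions above.
volumes = {
    "database": (500, 50),
    "images": (2048, 100),          # 2TB treated as 2048GB
    "opensearch_index": (100, 10),
}

projection_12m = {
    name: project_linear(start, growth, 12)
    for name, (start, growth) in volumes.items()
}

# Raw listing text: 1 million listings at 5KB each, expressed in decimal GB.
listing_text_gb = 1_000_000 * 5 / 1_000_000

if __name__ == "__main__":
    for name, size in projection_12m.items():
        print(f"{name}: {size:,.0f} GB after 12 months")
    print(f"listing text: {listing_text_gb:.0f} GB")
```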
Availability Requirements
- Target Uptime: 99.95% (4.38 hours downtime/year)
- Maintenance Windows: Monthly, 2 AM - 4 AM EST, < 30 min
- RTO (Recovery Time Objective): 2 hours
- RPO (Recovery Point Objective): 5 minutes
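The downtime figure follows directly from the uptime target: the unavailable fraction multiplied by the hours in the period. A minimal sketch, with a hypothetical function name:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime allowed in a period for a given availability target."""
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    yearly = downtime_budget_hours(99.95)
    monthly_minutes = downtime_budget_hours(99.95, period_hours=730) * 60
    print(f"Yearly budget:  {yearly:.2f} hours")
    print(f"Monthly budget: {monthly_minutes:.1f} minutes")
```

At 99.95%, the yearly budget is about 4.38 hours, or roughly 21.9 minutes per 730-hour month.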
Required Team Expertise
- AWS Services: VPC, ECS, RDS Aurora, OpenSearch, CloudFormation/Terraform
- Programming: Node.js/Python, SQL, JavaScript/TypeScript
- DevOps: Docker, CI/CD, Infrastructure as Code
- Databases: PostgreSQL, DynamoDB, Redis, OpenSearch/Elasticsearch
- Frontend: React, Next.js, responsive design
Existing Infrastructure Considerations
- Greenfield Deployment: No existing infrastructure (fresh AWS account)
- Domain Name: Owned, ready to transfer to Route 53
- SSL Certificates: Will be provisioned via ACM
- Third-Party Integrations: Stripe account, Google Maps API key
- Data Migration: Not applicable (new platform)
12. Risks & Mitigations
Technical Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| OpenSearch cost overrun | High | Medium | Monitor query patterns, implement caching, consider Aurora for simple searches |
| Database performance bottleneck | High | Medium | Aurora read replicas, query optimization, caching layer, connection pooling |
| NAT Gateway costs exceed budget | Medium | High | VPC endpoints for AWS services (S3, DynamoDB), review data transfer patterns |
| Lambda cold starts impact UX | Medium | Medium | Provisioned concurrency for critical functions, use ECS for latency-sensitive |
| OpenSearch cluster downtime | High | Low | Multi-AZ deployment, automated snapshots, documented restore procedures |
| Data transfer costs | Medium | High | CloudFront caching, asset compression |
| Security breach | Critical | Low | WAF, GuardDuty, Security Hub, regular audits, pen testing, compliance checks |
| Vendor lock-in to AWS | Medium | High | Use Terraform (portable IaC), abstract AWS SDK calls, document alternatives |
Mitigation Strategies
Cost Management:
- Budget Alerts: Set CloudWatch billing alarms at 80%, 90%, 100% of budget
- Regular Reviews: Monthly cost analysis, identify anomalies
- Reserved Capacity: Purchase RIs after 3 months of stable usage patterns
- Right-Sizing: Quarterly review of instance utilization, downsize underutilized
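The budget alerts can be expressed as CloudWatch alarms on the `EstimatedCharges` metric in the `AWS/Billing` namespace. The sketch below only builds the parameter dictionaries so it runs without AWS access. The helper name, alarm names, SNS topic ARN, and the \$6,000 budget figure are placeholders, and billing metrics must first be enabled in the account's billing preferences.

```python
def billing_alarm_params(budget_usd: float, pct: int, sns_topic_arn: str) -> dict:
    """Parameters for a CloudWatch alarm that fires at `pct` percent of the budget."""
    return {
        "AlarmName": f"billing-{pct}pct-of-budget",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing metrics are published only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": budget_usd * pct / 100,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder ARN; replace with the real notification topic.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:billing-alerts"
alarms = [billing_alarm_params(6000, pct, TOPIC_ARN) for pct in (80, 90, 100)]

# To create the alarms with boto3 (billing metrics are published in us-east-1):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
#   for params in alarms:
#       cloudwatch.put_metric_alarm(**params)

if __name__ == "__main__":
    for params in alarms:
        print(params["AlarmName"], params["Threshold"])
```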
Performance Assurance:
- Load Testing: Pre-launch testing with 2x expected peak load
- Performance Monitoring: Real-time CloudWatch dashboards, alert on P99 > 2s
- Capacity Planning: Quarterly forecast based on growth trends
- Caching Strategy: Multi-layer (CloudFront, ElastiCache, in-memory)
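The P99 alert threshold can be illustrated with a nearest-rank percentile over a window of latency samples. This is a sketch for intuition only, with hypothetical function names. CloudWatch computes percentile statistics itself, so production alarms would not need this code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p99_breached(latencies_ms: Sequence[float], threshold_ms: float = 2000) -> bool:
    """True when the 99th-percentile latency exceeds the alert threshold (2s by default)."""
    return percentile(latencies_ms, 99) > threshold_ms


if __name__ == "__main__":
    window = [120.0] * 98 + [2500.0] * 2  # 2% of requests are slow
    print(percentile(window, 99), p99_breached(window))
```

With 100 samples, a single slow request does not move P99, but two do, which is why P99 alerts are less noisy than alerts on the maximum.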
Disaster Recovery:
- Quarterly DR Drills: Test failover to DR region, measure RTO/RPO
- Backup Verification: Monthly restore testing from snapshots
- Chaos Engineering: Simulate failures (random task termination, AZ outage)
Security Hardening:
- Penetration Testing: Annual third-party pen test
- Compliance Audits: Quarterly internal audits (SOC 2, GDPR)
- Security Training: Developer security training, secure coding practices
- Patch Management: Automated OS patching (Systems Manager Patch Manager)
Alternative Approaches Considered
1. Serverless-First Architecture (Lambda + API Gateway)
- Pros: Lower cost at low scale, no infrastructure management
- Cons: Cold starts, timeout limits, complex orchestration, vendor lock-in
- Rejected: Complex business logic better suited for long-running services
2. Kubernetes (EKS) Instead of ECS
- Pros: Industry standard, multi-cloud portability, rich ecosystem
- Cons: Higher operational complexity, steeper learning curve, higher costs
- Rejected: ECS on Fargate is simpler for this use case and matches the team's existing expertise
3. Self-Managed Elasticsearch Instead of OpenSearch Service
- Pros: More control, potentially lower cost
- Cons: Operational overhead, patching, scaling complexity
- Rejected: Managed service reduces toil, built-in HA
4. Aurora Serverless v2 Instead of Provisioned
- Pros: Auto-scaling, pay-per-use
- Cons: Less predictable costs, cold start delays, ACU pricing complexity
- Decision: Use provisioned for predictable workloads, consider serverless for dev/staging
5. NoSQL-Only (DynamoDB) Instead of Relational
- Pros: Unlimited scale, low latency
- Cons: Complex queries difficult, no transactions (at scale), data modeling complexity
- Rejected: Relational model better for business directory use case (joins, ACID)
Success Criteria
✅ Performance: P99 search latency < 500ms, listing page load < 1s
✅ Availability: 99.95% uptime, max 4.38 hours downtime/year
✅ Scalability: Handle 10x traffic growth without architecture changes
✅ Cost: Stay within \$6,000/month production budget (optimize to \$4,500)
✅ Security: Pass security audit, zero critical vulnerabilities
✅ Recovery: Achieve RTO < 2 hours, RPO < 5 minutes in DR tests
This comprehensive solution provides a production-ready, highly available online business directory platform following AWS Well-Architected Framework principles. The architecture balances performance, cost, and operational simplicity using managed AWS services, enabling rapid deployment and scalable growth.
