DEV Community

Manish Kumar

Designing Enterprise-Grade AWS Architecture for a Scalable Online Business Directory Platform

1. Solution Overview

The proposed solution is a cloud-native, multi-tenant business directory platform built on AWS using a hybrid microservices and serverless architecture. The platform lets businesses list their services, lets users search for and discover local businesses, and supports monetization through premium listings, advertisements, and subscription tiers.

Key Business Objectives:

  • Deliver a highly available search and discovery experience with 99.95% uptime
  • Support millions of business listings with real-time updates
  • Enable geospatial search with sub-second response times
  • Scale elastically based on traffic patterns (peak/off-peak)
  • Minimize operational overhead through managed services
  • Support multi-region deployment for global reach

Architectural Approach: Event-driven microservices with serverless components for cost optimization, leveraging managed services for search (OpenSearch), caching (ElastiCache), and databases (Aurora PostgreSQL + DynamoDB).
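
To make the geospatial search objective more concrete, here is a minimal sketch of an OpenSearch query body that combines a full-text match with a `geo_distance` filter. The helper name `build_nearby_query` and the field names `name` and `location` are assumptions for illustration only (they are not defined in this design), and `location` would need to be mapped as a `geo_point` in the index.

```python
import json


def build_nearby_query(text: str, lat: float, lon: float, radius_km: float, size: int = 20) -> dict:
    """Build a search body: full-text match on a name field, limited to a radius."""
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")
    origin = {"lat": lat, "lon": lon}
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text clause (hypothetical text field: name)
                "must": [{"match": {"name": text}}],
                # Non-scoring geo filter (hypothetical geo_point field: location)
                "filter": [
                    {"geo_distance": {"distance": f"{radius_km}km", "location": origin}}
                ],
            }
        },
        # Return the nearest businesses first
        "sort": [
            {"_geo_distance": {"location": origin, "order": "asc", "unit": "km"}}
        ],
    }


if __name__ == "__main__":
    # Example: coffee shops within 5 km of a point in New York City
    print(json.dumps(build_nearby_query("coffee", 40.7128, -74.0060, 5), indent=2))
```

Putting the geo clause in `filter` rather than `must` keeps it out of relevance scoring and lets OpenSearch cache it, which tends to help with the sub-second target.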


2. Architecture Components

AWS Services & Resources

Compute Layer

  • Amazon ECS on Fargate (serverless containers)
    • API Gateway Service: 2 vCPU, 4GB RAM, auto-scale 2-20 tasks
    • Business Management Service: 2 vCPU, 4GB RAM, auto-scale 2-15 tasks
    • User Service: 1 vCPU, 2GB RAM, auto-scale 2-10 tasks
    • Review & Rating Service: 1 vCPU, 2GB RAM, auto-scale 2-10 tasks
  • AWS Lambda (event-driven functions)
    • Image processing: 1024MB, 60s timeout
    • Search indexing: 512MB, 30s timeout
    • Email notifications: 256MB, 15s timeout
    • Analytics aggregation: 1024MB, 120s timeout

Storage Layer

  • Amazon S3
    • Business images/logos: S3 Standard (with lifecycle to Glacier after 1 year)
    • Static website assets: S3 Standard with CloudFront CDN
    • Backups: S3 Intelligent-Tiering
    • Bucket settings: Versioning enabled, encryption at rest (SSE-S3; SSE-KMS for sensitive data)
  • Amazon EBS
    • gp3 volumes for OpenSearch nodes (200GB per node)

Database Layer

  • Amazon Aurora PostgreSQL (version 15.x)
    • Primary DB: db.r6g.xlarge (4 vCPU, 32GB RAM) - Multi-AZ
    • Read replicas: 2x db.r6g.large (2 vCPU, 16GB RAM)
    • Database: Business listings, user accounts, subscriptions, transactions
    • Aurora I/O-Optimized configuration for predictable costs
  • Amazon DynamoDB
    • User sessions (on-demand capacity)
    • Real-time analytics counters (provisioned: 50 RCU, 25 WCU)
    • Business activity logs (on-demand capacity)
  • Amazon OpenSearch Service
    • Domain: business-directory-search
    • Master nodes: 3x c6g.large.search (2 vCPU, 4GB RAM)
    • Data nodes: 6x r6g.xlarge.search (4 vCPU, 32GB RAM, 200GB gp3 each)
    • Multi-AZ with 1 replica per index
  • Amazon ElastiCache for Redis
    • cache.r6g.large (2 nodes, cluster mode enabled)
    • Cache: API responses, session data, frequently accessed listings

Networking Layer

  • Amazon VPC
    • CIDR: 10.0.0.0/16
    • Public Subnets: 10.0.1.0/24 (AZ-a), 10.0.2.0/24 (AZ-b), 10.0.3.0/24 (AZ-c)
    • Private Subnets (App): 10.0.11.0/24 (AZ-a), 10.0.12.0/24 (AZ-b), 10.0.13.0/24 (AZ-c)
    • Private Subnets (Data): 10.0.21.0/24 (AZ-a), 10.0.22.0/24 (AZ-b), 10.0.23.0/24 (AZ-c)
    • NAT Gateways: 3 (one per AZ for high availability)
  • Application Load Balancer (ALB)
    • Internet-facing ALB for web traffic
    • Internal ALB for microservices communication
    • SSL/TLS termination with ACM certificates
  • Amazon CloudFront
    • Global CDN for static assets, images, and API caching
    • Custom domain with Route 53 integration
  • Amazon Route 53
    • Hosted zone for domain management
    • Geolocation routing for multi-region setup
    • Health checks for failover
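
As a sanity check on the VPC address plan above, the following sketch uses Python's standard `ipaddress` module to confirm that every subnet sits inside 10.0.0.0/16 and that no two subnets overlap. The subnet labels (`public-a`, `app-a`, and so on) and the helper names are made up for this example. The host count subtracts the five addresses AWS reserves in every subnet.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = "10.0.0.0/16"

# Labels are illustrative; the CIDRs come from the networking layer list.
SUBNETS = {
    "public-a": "10.0.1.0/24", "public-b": "10.0.2.0/24", "public-c": "10.0.3.0/24",
    "app-a": "10.0.11.0/24", "app-b": "10.0.12.0/24", "app-c": "10.0.13.0/24",
    "data-a": "10.0.21.0/24", "data-b": "10.0.22.0/24", "data-c": "10.0.23.0/24",
}


def validate_subnets(vpc_cidr: str, subnets: dict) -> list:
    """Return a list of problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


def usable_hosts(cidr: str) -> int:
    """Addresses available for ENIs: AWS reserves 5 per subnet."""
    return ipaddress.ip_network(cidr).num_addresses - 5


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, SUBNETS))
    print(usable_hosts("10.0.11.0/24"))
```

Each Fargate task takes one IP address from its subnet, so the per-subnet host count is worth comparing against the maximum task counts per AZ.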

Security Services

  • AWS IAM
    • Service roles for ECS tasks, Lambda functions
    • OIDC provider for GitHub Actions CI/CD
    • Least privilege policies for all resources
  • AWS Secrets Manager
    • Database credentials rotation (every 30 days)
    • API keys for third-party integrations
    • Encryption keys management
  • AWS KMS
    • Customer-managed keys for S3, RDS, DynamoDB encryption
    • Separate keys per environment (dev, staging, prod)
  • AWS WAF
    • Rate limiting: 2000 requests per 5 minutes per IP
    • SQL injection and XSS protection rules
    • Geo-blocking for specific countries (if needed)
  • AWS Shield Standard (included by default)
  • Amazon GuardDuty (threat detection)
  • AWS Security Hub (compliance monitoring)

Monitoring & Logging

  • Amazon CloudWatch
    • Logs: Centralized logging for all services (retention: 30 days)
    • Metrics: Custom metrics for business KPIs
    • Alarms: CPU, memory, disk, latency, error rates
    • Dashboards: Real-time operational visibility
  • AWS X-Ray (distributed tracing)
  • AWS CloudTrail (API audit logging, 90-day retention)

CI/CD Pipeline

  • AWS CodePipeline (orchestration)
  • AWS CodeBuild (build and test)
  • AWS CodeDeploy (deployment to ECS)
  • Amazon ECR (container registry)

Other Managed Services

  • Amazon SES (transactional emails)
  • Amazon SNS (notifications, alerts)
  • Amazon SQS (message queuing for async processing)
  • Amazon EventBridge (event routing)
  • AWS Backup (centralized backup management)

Infrastructure-as-Code Tools

Primary IaC: Terraform (recommended for multi-cloud portability and mature ecosystem)

  • Terraform v1.6+ with AWS Provider v5.x
  • State management: S3 backend with DynamoDB state locking
  • Modular structure: VPC, ECS, RDS, OpenSearch, monitoring modules
  • Environment management: Workspaces for dev/staging/prod
  • Secret management: Terraform Cloud or SOPS for sensitive variables

Alternative: AWS CDK (TypeScript) for teams preferring programmatic infrastructure

Configuration Management:

  • AWS Systems Manager Parameter Store for application configuration
  • AWS AppConfig for feature flags and dynamic configuration

Third-Party Tools/Platforms

Container Orchestration:

  • ECS Fargate (managed, no Kubernetes overhead needed for this use case)
  • Docker Engine 24.x for local development
  • Docker Compose for local multi-service testing

CI/CD Platform:

  • GitHub Actions (primary - free for public repos, integrated with AWS OIDC)
  • Alternative: GitLab CI or Jenkins for on-premise integration

Monitoring & Observability:

  • Datadog or New Relic (optional, for enhanced APM)
  • Grafana (self-hosted or Grafana Cloud) for custom dashboards
  • Prometheus (for Kubernetes if migrating from ECS in future)

SaaS Integrations:

  • Stripe for payment processing (subscription management)
  • Twilio for SMS notifications (optional)
  • Google Maps API or Mapbox for geocoding and maps
  • Algolia (optional alternative to OpenSearch for simpler search needs)
  • SendGrid (backup email provider)

Programming Languages & Frameworks

Backend Services:

  • Node.js 20.x LTS with Express.js or NestJS (microservices framework)
  • Python 3.11+ with FastAPI (for ML/analytics services)
  • Go 1.21+ (for high-performance services like search indexing)

Frontend:

  • React 18+ with Next.js 14 (SSR/SSG for SEO)
  • TypeScript 5.x (type safety)
  • Tailwind CSS or Material-UI for styling

Mobile (Optional Future Phase):

  • React Native or Flutter for cross-platform apps

Scripting & Automation:

  • Bash/Shell for deployment scripts
  • Python for data migration and ETL jobs
  • Node.js for Lambda functions

Libraries & Frameworks:

  • Sequelize/TypeORM (ORM for PostgreSQL)
  • AWS SDK (JavaScript, Python, Go)
  • OpenSearch JavaScript Client
  • Redis Client (ioredis)
  • Jest/Mocha (unit testing)
  • Cypress/Playwright (E2E testing)

Hardware/Compute Specifications

ECS Fargate Task Specifications:

  • API Gateway Service: 2 vCPU, 4GB RAM (handles routing, authentication)
  • Business Service: 2 vCPU, 4GB RAM (CRUD operations, complex queries)
  • User Service: 1 vCPU, 2GB RAM (lightweight user operations)
  • Review Service: 1 vCPU, 2GB RAM (moderate load)

Auto-scaling Configuration:

  • Target CPU Utilization: 70%
  • Target Memory Utilization: 80%
  • Scale-out cooldown: 60 seconds
  • Scale-in cooldown: 300 seconds
  • Min tasks: 2 per service (high availability)
  • Max tasks: 10-20 per service (based on load testing)

Lambda Configuration:

  • Image Processing: 1024MB, 60s timeout (handles image resize/optimization)
  • Search Indexing: 512MB, 30s timeout (bulk indexing to OpenSearch)
  • Email Service: 256MB, 15s timeout (SES integration)
  • Analytics: 1024MB, 120s timeout (aggregation jobs)

Database Sizing:

  • Aurora Primary: db.r6g.xlarge (4 vCPU, 32GB RAM), sized for an estimated 500-1000 TPS (to be confirmed by load testing)
  • Aurora Replicas: 2x db.r6g.large - distributes read load
  • Auto-scaling: Read replicas scale 2-5 based on CPU > 75%

OpenSearch Cluster:

  • Master Nodes: 3x c6g.large.search (dedicated for cluster management)
  • Data Nodes: 6x r6g.xlarge.search (search and indexing operations)
  • Storage per node: 200GB gp3 (1.2TB raw storage across the 6 data nodes)
  • Replicas: 1 per index (doubles storage use, leaving about 600GB for primary data before indexing and OS overhead)
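
The storage arithmetic for this cluster can be sketched as follows. The 1.45 overhead multiplier reflects the rule of thumb in AWS's OpenSearch sizing guidance (source data × (1 + replicas) × 1.45), and the 30GB target shard size is simply a midpoint of the 10-50GB range used later in this article. Both are assumptions to adjust for the real workload, and the function names are invented for this example.

```python
import math


def raw_storage_gb(data_nodes: int, gb_per_node: int) -> int:
    """Total EBS storage attached to the data nodes."""
    return data_nodes * gb_per_node


def max_source_data_gb(raw_gb: float, replicas: int, overhead: float = 1.45) -> float:
    """Largest source data set that fits: raw / ((1 + replicas) * overhead)."""
    if replicas < 0 or overhead <= 0:
        raise ValueError("replicas must be >= 0 and overhead must be > 0")
    return raw_gb / ((1 + replicas) * overhead)


def primary_shard_count(index_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_gb / target_shard_gb))


if __name__ == "__main__":
    raw = raw_storage_gb(data_nodes=6, gb_per_node=200)
    fit = max_source_data_gb(raw, replicas=1)
    print(f"raw={raw}GB, source data that fits={fit:.0f}GB")
    print(f"primary shards for a 400GB index: {primary_shard_count(400)}")
```

With one replica, the six 200GB volumes hold roughly 600GB of primary data, and closer to 400GB of source data once the overhead factor is applied.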

3. Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                          USER LAYER (Global)                                │
│  [Web Browser] [Mobile App] [API Clients]                                   │
└────────────────────────────┬────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                     CONTENT DELIVERY NETWORK                                 │
│  ┌────────────────────────────────────────────────────────────────┐          │
│  │  Amazon CloudFront (Global Edge Locations)                      │         │
│  │  - Static Assets Caching                                        │         │
│  │  - API Response Caching (optional)                              │         │
│  │  - SSL/TLS Termination                                          │         │
│  └────────────────────────────────────────────────────────────────┘          │
└────────────────────────────┬────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          DNS & ROUTING LAYER                                │
│  ┌────────────────────────────────────────────────────────────────┐         │
│  │  Amazon Route 53                                               │         │
│  │  - Health Checks  - Geolocation Routing  - Failover            │         │
│  └────────────────────────────────────────────────────────────────┘         │
└────────────────────────────┬────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        SECURITY PERIMETER                                   │
│  ┌────────────────┐  ┌──────────────────┐  ┌─────────────────┐              │
│  │  AWS WAF       │  │  AWS Shield      │  │  AWS GuardDuty  │              │
│  │  - Rate Limit  │  │  - DDoS          │  │  - Threat Det.  │              │
│  │  - SQL Inject. │  │    Protection    │  │                 │              │
│  └────────────────┘  └──────────────────┘  └─────────────────┘              │
└────────────────────────────┬────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                   AWS REGION (us-east-1 / Primary)                          │
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │              VPC (10.0.0.0/16)                                     │     │
│  │                                                                    │     │
│  │  ┌──────────────────────────────────────────────────────────────┐  │     │
│  │  │           PUBLIC SUBNETS (Multi-AZ)                          │  │     │
│  │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │  │     │
│  │  │  │ 10.0.1.0/24  │  │ 10.0.2.0/24  │  │ 10.0.3.0/24  │        │  │     │
│  │  │  │   (AZ-a)     │  │   (AZ-b)     │  │   (AZ-c)     │        │  │     │
│  │  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │  │     │
│  │  │         │                  │                  │              │  │     │
│  │  │    [NAT GW-a]         [NAT GW-b]        [NAT GW-c]           │  │     │
│  │  │         │                  │                  │              │  │     │
│  │  │  ┌──────┴──────────────────┴──────────────────┴───────┐      │  │     │
│  │  │  │   Application Load Balancer (ALB)                  │      │  │     │
│  │  │  │   - SSL Termination (ACM Certificate)              │      │  │     │
│  │  │  │   - Target Groups for ECS Services                 │      │  │     │
│  │  │  └──────────────────────┬─────────────────────────────┘      │  │     │
│  │  └─────────────────────────┼───────────────────────────────────┘   │     │
│  │                            │                                       │     │
│  │  ┌─────────────────────────┼───────────────────────────────────┐   │    │
│  │  │      PRIVATE SUBNETS - APPLICATION TIER (Multi-AZ)          │  │    │
│  │  │  ┌──────────────┐  ┌────┴─────────┐  ┌──────────────┐       │  │    │
│  │  │  │ 10.0.11.0/24 │  │ 10.0.12.0/24 │  │ 10.0.13.0/24 │       │  │    │
│  │  │  │   (AZ-a)     │  │   (AZ-b)     │  │   (AZ-c)     │       │  │    │
│  │  │  └──────────────┘  └──────────────┘  └──────────────┘       │  │    │
│  │  │                                                               │  │    │
│  │  │  ┌────────────────────────────────────────────────────────┐ │  │    │
│  │  │  │         ECS FARGATE CLUSTER                            │ │  │    │
│  │  │  │  ┌─────────────────┐  ┌──────────────────┐            │ │  │    │
│  │  │  │  │ API Gateway Svc │  │ Business Mgmt    │            │ │  │    │
│  │  │  │  │ (2-20 tasks)    │  │ Service          │            │ │  │    │
│  │  │  │  │ 2vCPU/4GB       │  │ (2-15 tasks)     │            │ │  │    │
│  │  │  │  └─────────────────┘  └──────────────────┘            │ │  │    │
│  │  │  │  ┌─────────────────┐  ┌──────────────────┐            │ │  │    │
│  │  │  │  │ User Service    │  │ Review & Rating  │            │ │  │    │
│  │  │  │  │ (2-10 tasks)    │  │ Service          │            │ │  │    │
│  │  │  │  │ 1vCPU/2GB       │  │ (2-10 tasks)     │            │ │  │    │
│  │  │  │  └─────────────────┘  └──────────────────┘            │ │  │    │
│  │  │  └────────────────────────────────────────────────────────┘ │  │    │
│  │  │                                                               │  │    │
│  │  │  ┌────────────────────────────────────────────────────────┐ │  │    │
│  │  │  │         AWS LAMBDA FUNCTIONS                           │ │  │    │
│  │  │  │  [Image Processor] [Search Indexer] [Email Service]   │ │  │    │
│  │  │  │  [Analytics Aggregator]                                │ │  │    │
│  │  │  └────────────────────────────────────────────────────────┘ │  │    │
│  │  │                                                               │  │    │
│  │  │  ┌────────────────────────────────────────────────────────┐ │  │    │
│  │  │  │    ElastiCache for Redis (Cluster Mode)               │ │  │    │
│  │  │  │    - 2x cache.r6g.large nodes                          │ │  │    │
│  │  │  │    - Session cache, API cache, Listing cache          │ │  │    │
│  │  │  └────────────────────────────────────────────────────────┘ │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                      │    │
│  │  ┌──────────────────────────────────────────────────────────────┐  │    │
│  │  │      PRIVATE SUBNETS - DATA TIER (Multi-AZ)                  │  │    │
│  │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │  │    │
│  │  │  │ 10.0.21.0/24 │  │ 10.0.22.0/24 │  │ 10.0.23.0/24 │       │  │    │
│  │  │  │   (AZ-a)     │  │   (AZ-b)     │  │   (AZ-c)     │       │  │    │
│  │  │  └──────────────┘  └──────────────┘  └──────────────┘       │  │    │
│  │  │                                                               │  │    │
│  │  │  ┌────────────────────────────────────────────────────────┐ │  │    │
│  │  │  │   Amazon Aurora PostgreSQL (Multi-AZ)                  │ │  │    │
│  │  │  │   - Primary: db.r6g.xlarge (AZ-a)                      │ │  │    │
│  │  │  │   - Replica: db.r6g.large (AZ-b)                       │ │  │    │
│  │  │  │   - Replica: db.r6g.large (AZ-c)                       │ │  │    │
│  │  │  │   [Business, Users, Subscriptions, Transactions]       │ │  │    │
│  │  │  └────────────────────────────────────────────────────────┘ │  │    │
│  │  │                                                               │  │    │
│  │  │  ┌────────────────────────────────────────────────────────┐ │  │    │
│  │  │  │   Amazon OpenSearch Service (Multi-AZ)                 │ │  │    │
│  │  │  │   - 3x c6g.large.search (Master nodes)                 │ │  │    │
│  │  │  │   - 6x r6g.xlarge.search (Data nodes)                  │ │  │    │
│  │  │  │   [Full-text search, Geospatial queries, Analytics]    │ │  │    │
│  │  │  └────────────────────────────────────────────────────────┘ │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐    │
│  │                    REGIONAL MANAGED SERVICES                        │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐           │    │
│  │  │  DynamoDB    │  │  S3 Buckets  │  │  SQS Queues     │           │    │
│  │  │  - Sessions  │  │  - Images    │  │  - Events       │           │    │
│  │  │  - Analytics │  │  - Backups   │  │  - Async Jobs   │           │    │
│  │  └──────────────┘  └──────────────┘  └─────────────────┘           │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐           │    │
│  │  │  SNS Topics  │  │  SES         │  │  EventBridge    │           │    │
│  │  │  - Alerts    │  │  - Emails    │  │  - Event Router │           │    │
│  │  └──────────────┘  └──────────────┘  └─────────────────┘           │    │
│  └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐    │
│  │                  MONITORING & SECURITY                              │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐           │    │
│  │  │  CloudWatch  │  │  X-Ray       │  │  CloudTrail     │           │    │
│  │  │  - Logs      │  │  - Tracing   │  │  - Audit Logs   │           │    │
│  │  │  - Metrics   │  │              │  │                 │           │    │
│  │  └──────────────┘  └──────────────┘  └─────────────────┘           │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐           │    │
│  │  │  Secrets Mgr │  │  KMS         │  │  Security Hub   │           │    │
│  │  │  - Creds     │  │  - Encrypt   │  │  - Compliance   │           │    │
│  │  └──────────────┘  └──────────────┘  └─────────────────┘           │    │
│  └────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                  CI/CD PIPELINE (GitHub / AWS)                               │
│  [GitHub] → [GitHub Actions] → [CodeBuild] → [ECR] → [CodeDeploy] → [ECS]  │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│         DISASTER RECOVERY REGION (us-west-2 / Secondary)                    │
│  [Standby Aurora Replica] [S3 Cross-Region Replication] [AMI Backups]      │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow:

  1. User requests → Route 53 (DNS resolution) → CloudFront → WAF → ALB
  2. ALB → ECS Services (API Gateway → Business/User/Review services)
  3. Services → Aurora (write), Read Replicas (read), OpenSearch (search)
  4. Services → ElastiCache (cache check) → DynamoDB (sessions/analytics)
  5. Async operations → SQS → Lambda → S3/OpenSearch/SNS
  6. All logs → CloudWatch, traces → X-Ray, audit → CloudTrail
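
Step 4 of the data flow (check the cache first, then fall back to the database) is the cache-aside pattern. A minimal sketch follows, with a small in-memory class standing in for Redis so the example runs on its own. `TTLCache`, `get_listing`, the `listing:` key prefix, and the 300-second TTL are all illustrative assumptions, not part of the original design.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory stand-in for Redis GET/SETEX (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave like a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def get_listing(
    listing_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 300.0,
) -> str:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"listing:{listing_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(listing_id)  # e.g. a query against an Aurora read replica
    cache.setex(key, ttl_seconds, value)
    return value


if __name__ == "__main__":
    cache = TTLCache()
    print(get_listing("42", cache, lambda i: f"listing-row-{i}"))  # miss, loads from "db"
    print(get_listing("42", cache, lambda i: "should not be called"))  # hit
```

On writes, the service would update Aurora and then delete or overwrite the cached key, so stale listings last no longer than the TTL.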

4. High Availability & Disaster Recovery

Multi-AZ Deployment Strategy

  • Compute: ECS tasks distributed across 3 AZs (us-east-1a, 1b, 1c)
  • Database: Aurora primary in AZ-a, replicas in AZ-b and AZ-c with automatic failover (30-120 seconds)
  • Search: OpenSearch deployed across 3 AZs with 1 replica shard per index
  • Cache: ElastiCache cluster mode with nodes in multiple AZs
  • Load Balancer: ALB with cross-zone load balancing enabled
  • NAT Gateways: 3 NAT Gateways (one per AZ) to eliminate single points of failure

Auto-Scaling Policies

ECS Service Auto-Scaling:

  • Metric: Target CPU 70%, Memory 80%
  • Scale-out: Add 50% capacity when threshold exceeded for 2 minutes
  • Scale-in: Remove 25% capacity when below 40% for 10 minutes
  • Cooldown: 60s scale-out, 300s scale-in
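
The scale-out and scale-in rules above can be read as a simple step function on the current task count. The sketch below is a simplification: it looks at CPU only, ignores the 2-minute and 10-minute evaluation periods and the cooldowns, and is not how Application Auto Scaling is implemented internally. The function name and defaults are assumptions taken from the numbers in this section.

```python
import math


def next_task_count(
    current: int,
    cpu_percent: float,
    *,
    min_tasks: int = 2,
    max_tasks: int = 20,
    scale_out_above: float = 70.0,
    scale_in_below: float = 40.0,
) -> int:
    """Apply the +50% / -25% step rules and clamp to the service limits."""
    if cpu_percent > scale_out_above:
        desired = math.ceil(current * 1.5)    # add 50% capacity, rounding up
    elif cpu_percent < scale_in_below:
        desired = math.floor(current * 0.75)  # remove 25% capacity, rounding down
    else:
        desired = current                     # inside the band: no change
    return max(min_tasks, min(max_tasks, desired))


if __name__ == "__main__":
    for tasks, cpu in [(4, 85.0), (4, 30.0), (2, 10.0), (16, 90.0)]:
        print(tasks, cpu, "->", next_task_count(tasks, cpu))
```

The gap between the 70% and 40% thresholds acts as a dead band, which, together with the longer scale-in cooldown, helps avoid flapping.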

Aurora Read Replica Auto-Scaling:

  • Trigger: CPU > 75% for 5 minutes
  • Min replicas: 2, Max replicas: 5
  • Scale-in: CPU < 40% for 15 minutes

OpenSearch Auto-Scaling:

  • Storage: Alert when 80% full and increase the EBS volume size (up to 3TB per node); provisioned domains do not grow storage on their own
  • Manual scaling for data nodes based on query performance

Backup & Restore

Aurora PostgreSQL:

  • Automated backups: Daily, 7-day retention
  • Manual snapshots: Weekly, 30-day retention
  • Point-in-time recovery: Restore to any point within the 7-day retention window (the latest restorable time is typically within the last 5 minutes)
  • Cross-region backup: Daily snapshot copy to us-west-2

OpenSearch:

  • Automated snapshots: Hourly, taken and stored by the service (retained for 14 days)
  • Manual snapshots: Daily to S3, 14-day retention
  • Restore time: ~15-30 minutes for 100GB index

DynamoDB:

  • Point-in-time recovery (PITR): Enabled, 35-day retention
  • On-demand backups: Weekly to S3

S3:

  • Versioning: Enabled on all buckets
  • Cross-region replication: Critical buckets to us-west-2
  • Lifecycle policies: Transition to Glacier after 365 days

RTO/RPO Targets

| Component    | RPO (Data Loss) | RTO (Downtime) | Mechanism                   |
|--------------|-----------------|----------------|-----------------------------|
| Aurora DB    | < 5 minutes     | < 2 minutes    | Multi-AZ automated failover |
| OpenSearch   | < 1 hour        | < 30 minutes   | Snapshot restore            |
| DynamoDB     | < 1 second      | < 1 minute     | Multi-AZ replication        |
| ECS Services | 0 (stateless)   | < 1 minute     | Auto-scaling, health checks |
| S3           | 0 (versioning)  | Immediate      | Multi-AZ storage            |

Failover Mechanisms

  • DNS Failover: Route 53 health checks with automatic failover to DR region (TTL: 60s)
  • Database Failover: Aurora automatic failover to read replica (30-120s)
  • Application Failover: ALB health checks remove unhealthy targets in 30s
  • Cache Failover: ElastiCache automatic node replacement in cluster mode

5. Security Implementation

Network Security

Security Groups:

  • ALB-SG: Inbound 443 (0.0.0.0/0), Outbound 8080 (ECS-SG)
  • ECS-SG: Inbound 8080 (ALB-SG), Outbound 443 (all, including OpenSearch-SG), 5432 (RDS-SG), 6379 (Cache-SG)
  • RDS-SG: Inbound 5432 (ECS-SG), Outbound none
  • OpenSearch-SG: Inbound 443 (ECS-SG), Outbound none (Amazon OpenSearch Service domains accept HTTPS on 443, not 9200/9300)
  • Cache-SG: Inbound 6379 (ECS-SG), Outbound none

NACLs:

  • Public subnets: Allow 80, 443 inbound, ephemeral outbound
  • Private subnets: Deny all inbound from internet, allow VPC CIDR
  • Data subnets: Deny all except from application subnet CIDR

AWS WAF Rules:

  • Rate limiting: 2000 requests per 5 minutes per IP
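  • SQL Injection: AWS Managed Rules SQL database rule group (AWSManagedRulesSQLiRuleSet, e.g. SQLi_QUERYARGUMENTS)
  • XSS: AWS Managed Rules core rule set (AWSManagedRulesCommonRuleSet, e.g. CrossSiteScripting_BODY, CrossSiteScripting_COOKIE)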
  • Geographic blocking: Block traffic from high-risk countries (optional)
  • IP reputation lists: AWS IP reputation managed rule group
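
To illustrate what "2000 requests per 5 minutes per IP" means in practice, here is a small sliding-window limiter. It only models the semantics of the rule. AWS WAF evaluates rate-based rules on its own schedule and this is not its implementation; the class name and structure are assumptions for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within the trailing window."""

    def __init__(self, limit: int = 2000, window_seconds: float = 300.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block this request
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=3, window_seconds=300)
    for t in [0, 1, 2, 3, 300]:
        print(t, limiter.allow("203.0.113.7", now=float(t)))
```

At the default limit this works out to an average of under 7 requests per second per IP, so clients behind a shared NAT or corporate proxy may need an allow list or a higher threshold.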

IAM Roles & Policies (Least Privilege)

ECS Task Execution Role:

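{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "*"
    }
  ]
}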

ECS Task Role (per service):

  • Business Service: RDS access, S3 read/write, OpenSearch write
  • User Service: RDS access, DynamoDB access, SES send
  • API Gateway: No direct resource access (delegates to services)

Lambda Execution Roles:

  • Image Processor: S3 read/write, CloudWatch Logs
  • Search Indexer: OpenSearch write, SQS read, CloudWatch Logs

Data Encryption

At-Rest:

  • Aurora: KMS encryption (customer-managed key: alias/directory-db)
  • OpenSearch: KMS encryption (customer-managed key: alias/directory-search)
  • DynamoDB: KMS encryption (customer-managed key: alias/directory-nosql)
  • S3: SSE-S3 for non-sensitive, SSE-KMS for sensitive data
  • EBS (OpenSearch): KMS encryption enabled

In-Transit:

  • ALB → Clients: TLS 1.2+ (ACM certificate)
  • ECS → RDS: TLS enforced (rds.force_ssl=1 in the cluster parameter group)
  • ECS → OpenSearch: HTTPS only
  • ECS → ElastiCache: Redis AUTH + TLS enabled
  • Inter-service: Internal ALB with TLS

Secrets Management

  • AWS Secrets Manager: Database passwords, API keys, OAuth tokens
  • Rotation: Automated 30-day rotation for RDS credentials
  • Access: IAM policy enforcement, CloudTrail logging of all access
  • Encryption: All secrets encrypted with KMS

Compliance Considerations

  • PCI-DSS: If handling payments (Stripe integration reduces scope)
  • GDPR: Data residency controls, encryption, right to deletion (S3 lifecycle)
  • SOC 2: CloudTrail audit logs, encryption at rest/in transit
  • HIPAA: Not applicable unless health-related businesses require it

DDoS Protection

  • AWS Shield Standard: Automatic protection (included)
  • AWS Shield Advanced: Optional ($3,000/month) for advanced protection
  • CloudFront: Absorbs layer 3/4 attacks
  • WAF Rate Limiting: Application-layer protection
  • Auto-scaling: Absorbs traffic spikes

6. Well-Architected Framework Alignment

Operational Excellence

  • Infrastructure as Code: 100% Terraform-managed infrastructure, version controlled in Git
  • Monitoring: CloudWatch dashboards for all services, custom metrics for business KPIs (searches/min, listings created/hour)
  • Alerting: SNS notifications for critical alarms (CPU > 85%, error rate > 1%, latency > 2s)
  • Automation: CI/CD pipeline with automated testing, blue-green deployments, automated backups
  • Runbooks: Documented incident response procedures in Confluence/Notion
  • Game Days: Quarterly chaos engineering exercises (failover testing)

Security

  • Identity Management: IAM roles with least privilege, MFA enforced for console access, OIDC for CI/CD
  • Detective Controls: GuardDuty threat detection, CloudTrail audit logging (90-day retention), Security Hub compliance dashboards
  • Data Protection: KMS encryption (at-rest), TLS 1.2+ (in-transit), Secrets Manager rotation, S3 versioning
  • Incident Response: Automated alerting via SNS, CloudWatch Logs Insights for forensics, AWS Config for compliance tracking
  • Network Protection: VPC isolation, security groups, NACLs, WAF rules, private subnets for data tier

Reliability

  • Fault Tolerance: Multi-AZ deployment (3 AZs), ECS tasks across AZs, Aurora Multi-AZ, OpenSearch replicas
  • Backup Strategy: Automated daily backups (Aurora, OpenSearch, DynamoDB PITR), cross-region replication for critical data
  • Auto-Healing: ECS health checks replace failed tasks, Aurora automatic failover, ALB removes unhealthy targets
  • Change Management: Blue-green deployments, canary releases, automated rollback on failure
  • Monitoring: Real-time CloudWatch metrics, X-Ray distributed tracing, synthetic monitoring (CloudWatch Synthetics)

Performance Efficiency

  • Right-Sizing: Graviton2 instances (r6g, c6g) for 20% better price-performance, Auto-scaling based on metrics
  • Caching: CloudFront CDN (global edge), ElastiCache Redis (API responses, sessions, listings), Aurora buffer cache
  • Database Optimization: Read replicas for read-heavy workloads, Aurora I/O-Optimized for predictable costs
  • Search Optimization: OpenSearch with proper shard sizing (10-50GB per shard), hot/warm architecture for time-series data
  • CDN Usage: CloudFront for static assets, images, and optionally API responses (reduces origin load by 60-80%)

Cost Optimization

  • Resource Optimization: Fargate Spot for non-critical tasks (up to 70% savings), S3 Intelligent-Tiering, EBS gp3 over gp2
  • Reserved Capacity: 1-year RDS Reserved Instances (up to 40% savings), ElastiCache Reserved Nodes (up to 30% savings)
  • Savings Plans: Compute Savings Plans for ECS Fargate (up to 50% savings)
  • Rightsizing: CloudWatch metrics to identify underutilized resources, Lambda for event-driven tasks
  • Monitoring: AWS Cost Explorer, Budget alerts at 80% threshold, Trusted Advisor cost checks

Sustainability

  • Resource Efficiency: Graviton2 processors (60% better energy efficiency), auto-scaling prevents idle resources
  • Minimal Idle: Shut down dev/staging environments off-hours (Lambda scheduler), DynamoDB on-demand for variable workloads
  • Managed Services: Leverage AWS-managed services (reduced carbon footprint vs self-managed)
  • Data Lifecycle: S3 lifecycle policies archive old data, delete unnecessary logs after 30 days

7. Deployment Flow

Step-by-Step Deployment Process

Phase 1: Infrastructure Provisioning (Terraform)

  1. VPC & Networking: Deploy VPC, subnets, NAT gateways, route tables, security groups
  2. Data Layer: Provision Aurora cluster, DynamoDB tables, OpenSearch domain, ElastiCache cluster
  3. Compute Layer: Create ECS cluster, task definitions, ALB, target groups
  4. Storage: Create S3 buckets with versioning, lifecycle policies
  5. Security: Configure KMS keys, Secrets Manager secrets, IAM roles/policies
  6. Monitoring: Set up CloudWatch log groups, dashboards, alarms, SNS topics

Phase 2: Application Deployment

  1. Container Build: GitHub Actions triggers on merge to main
  2. CodeBuild: Builds Docker images, runs unit tests (Jest/Mocha)
  3. Security Scan: Trivy/Snyk scans images for vulnerabilities
  4. ECR Push: Successful builds push to Amazon ECR
  5. Database Migration: Run Flyway/Liquibase migrations (automated in CodePipeline)
  6. ECS Deployment: CodeDeploy updates ECS services with new task definitions

CI/CD Pipeline Architecture

GitHub → GitHub Actions → CodeBuild → ECR → CodeDeploy → ECS
   │           │              │          │         │         │
   │           │              │          │         │         └─→ Health Checks
   │           │              │          │         └─→ Blue/Green Deploy
   │           │              │          └─→ Image Versioning
   │           │              └─→ Unit/Integration Tests
   │           └─→ Terraform Plan (on PR)
   └─→ Trigger on Push/PR

GitHub Actions Workflow:

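name: Deploy to Production
on:
  push:
    branches: [main]
permissions:
  id-token: write  # lets the job request an OIDC token for AWS
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}  # hypothetical secret name
          aws-region: us-east-1
      - name: Log in to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build Docker image
        run: docker build -t "${{ steps.ecr.outputs.registry }}/business-service:${{ github.sha }}" .  # hypothetical repository name
      - name: Run tests
        run: npm ci && npm test  # assumes a Node.js service
      - name: Push to ECR
        run: docker push "${{ steps.ecr.outputs.registry }}/business-service:${{ github.sha }}"
      # Steps still to be filled in for the target environment:
      # - Update ECS task definition
      # - Trigger CodeDeploy (Blue/Green)
      # - Run smoke tests
      # - Notify Slack/Email on status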

Blue-Green Deployment Strategy

ECS Blue/Green with CodeDeploy:

  1. Blue (Current): Production traffic on task set v1.2.3
  2. Green (New): Deploy task set v1.2.4 to same cluster
  3. Test Traffic: Route 10% of traffic to Green for 5 minutes
  4. Health Check: Monitor error rate, latency, and success rate on Green
  5. Full Cutover: If healthy, route 100% of traffic to Green
  6. Terminate Blue: Keep Blue for 1 hour, then terminate
  7. Rollback: If issues arise, shift traffic back to Blue (< 60s)

Canary Deployment (Alternative for Lambda):

  • Deploy new Lambda version
  • Route 10% traffic → wait 5 min → 25% → wait 10 min → 50% → 100%
  • Automated rollback on CloudWatch alarms (error rate, duration)
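
The stepped rollout above can be sketched as a small simulation. This is a minimal illustration, not CodeDeploy's implementation: `CANARY_STEPS`, `run_canary`, and `total_hold_minutes` are hypothetical names, the `alarm_firing` callback stands in for a CloudWatch alarm check, and the hold time after the 50% step is assumed to be zero because the text does not state one.

```python
from typing import Callable, List, Sequence, Tuple

# (percent of traffic on the new version, minutes to hold before the next step).
# The text gives no hold time after 50%, so 0 is assumed there.
CANARY_STEPS: List[Tuple[int, int]] = [(10, 5), (25, 10), (50, 0), (100, 0)]


def run_canary(
    steps: Sequence[Tuple[int, int]],
    alarm_firing: Callable[[int], bool],
) -> Tuple[int, List[int]]:
    """Walk the steps in order and return (final_percent, traffic_history)."""
    history: List[int] = []
    for percent, _hold_minutes in steps:
        history.append(percent)
        # A real rollout would hold for _hold_minutes here before checking.
        if alarm_firing(percent):
            history.append(0)  # rollback: all traffic returns to the old version
            return 0, history
    return (history[-1] if history else 0), history


def total_hold_minutes(steps: Sequence[Tuple[int, int]]) -> int:
    """Minimum wall-clock time spent holding between steps."""
    return sum(hold for _, hold in steps)
```

Calling `run_canary(CANARY_STEPS, lambda percent: False)` walks all four steps, while a callback that fires at 25% ends the rollout with traffic back at 0% on the new version.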

Rollback Procedures

Automated Rollback:

  • CodeDeploy: Automatic rollback on CloudWatch alarm (error rate > 1%)
  • Alarm triggers: HTTP 5xx > 10 requests/min, P99 latency > 3s
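
As a rough sketch of how these rollback triggers combine, the function below evaluates the three thresholds from the list above. The `DeploymentMetrics` type and the function names are illustrative assumptions; in practice the checks would live in CloudWatch alarms rather than application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeploymentMetrics:
    error_rate_pct: float      # share of failed requests, in percent
    http_5xx_per_min: float    # HTTP 5xx responses per minute
    p99_latency_s: float       # P99 response time in seconds


def breached_alarms(metrics: DeploymentMetrics) -> List[str]:
    """Return the names of the rollback thresholds that are exceeded."""
    breached = []
    if metrics.error_rate_pct > 1.0:
        breached.append("error_rate")
    if metrics.http_5xx_per_min > 10:
        breached.append("http_5xx")
    if metrics.p99_latency_s > 3.0:
        breached.append("p99_latency")
    return breached


def should_roll_back(metrics: DeploymentMetrics) -> bool:
    """Any single breached threshold is enough to trigger a rollback."""
    return bool(breached_alarms(metrics))
```

Treating any single breach as sufficient is an assumption here; a stricter policy could require two consecutive evaluation periods before rolling back.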

Manual Rollback:

  1. Identify previous stable task definition/image tag
  2. Update ECS service with previous task definition
  3. Force new deployment (drains old tasks, starts new)
  4. Verify health via CloudWatch metrics and logs
  5. Time: < 5 minutes for complete rollback

8. Monitoring & Operations

Key Metrics to Monitor

Application Metrics:

  • Request Rate: Requests per second (RPS), searches per minute
  • Latency: P50, P90, P99, P99.9 response times
  • Error Rate: HTTP 4xx, 5xx errors per minute
  • Availability: Uptime percentage (target: 99.95%)
  • Business Metrics: New listings/hour, user registrations/day, search conversion rate
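
To make the latency percentiles concrete, here is a minimal sketch using the nearest-rank method. CloudWatch computes percentiles for you; the `percentile` helper below is a hypothetical stand-in that only shows what P50, P90, and P99 mean for a batch of samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]
```

With 100 samples of 1 ms through 100 ms, P50 is the 50th value and P99 the 99th, which is why a handful of slow requests moves P99 long before it moves the median.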

Infrastructure Metrics:

  • ECS: CPU utilization, memory utilization, task count, health check failures
  • Aurora: CPU, connections, read/write latency, replica lag, deadlocks
  • OpenSearch: Cluster status, JVM memory, indexing rate, search latency, shard status
  • ElastiCache: CPU, evictions, cache hit rate, connections, network I/O
  • ALB: Target response time, healthy/unhealthy host count, request count, 5xx errors

Alerting Thresholds

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| ECS CPU | > 70% | > 85% | Scale out tasks |
| ECS Memory | > 75% | > 90% | Scale out tasks |
| Aurora CPU | > 70% | > 85% | Add read replica |
| Aurora Connections | > 500 | > 700 | Investigate leaks |
| OpenSearch JVM | > 75% | > 85% | Scale data nodes |
| ElastiCache Hit Rate | < 80% | < 60% | Review cache strategy |
| API Latency P99 | > 2s | > 3s | Investigate bottleneck |
| Error Rate | > 0.5% | > 1% | Page on-call engineer |
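
The thresholds above can be expressed as data, which makes the one inverted metric (cache hit rate, where lower is worse) explicit. This is a sketch under assumed names: the metric keys, `Threshold`, and `classify` are hypothetical, and real alerting would be configured as CloudWatch alarms.

```python
from typing import Dict, NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float
    higher_is_worse: bool = True


# Values copied from the alerting table; key names are illustrative.
THRESHOLDS: Dict[str, Threshold] = {
    "ecs_cpu_pct": Threshold(70, 85),
    "ecs_memory_pct": Threshold(75, 90),
    "aurora_cpu_pct": Threshold(70, 85),
    "aurora_connections": Threshold(500, 700),
    "opensearch_jvm_pct": Threshold(75, 85),
    "elasticache_hit_rate_pct": Threshold(80, 60, higher_is_worse=False),
    "api_latency_p99_s": Threshold(2, 3),
    "error_rate_pct": Threshold(0.5, 1),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to OK, WARNING, or CRITICAL."""
    threshold = THRESHOLDS[metric]
    if threshold.higher_is_worse:
        if value > threshold.critical:
            return "CRITICAL"
        if value > threshold.warning:
            return "WARNING"
    else:
        if value < threshold.critical:
            return "CRITICAL"
        if value < threshold.warning:
            return "WARNING"
    return "OK"
```

Comparisons are strict, matching the `>` and `<` signs in the table, so a reading exactly at a threshold stays in the lower severity.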

Log Aggregation Strategy

CloudWatch Logs:

  • Application Logs: /aws/ecs/directory-api, /aws/ecs/directory-business
  • Access Logs: /aws/alb/directory-alb
  • Lambda Logs: /aws/lambda/directory-*
  • Database Logs: Aurora slow query logs (queries > 1s)
  • Retention: 30 days (compliance), export to S3 for long-term storage

Log Analysis:

  • CloudWatch Logs Insights: Query logs for patterns, errors, slow requests
  • X-Ray Service Map: Visualize service dependencies, trace requests end-to-end
  • Example Query: fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc

Dashboard Requirements

Operational Dashboard (Real-time):

  • Service health status (green/yellow/red)
  • Request rate, error rate, latency (last 1 hour)
  • Active ECS tasks, database connections
  • OpenSearch cluster health, cache hit rate
  • Current auto-scaling activity

Business Dashboard (Daily/Weekly):

  • Total business listings (active/inactive)
  • New user registrations, daily active users
  • Search queries (total, by category, by location)
  • Revenue metrics (premium listings, ad clicks)
  • Conversion funnel (search → view → contact)

Cost Dashboard:

  • Daily spend by service (ECS Fargate, Aurora, OpenSearch, data transfer)
  • Month-to-date vs budget
  • Forecast for month-end spending
  • Top 10 cost drivers

Incident Response Workflow

  1. Detection: CloudWatch alarm triggers SNS notification to PagerDuty/Slack
  2. Acknowledgment: On-call engineer acknowledges within 5 minutes
  3. Investigation: Check CloudWatch dashboards, logs, X-Ray traces
  4. Mitigation: Execute runbook (rollback, scale up, restart service)
  5. Communication: Update status page, notify stakeholders
  6. Resolution: Verify metrics return to normal, close incident
  7. Post-Mortem: Document root cause, corrective actions (within 48 hours)

9. Cost Estimation

Production Environment Monthly Costs (Assumptions: 1M listings, 10M searches/month, 100K DAU)

| Service | Configuration | Quantity | Unit Cost | Monthly Cost |
| --- | --- | --- | --- | --- |
| Compute (subtotal) | | | | \$1,458 |
| ECS Fargate | 2vCPU, 4GB (API Gateway) | 10 tasks avg | \$0.08468/hr | \$622 |
| ECS Fargate | 2vCPU, 4GB (Business) | 8 tasks avg | \$0.08468/hr | \$498 |
| ECS Fargate | 1vCPU, 2GB (User/Review) | 8 tasks avg | \$0.04234/hr | \$248 |
| Lambda | 512MB, 5M invocations | 10s avg | \$0.20/1M | \$90 |
| Database (subtotal) | | | | \$1,167 |
| Aurora Primary | db.r6g.xlarge | 1 instance | \$0.52/hr | \$380 |
| Aurora Replicas | db.r6g.large | 2 instances | \$0.26/hr ea. | \$380 |
| Aurora Storage | 500GB | 500GB | \$0.10/GB | \$50 |
| Aurora I/O | I/O-Optimized | Included | \$0 | \$0 |
| Aurora Backup | 500GB | 500GB | \$0.021/GB | \$11 |
| DynamoDB | On-demand | 10GB, 10M R, 2M W | Variable | \$26 |
| ElastiCache | cache.r6g.large | 2 nodes | \$0.218/hr | \$320 |
| Search (subtotal) | | | | \$1,876 |
| OpenSearch Master | c6g.large.search | 3 nodes | \$0.113/hr | \$248 |
| OpenSearch Data | r6g.xlarge.search | 6 nodes | \$0.371/hr | \$1,628 |
| OpenSearch Storage | gp3 200GB per node | 1200GB | Included | \$0 |
| Storage (subtotal) | | | | \$207 |
| S3 Standard | Images, assets | 2TB | \$0.023/GB | \$47 |
| S3 Requests | PUT/GET | 100M | \$0.005/10K | \$50 |
| S3 Data Transfer | Out to internet | 1TB | \$0.09/GB | \$90 |
| EBS Snapshots | Backups | 400GB | \$0.05/GB | \$20 |
| Networking (subtotal) | | | | \$387 |
| ALB | 2 ALBs | 730 hrs | \$0.0252/hr | \$37 |
| ALB LCU | ~2 LCU avg | 1460 hrs | \$0.008/hr | \$12 |
| NAT Gateway | 3 NAT Gateways | 2190 hrs | \$0.045/hr | \$99 |
| NAT Data Transfer | 1TB processed | 1TB | \$0.045/GB | \$46 |
| CloudFront | 2TB out, 100M req | Variable | \$0.085/GB | \$193 |
| Security & Mgmt (subtotal) | | | | \$286 |
| Secrets Manager | 10 secrets | 10 | \$0.40/secret | \$4 |
| KMS | 3 keys, 1M requests | 3 + requests | \$1 + \$0.03/10K | \$7 |
| WAF | 1 ACL, 5 rules | Variable | \$5 + \$1/rule | \$10 |
| CloudWatch Logs | 50GB ingested | 50GB | \$0.50/GB | \$25 |
| CloudWatch Metrics | 500 custom | 500 | \$0.30/metric | \$150 |
| GuardDuty | Account analysis | 1 account | ~\$3/day | \$90 |
| Others (subtotal) | | | | \$78 |
| Route 53 | 1 hosted zone | 1 | \$0.50/zone | \$1 |
| Route 53 Queries | 100M queries | 100M | \$0.40/1M | \$40 |
| SES | 100K emails | 100K | \$0.10/1K | \$10 |
| SNS | 10K notifications | 10K | \$0.50/1M | \$1 |
| SQS | 50M requests | 50M | \$0.40/1M | \$20 |
| CodePipeline | 1 pipeline | 1 | \$1/pipeline | \$1 |
| ECR Storage | 50GB | 50GB | \$0.10/GB | \$5 |
| TOTAL PRODUCTION | | | | \$5,459/month |

Development Environment Monthly Costs

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| ECS Fargate | 50% of prod tasks | \$350 |
| Aurora | db.r6g.large (1 instance) | \$190 |
| OpenSearch | 3 nodes (smaller) | \$600 |
| ElastiCache | 1 node | \$160 |
| Other services | 30% of prod | \$400 |
| TOTAL DEV | | \$1,700/month |

Total Estimated Monthly Cost

  • Production: \$5,459
  • Development: \$1,700
  • Total: \$7,159/month (~\$85,908/year)
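
The hourly line items in the production table follow from a simple conversion: hourly rate times instance count times hours in a month. The sketch below assumes 730 hours per month, the figure the ALB row uses; `monthly_cost` is a hypothetical helper, and the rates are the ones quoted in the table rather than current AWS prices.

```python
HOURS_PER_MONTH = 730  # same convention as the ALB row in the table


def monthly_cost(hourly_rate: float, count: int = 1, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost in dollars for `count` resources billed per hour."""
    return hourly_rate * count * hours


# Three rows from the production table, before rounding to whole dollars.
aurora_primary = monthly_cost(0.52)           # db.r6g.xlarge, 1 instance
nat_gateways = monthly_cost(0.045, count=3)   # 3 NAT Gateways
load_balancers = monthly_cost(0.0252, count=2)  # 2 ALBs
```

Rounded to whole dollars these come to roughly 380, 99, and 37, matching the Aurora Primary, NAT Gateway, and ALB rows.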

Cost Optimization Recommendations

  1. Reserved Instances (1-year, No Upfront):
    • Aurora: Save \$3,648/year (40% on \$760/month of instance cost)
    • ElastiCache: Save \$1,152/year (30% on \$320/month)
    • Total Savings: ~\$4,800/year
  2. Compute Savings Plans:
    • ECS Fargate: Save ~\$5,250/year (30% on \$1,458/month compute)
  3. Right-Sizing:
    • Monitor CloudWatch metrics for 30 days, downsize underutilized instances
    • Potential savings: 10-15% (\$500-750/month)
  4. Dev Environment Automation:
    • Auto-shutdown off-hours (nights, weekends): Save ~\$850/month (50% of dev costs)
    • Lambda scheduler to stop/start resources
  5. S3 Optimization:
    • Implement S3 Lifecycle policies (Standard → IA → Glacier)
    • Potential savings: 30% on old assets (\$15-20/month)
  6. OpenSearch Alternative:
    • For lower search volumes, consider Algolia (managed, pay-per-search)
    • Break-even: ~50K searches/month vs the provisioned OpenSearch Service domain

Optimized Production Cost: ~\$4,000-4,500/month with reservations and automation
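
The savings figures in this section are percentage discounts applied to a monthly spend and then annualized. A minimal sketch of that arithmetic, using the ElastiCache and dev-environment numbers from the list above (`annual_savings` and `monthly_savings` are hypothetical helper names, and the discount rates are the article's estimates rather than quoted AWS prices):

```python
def monthly_savings(monthly_spend: float, discount_pct: float) -> float:
    """Dollars saved per month at a given percentage discount."""
    return monthly_spend * discount_pct / 100


def annual_savings(monthly_spend: float, discount_pct: float) -> float:
    """Dollars saved per year at a given percentage discount."""
    return monthly_savings(monthly_spend, discount_pct) * 12


# ElastiCache reserved nodes: 30% off $320/month.
elasticache_ri = annual_savings(320, 30)

# Dev auto-shutdown: 50% off the $1,700/month dev environment.
dev_shutdown = monthly_savings(1700, 50)
```

These evaluate to 1,152 dollars per year and 850 dollars per month respectively, the same figures quoted in items 1 and 4.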


10. Implementation Roadmap

Phase 1: Foundation (Weeks 1-4)

Week 1-2: Infrastructure Setup

  • Set up AWS Organizations, multi-account structure (dev/staging/prod)
  • Configure Terraform state backend (S3 + DynamoDB)
  • Deploy VPC, subnets, security groups, NAT gateways
  • Set up IAM roles, KMS keys, Secrets Manager
  • Deliverable: Complete network infrastructure

Week 3-4: Data Layer

  • Provision Aurora PostgreSQL cluster with read replicas
  • Deploy OpenSearch domain with proper sizing
  • Create DynamoDB tables (sessions, analytics)
  • Set up ElastiCache Redis cluster
  • Configure S3 buckets with lifecycle policies
  • Deliverable: Functional data layer with backups

Phase 2: Application Development (Weeks 5-10)

Week 5-6: Core Services

  • Develop User Service (authentication, registration, profile)
  • Develop Business Service (CRUD, validation, approval workflow)
  • Implement database schema and migrations
  • Unit tests (80% coverage target)
  • Deliverable: Core microservices with tests

Week 7-8: Search & Discovery

  • Integrate OpenSearch with Business Service
  • Implement geospatial search (radius, location-based)
  • Build category taxonomy and filtering
  • Develop Search Service API
  • Deliverable: Working search functionality

Week 9-10: Supporting Services

  • Review & Rating Service
  • Image upload/processing Lambda
  • Email notification service (SES integration)
  • Admin panel backend
  • Deliverable: Complete backend services

Phase 3: Frontend & Integration (Weeks 11-14)

Week 11-12: Web Application

  • Next.js frontend with SSR for SEO
  • Search interface with filters
  • Business listing pages
  • User dashboard
  • Deliverable: Functional web application

Week 13-14: Integration & Testing

  • Integration testing (Cypress/Playwright)
  • Performance testing (JMeter/k6)
  • Security testing (OWASP ZAP)
  • UAT with stakeholders
  • Deliverable: Tested, integrated system

Phase 4: DevOps & Production (Weeks 15-18)

Week 15-16: CI/CD Pipeline

  • Set up GitHub Actions workflows
  • Configure CodePipeline, CodeBuild, CodeDeploy
  • Implement blue-green deployment
  • Container security scanning
  • Deliverable: Automated deployment pipeline

Week 17: Monitoring & Observability

  • CloudWatch dashboards and alarms
  • X-Ray distributed tracing
  • Log aggregation and analysis setup
  • PagerDuty/Slack integration
  • Deliverable: Complete monitoring system

Week 18: Production Deployment

  • Production infrastructure deployment
  • Database migration and seed data
  • DNS cutover (Route 53)
  • Go-live checklist execution
  • Deliverable: Live production system

Phase 5: Optimization & Scaling (Weeks 19-22)

Week 19-20: Performance Optimization

  • Implement caching strategies
  • Database query optimization
  • OpenSearch index tuning
  • CDN configuration
  • Deliverable: Optimized performance

Week 21-22: Documentation & Handover

  • Architecture documentation
  • Runbooks and playbooks
  • Team training
  • Knowledge transfer
  • Deliverable: Complete documentation

Timeline Estimate: 22 weeks (about 5 months)

Critical Path Items

  1. VPC and networking setup (blocking all else)
  2. Database provisioning (blocking application development)
  3. Core services development (blocking frontend)
  4. OpenSearch integration (blocking search features)
  5. CI/CD pipeline (blocking production deployment)

Team Skill Requirements

| Role | Count | Skills Required |
| --- | --- | --- |
| Solutions Architect | 1 | AWS, System Design, Terraform |
| Backend Engineers | 3 | Node.js/Python, Microservices, Databases |
| Frontend Engineers | 2 | React, Next.js, TypeScript |
| DevOps Engineer | 1 | Terraform, CI/CD, AWS, Docker |
| QA Engineer | 1 | Testing frameworks, Automation |
| Product Manager | 1 | Requirements, Stakeholder management |

Total Team: 9 people


11. Assumptions & Prerequisites

Traffic/User Load Assumptions

  • Daily Active Users (DAU): 100,000
  • Monthly Active Users (MAU): 500,000
  • Peak Concurrent Users: 10,000
  • Average Requests per User: 20/session
  • Search Queries: 10 million/month
  • New Listings: 10,000/month
  • Total Business Listings: 1 million (initial), growing 1% monthly
  • Peak Traffic: 3x average (during business hours, marketing campaigns)
  • Geographic Distribution: 70% US, 20% EU, 10% APAC
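
These load assumptions translate into request rates with a little arithmetic. The sketch below assumes one session per active user per day and a 30-day month, neither of which is stated in the list, so treat the results as order-of-magnitude estimates. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption for converting monthly totals


def average_rps(
    daily_active_users: int,
    requests_per_session: int,
    sessions_per_user_per_day: float = 1.0,  # assumption: not given in the text
) -> float:
    """Average requests per second over a full day."""
    daily_requests = daily_active_users * requests_per_session * sessions_per_user_per_day
    return daily_requests / SECONDS_PER_DAY


def monthly_to_per_second(events_per_month: int) -> float:
    """Average events per second for a monthly total."""
    return events_per_month / (DAYS_PER_MONTH * SECONDS_PER_DAY)


avg_rps = average_rps(100_000, 20)          # about 23 requests/second
peak_rps = avg_rps * 3                      # 3x peak factor, about 69 requests/second
search_qps = monthly_to_per_second(10_000_000)  # about 3.9 searches/second
```

Averages this low suggest that the minimum task counts, not the average load, drive the baseline compute cost; the peak factor and concurrency figure matter more for auto-scaling limits.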

Data Volume Assumptions

  • Database Size: 500GB initially, growing 50GB/month
  • Images/Assets: 2TB initially, growing 100GB/month
  • Log Data: 50GB/month
  • Backup Storage: 1TB total
  • OpenSearch Index: 100GB initially, growing 10GB/month
  • Average Business Listing: 5KB (text + metadata)
  • Average Image: 500KB (after compression)
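
Because each data store is assumed to grow by a fixed amount per month, capacity at any future point is a linear projection. A minimal sketch, with `projected_size_gb` as a hypothetical helper and 2TB treated as 2,000GB for simplicity:

```python
def projected_size_gb(initial_gb: float, growth_gb_per_month: float, months: int) -> float:
    """Linear projection: starting size plus fixed monthly growth."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return initial_gb + growth_gb_per_month * months


# Twelve-month projections from the assumptions listed above.
database_gb = projected_size_gb(500, 50, 12)
images_gb = projected_size_gb(2000, 100, 12)   # 2TB taken as 2,000GB
search_index_gb = projected_size_gb(100, 10, 12)
```

After a year this gives about 1.1TB of database storage, 3.2TB of images, and a 220GB search index, which is useful when checking the storage rows of the cost estimate against later months.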

Availability Requirements

  • Target Uptime: 99.95% (4.38 hours downtime/year)
  • Maintenance Windows: Monthly, 2 AM - 4 AM EST, < 30 min
  • RTO (Recovery Time Objective): 2 hours
  • RPO (Recovery Point Objective): 5 minutes
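
The 4.38-hour figure follows directly from the uptime target. A short sketch of the calculation, assuming a 365-day year and a 730-hour month (`downtime_budget_hours` is an illustrative name):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760
HOURS_PER_MONTH = 730


def downtime_budget_hours(uptime_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in a period at the given uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100 - uptime_pct) / 100 * period_hours


yearly_hours = downtime_budget_hours(99.95)
monthly_minutes = downtime_budget_hours(99.95, HOURS_PER_MONTH) * 60
```

The monthly budget works out to roughly 22 minutes. If planned maintenance counts against the target, a window approaching 30 minutes would use up the whole month's allowance, so this sketch assumes maintenance is either excluded from the uptime calculation or performed without downtime.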

Required Team Expertise

  • AWS Services: VPC, ECS, RDS Aurora, OpenSearch, CloudFormation/Terraform
  • Programming: Node.js/Python, SQL, JavaScript/TypeScript
  • DevOps: Docker, CI/CD, Infrastructure as Code
  • Databases: PostgreSQL, DynamoDB, Redis, OpenSearch/Elasticsearch
  • Frontend: React, Next.js, responsive design

Existing Infrastructure Considerations

  • Greenfield Deployment: No existing infrastructure (fresh AWS account)
  • Domain Name: Owned, ready to transfer to Route 53
  • SSL Certificates: Will be provisioned via ACM
  • Third-Party Integrations: Stripe account, Google Maps API key
  • Data Migration: Not applicable (new platform)

12. Risks & Mitigations

Technical Risks

| Risk | Impact | Probability | Mitigation |
| --- | --- | --- | --- |
| OpenSearch cost overrun | High | Medium | Monitor query patterns, implement caching, consider Aurora for simple searches |
| Database performance bottleneck | High | Medium | Aurora read replicas, query optimization, caching layer, connection pooling |
| NAT Gateway costs exceed budget | Medium | High | VPC endpoints for AWS services (S3, DynamoDB), review data transfer patterns |
| Lambda cold starts impact UX | Medium | Medium | Provisioned concurrency for critical functions, use ECS for latency-sensitive paths |
| OpenSearch cluster downtime | High | Low | Multi-AZ deployment, automated snapshots, documented restore procedures |
| Data transfer costs | Medium | High | CloudFront caching, compressed assets; avoid S3 Transfer Acceleration unless upload speed justifies its extra per-GB charge |
| Security breach | Critical | Low | WAF, GuardDuty, Security Hub, regular audits, pen testing, compliance checks |
| Vendor lock-in to AWS | Medium | High | Use Terraform (portable IaC), abstract AWS SDK calls, document alternatives |

Mitigation Strategies

Cost Management:

  • Budget Alerts: Set CloudWatch billing alarms at 80%, 90%, 100% of budget
  • Regular Reviews: Monthly cost analysis, identify anomalies
  • Reserved Capacity: Purchase RIs after 3 months of stable usage patterns
  • Right-Sizing: Quarterly review of instance utilization, downsize underutilized
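
The budget alert tiers can be sketched as a simple threshold check. This illustrates the logic only: `triggered_alerts` is a hypothetical function, and in practice the tiers would be configured as billing alarms rather than computed in code. The \$6,000 figure in the usage note is the production budget from the success criteria.

```python
from typing import List, Tuple

ALERT_THRESHOLDS_PCT: Tuple[int, ...] = (80, 90, 100)


def triggered_alerts(month_to_date_spend: float, budget: float) -> List[int]:
    """Return every alert tier (in percent of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 against tier * budget to avoid float division edge cases.
    return [
        tier
        for tier in ALERT_THRESHOLDS_PCT
        if month_to_date_spend * 100 >= tier * budget
    ]
```

For a 6,000 dollar budget, month-to-date spend of 5,000 reaches only the 80% tier, while 5,400 reaches both 80% and 90%.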

Performance Assurance:

  • Load Testing: Pre-launch testing with 2x expected peak load
  • Performance Monitoring: Real-time CloudWatch dashboards, alert on P99 > 2s
  • Capacity Planning: Quarterly forecast based on growth trends
  • Caching Strategy: Multi-layer (CloudFront, ElastiCache, in-memory)

Disaster Recovery:

  • Quarterly DR Drills: Test failover to DR region, measure RTO/RPO
  • Backup Verification: Monthly restore testing from snapshots
  • Chaos Engineering: Simulate failures (random task termination, AZ outage)

Security Hardening:

  • Penetration Testing: Annual third-party pen test
  • Compliance Audits: Quarterly internal audits (SOC 2, GDPR)
  • Security Training: Developer security training, secure coding practices
  • Patch Management: Automated OS patching (Systems Manager Patch Manager)

Alternative Approaches Considered

1. Serverless-First Architecture (Lambda + API Gateway)

  • Pros: Lower cost at low scale, no infrastructure management
  • Cons: Cold starts, timeout limits, complex orchestration, vendor lock-in
  • Rejected: Complex business logic better suited for long-running services

2. Kubernetes (EKS) Instead of ECS

  • Pros: Industry standard, multi-cloud portability, rich ecosystem
  • Cons: Higher operational complexity, steeper learning curve, higher costs
  • Rejected: ECS Fargate simpler for this use case, team expertise

3. Self-Managed Elasticsearch Instead of OpenSearch Service

  • Pros: More control, potentially lower cost
  • Cons: Operational overhead, patching, scaling complexity
  • Rejected: Managed service reduces toil, built-in HA

4. Aurora Serverless v2 Instead of Provisioned

  • Pros: Auto-scaling, pay-per-use
  • Cons: Less predictable costs, cold start delays, ACU pricing complexity
  • Decision: Use provisioned for predictable workloads, consider serverless for dev/staging

5. NoSQL-Only (DynamoDB) Instead of Relational

  • Pros: Unlimited scale, low latency
  • Cons: Complex queries difficult, transactions limited in scope, data modeling complexity
  • Rejected: Relational model better for business directory use case (joins, ACID)

Success Criteria

Performance: P99 search latency < 500ms, listing page load < 1s

Availability: 99.95% uptime, max 4.38 hours downtime/year

Scalability: Handle 10x traffic growth without architecture changes

Cost: Stay within \$6,000/month production budget (optimize to \$4,500)

Security: Pass security audit, zero critical vulnerabilities

Recovery: Achieve RTO < 2 hours, RPO < 5 minutes in DR tests


This comprehensive solution provides a production-ready, highly available online business directory platform following AWS Well-Architected Framework principles. The architecture balances performance, cost, and operational simplicity using managed AWS services, enabling rapid deployment and scalable growth.
