Aisalkyn Aidarova

Part 3: Decoupled Architecture

  1. Prepare the existing EC2 (source instance)
  • Confirm the instance is fully configured (packages, files, worker.py, configs).
  • Verify the file you care about exists (example):

    • ls /home/ubuntu
    • head /home/ubuntu/worker.py
  2. Create a custom AMI from the existing EC2
  • EC2 → Instances → select the configured instance (e.g., worker-server)
  • Actions → Image and templates → Create image
  • Name it (example: image-sqs-template)
  • Keep Reboot instance enabled (recommended)
  • Create image
  • EC2 → AMIs → wait until Status = Available
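The console steps above can also be scripted. Here is a boto3 sketch (the instance ID is a placeholder; the image name is the article's example):

```python
# Sketch (boto3): create the AMI and wait until it is Available.
# The instance ID is a placeholder -- substitute your own.

def build_create_image_params(instance_id: str, name: str = "image-sqs-template") -> dict:
    # NoReboot=False keeps "Reboot instance" enabled, as recommended above.
    return {"InstanceId": instance_id, "Name": name, "NoReboot": False}

def create_worker_ami(instance_id: str) -> str:
    # Requires AWS credentials; not executed here.
    import boto3
    ec2 = boto3.client("ec2", region_name="us-east-2")
    image_id = ec2.create_image(**build_create_image_params(instance_id))["ImageId"]
    # Block until Status = Available before using the AMI in a launch template.
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```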
  3. Create a Launch Template using that AMI
  • EC2 → Launch templates → Create launch template
  • Choose the AMI from My AMIs: image-sqs-template (example AMI: ami-00be6270b4bced645)
  • Set instance type (e.g., t2.small / t2.micro)
  • Select key pair (e.g., key)
  • Select security group (same one you used before)
  • Advanced details → IAM instance profile: select your role (e.g., worker-ec2-role)
  • Leave User data empty (because config is already baked into AMI)
  • Click Create launch template
  4. Create the Auto Scaling Group (ASG)
  • EC2 → Auto Scaling Groups → Create Auto Scaling group
  • Name the ASG (example: asg-ana)
  • Select your launch template (example: my-lc-template)
  • Version: Latest (important)
  • Choose VPC (default VPC is fine)
  • Select subnets in 2–3 AZs (e.g., us-east-2a, 2b, 2c)
  • Load balancer: No load balancer (for worker pattern)
  • Health checks: EC2 only
  • Grace period: 60 seconds
  5. Set capacity
  • Desired = 1
  • Min = 1
  • Max = 2
  • Scaling policies: None (for lab)
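The ASG settings above map directly onto a boto3 call. A sketch (names are the article's examples; subnet IDs are placeholders):

```python
# Sketch (boto3): create the ASG with the capacity and health-check
# settings described above. Subnet IDs are placeholders.

def build_asg_params(name: str, template_name: str, subnet_ids: list) -> dict:
    return {
        "AutoScalingGroupName": name,
        # "Version": "$Latest" matters: new template versions apply automatically.
        "LaunchTemplate": {"LaunchTemplateName": template_name, "Version": "$Latest"},
        "MinSize": 1,
        "MaxSize": 2,
        "DesiredCapacity": 1,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # subnets across 2-3 AZs
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 60,
    }

def create_asg(params: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("autoscaling", region_name="us-east-2").create_auto_scaling_group(**params)
```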
  6. Create ASG and verify
  • Click Create Auto Scaling group
  • EC2 → Instances → open the new instance created by ASG
  • Confirm:

    • Auto Scaling group name shows your ASG
    • AMI ID matches your custom AMI
    • IAM role attached correctly
  • SSH to new instance and verify config/files:

    • ls /home/ubuntu → confirm worker.py exists
  7. Optional (production best practice)
  • If you want worker.py to start automatically on every new ASG instance, create:

    • a systemd service, or
    • user-data bootstrap script.
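For the systemd option, a minimal unit file sketch (the path, user, and Python interpreter are assumptions based on the article's examples; adjust to your AMI):

```ini
# /etc/systemd/system/worker.service  (assumed path; adjust as needed)
[Unit]
Description=SQS worker
After=network-online.target
Wants=network-online.target

[Service]
User=ubuntu
ExecStart=/usr/bin/python3 /home/ubuntu/worker.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now worker.service`, then bake a fresh AMI so every instance the ASG launches starts the worker on boot.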

You currently have:

S3 → SNS → SQS → EC2 Worker (ASG)

Why Replace EC2 Worker With Lambda?

Your worker does this:

  • Wait for SQS message
  • Process file
  • Move file in S3

That is event-driven processing.

Lambda is designed exactly for:

Event-driven workloads.


EC2 Worker vs Lambda

EC2 Worker (What You Built)

Pros:

  • Full control
  • Long-running tasks possible
  • Custom OS-level configuration

Cons:

  • You manage servers
  • You patch OS
  • You manage scaling rules
  • You pay even when idle
  • Idle workers sit around doing nothing

Lambda

Pros:

  • No server management
  • Auto-scales automatically
  • Pay only per execution
  • Built-in SQS integration
  • No ASG needed

Cons:

  • 15-minute execution limit
  • Limited control over OS
  • Cold starts possible

When Production Teams Switch to Lambda

They switch when:

  • Workload is event-driven
  • Tasks are short-lived
  • Traffic is unpredictable
  • They want lower operational overhead

Your worker is a perfect Lambda candidate.


How Lambda Autoscaling Works

With SQS trigger:

Lambda automatically:

  1. Polls SQS
  2. Creates parallel executions
  3. Scales based on queue depth

If 1 message → 1 Lambda invocation
If 1000 messages → many concurrent Lambda executions

No ASG needed.

No instance management.


Real Technical Explanation

Lambda + SQS integration:

AWS manages:

  • Polling SQS
  • Batching messages
  • Concurrency control
  • Scaling workers automatically

Scaling happens based on:

ApproximateNumberOfMessagesVisible

It increases concurrency automatically.
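You can inspect that backlog yourself. A boto3 sketch (the queue URL is a placeholder; note the queue attribute is named `ApproximateNumberOfMessages`, while CloudWatch reports it as `ApproximateNumberOfMessagesVisible`):

```python
# Sketch: read the queue backlog that drives Lambda's scaling decision.

BACKLOG_ATTR = "ApproximateNumberOfMessages"

def get_backlog(queue_url: str) -> int:
    # Requires AWS credentials; queue_url is a placeholder.
    import boto3
    sqs = boto3.client("sqs", region_name="us-east-2")
    resp = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=[BACKLOG_ATTR])
    return int(resp["Attributes"][BACKLOG_ATTR])
```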


Example Scaling Comparison

EC2 ASG:

  • You configure scaling policy
  • Based on CPU or SQS metric
  • ASG launches new VM (takes 1–2 minutes)

Lambda:

  • No VM
  • No boot time
  • New execution in milliseconds
  • Fully managed scaling

Why Big Companies Use Lambda for Workers

Because:

  • No infrastructure management
  • No patching
  • No OS vulnerabilities
  • No idle cost
  • Perfect for background processing

For web servers → EC2 or ECS or Kubernetes
For background jobs → often Lambda


Your Architecture With Lambda

Instead of:

SQS → EC2 Worker

It becomes:

SQS → Lambda → S3

Much simpler.


Does Lambda Replace EC2 Completely?

No.

Lambda is best for:

  • Background jobs
  • Image processing
  • Notifications
  • API microservices (small)

Not good for:

  • Long-running processes
  • High-memory heavy tasks
  • Stateful services

Cost Example

If worker idle 23 hours/day:

EC2:
You pay 24 hours.

Lambda:
You pay only for execution time.
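Rough numbers make the gap concrete. The prices below are illustrative assumptions (approximately a t2.small on-demand rate and published Lambda rates; check current AWS pricing before relying on them):

```python
# Illustrative cost comparison -- prices are assumptions, not quotes.
EC2_HOURLY = 0.023                         # assumed t2.small on-demand $/hour
HOURS_PER_MONTH = 730

LAMBDA_PER_REQUEST = 0.20 / 1_000_000      # assumed $ per request
LAMBDA_PER_GB_SECOND = 0.0000166667        # assumed $ per GB-second

def ec2_monthly() -> float:
    # EC2 bills for every hour the instance runs, busy or idle.
    return EC2_HOURLY * HOURS_PER_MONTH

def lambda_monthly(invocations: int, seconds_each: float, memory_gb: float) -> float:
    # Lambda bills per request plus compute time actually used.
    gb_seconds = invocations * seconds_each * memory_gb
    return invocations * LAMBDA_PER_REQUEST + gb_seconds * LAMBDA_PER_GB_SECOND
```

With these assumptions, 100,000 one-second invocations at 256 MB cost well under a dollar a month, while the always-on instance costs the same whether it processes one file or a million.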


So When Should You Switch?

If:

  • Processing is short (<15 minutes)
  • Stateless
  • Event-driven
  • Spiky workload

Then Lambda is better.


In Production Architecture

Very common pattern:

Web Tier → EC2 / ALB
Background Jobs → Lambda
Streaming → Kinesis
Storage → S3


Important DevOps Skill

You should know both:

  • How to scale EC2 with ASG
  • How to build serverless with Lambda

Interviewers love when you compare both.

0) Confirm what you have now

You already have:

  • S3 bucket with folders/prefixes: uploads/, processed/
  • SNS topic wired to SQS (S3 event → SNS → SQS), or S3 events sent directly to SQS
  • SQS queue receiving messages
  • Worker logic that:
  1. reads SQS message
  2. gets S3 object key from event
  3. copies/moves object from uploads/ to processed/

Lambda will do the same, but AWS will poll SQS automatically.


1) Make sure your SQS message format is what Lambda will read

Your messages are likely SNS-wrapped S3 events (common pattern):

  • Lambda receives SQS record
  • record["body"] might be:

    • an SNS envelope JSON, with "Message" containing the actual S3 event JSON
    • or directly the S3 event JSON

We’ll support both formats in the Lambda code.


2) Create IAM Role for Lambda

2.1 Create a role

AWS Console → IAM → Roles → Create role

  • Trusted entity: AWS service
  • Use case: Lambda
  • Role name: lambda-sqs-s3-worker-role

2.2 Attach permissions (minimum required)

Attach (or create a custom policy) with:

A) CloudWatch logs

  • AWSLambdaBasicExecutionRole (managed policy)

B) Read from SQS

  • Managed policy: AWSLambdaSQSQueueExecutionRole (recommended), or custom SQS permissions:

    • sqs:ReceiveMessage
    • sqs:DeleteMessage
    • sqs:GetQueueAttributes
    • sqs:ChangeMessageVisibility

C) Access S3 objects
Custom policy example (replace bucket name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWriteSpecificBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
    },
    {
      "Sid": "S3ListBucketForPrefixOps",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME"]
    }
  ]
}

3) Create the Lambda function

AWS Console → Lambda → Create function

  • Author from scratch
  • Name: sqs-s3-worker
  • Runtime: Python 3.12 (or 3.11)
  • Execution role: Use existing role → lambda-sqs-s3-worker-role

4) Add environment variables

Lambda → Configuration → Environment variables:

  • REGION = us-east-2
  • BUCKET = YOUR_BUCKET_NAME
  • SRC_PREFIX = uploads/
  • DST_PREFIX = processed/

(You can hardcode, but env vars are cleaner.)


5) Paste Lambda code (supports SNS-wrapped + direct S3 events)

Lambda → Code → lambda_function.py:

import json
import os
import boto3
from urllib.parse import unquote_plus

REGION = os.environ.get("REGION", "us-east-2")
BUCKET = os.environ.get("BUCKET", "")
SRC_PREFIX = os.environ.get("SRC_PREFIX", "uploads/")
DST_PREFIX = os.environ.get("DST_PREFIX", "processed/")

s3 = boto3.client("s3", region_name=REGION)

def _extract_s3_records_from_sqs_body(body_str: str):
    """
    Returns list of S3 event records (dicts).
    Handles:
      - Direct S3 event JSON in SQS body
      - SNS envelope JSON in SQS body where body["Message"] is S3 event JSON string
    """
    # Try parse JSON
    try:
        body = json.loads(body_str)
    except Exception:
        return []

    # SNS envelope?
    if isinstance(body, dict) and "Message" in body:
        try:
            body = json.loads(body["Message"])
        except Exception:
            return []

    # Now expect S3 event-like structure
    if isinstance(body, dict) and "Records" in body and isinstance(body["Records"], list):
        return body["Records"]

    return []

def _move_object(bucket: str, src_key: str, dst_key: str):
    # Copy then delete to simulate "move"
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key
    )
    s3.delete_object(Bucket=bucket, Key=src_key)

def lambda_handler(event, context):
    """
    event is from SQS trigger and contains event["Records"] (SQS records)
    """
    if not BUCKET:
        raise ValueError("BUCKET env var is required")

    processed = 0

    # For SQS-triggered Lambda, AWS deletes messages automatically ONLY if handler succeeds.
    # If we raise an exception, batch will be retried.
    for sqs_record in event.get("Records", []):
        body_str = sqs_record.get("body", "")
        s3_records = _extract_s3_records_from_sqs_body(body_str)

        if not s3_records:
            # If message isn't in expected format, just skip processing.
            # If you want to drop it, you can "succeed" (do nothing).
            # If you want retries/DLQ, raise.
            continue

        for r in s3_records:
            if r.get("eventSource") != "aws:s3":
                continue

            b = r.get("s3", {}).get("bucket", {}).get("name")
            if b and b != BUCKET:
                # If your queue may contain events from other buckets, skip them
                continue

            raw_key = r.get("s3", {}).get("object", {}).get("key", "")
            src_key = unquote_plus(raw_key)

            # Only move files from SRC_PREFIX
            if not src_key.startswith(SRC_PREFIX):
                continue

            filename = src_key[len(SRC_PREFIX):]
            if not filename:
                continue

            dst_key = f"{DST_PREFIX}{filename}"

            _move_object(BUCKET, src_key, dst_key)
            processed += 1

    return {"status": "ok", "moved": processed}

Notes:

  • This is “copy + delete” to move objects (standard S3 pattern).
  • If you want strictness (bad message should go to DLQ), change continue to raise.
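Before wiring the trigger, you can sanity-check the envelope handling locally. This sketch inlines the same parsing logic as the handler's `_extract_s3_records_from_sqs_body` and feeds it both message shapes:

```python
import json

def extract_s3_records(body_str: str):
    # Same logic as the handler: unwrap an SNS envelope if present,
    # then return the S3 event's Records list (or [] if unrecognized).
    try:
        body = json.loads(body_str)
    except Exception:
        return []
    if isinstance(body, dict) and "Message" in body:
        try:
            body = json.loads(body["Message"])
        except Exception:
            return []
    if isinstance(body, dict) and isinstance(body.get("Records"), list):
        return body["Records"]
    return []

s3_event = {"Records": [{"eventSource": "aws:s3",
                         "s3": {"bucket": {"name": "my-bucket"},
                                "object": {"key": "uploads/t1.txt"}}}]}

direct = json.dumps(s3_event)                            # S3 -> SQS directly
wrapped = json.dumps({"Type": "Notification",
                      "Message": json.dumps(s3_event)})  # S3 -> SNS -> SQS

assert extract_s3_records(direct) == s3_event["Records"]
assert extract_s3_records(wrapped) == s3_event["Records"]
assert extract_s3_records("not json") == []
```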

6) Add SQS trigger to Lambda

Lambda → Configuration → Triggers → Add trigger

  • Trigger: SQS
  • Choose your queue (the same queue SNS sends to)
  • Settings:

    • Batch size: start with 1–5 (simple debugging)
    • Activate trigger: Enabled

Important: Lambda will now poll SQS for you.
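The same trigger can be created programmatically with `create_event_source_mapping`. A boto3 sketch (the queue ARN is a placeholder; the function name is the article's example):

```python
# Sketch (boto3): wire the SQS queue to the Lambda function.
# The queue ARN is a placeholder.

def build_mapping_params(queue_arn: str, function_name: str = "sqs-s3-worker") -> dict:
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": 5,        # small batches keep debugging simple
        "Enabled": True,
    }

def create_trigger(params: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("lambda", region_name="us-east-2").create_event_source_mapping(**params)
```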


7) Set SQS queue settings for Lambda (important production settings)

Go to SQS → your queue → Edit:

7.1 Visibility timeout

Set Visibility timeout > Lambda timeout
Example:

  • Lambda timeout = 30 seconds
  • SQS visibility timeout = 2 minutes (120 seconds)

Rule:

  • AWS recommends setting the visibility timeout to at least 6× the Lambda function timeout.

7.2 Dead-letter queue (DLQ)

Create a DLQ (recommended):

  • Create another SQS queue: your-queue-dlq
  • Set DLQ redrive policy on main queue:

    • Max receive count: 3 (or 5). If Lambda keeps failing a message, it ends up in the DLQ for investigation.
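The redrive policy is set as a JSON-string queue attribute. A boto3 sketch (the DLQ ARN and queue URL are placeholders):

```python
# Sketch (boto3): attach a redrive policy to the main queue.
import json

def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    # RedrivePolicy must be a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })}

def apply_redrive(queue_url: str, attrs: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("sqs", region_name="us-east-2").set_queue_attributes(
        QueueUrl=queue_url, Attributes=attrs)
```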

8) Configure Lambda timeout and memory

Lambda → Configuration → General configuration → Edit:

  • Timeout: 30 seconds (start)
  • Memory: 256 MB (start)

If you process big files or do heavy work, increase memory.


9) Testing (end-to-end)

9.1 Upload a file into uploads/

From web server or locally:

echo "hello" > t1.txt
aws s3 cp t1.txt s3://YOUR_BUCKET_NAME/uploads/t1.txt --region us-east-2

9.2 Watch Lambda logs

CloudWatch → Logs → Log groups → /aws/lambda/sqs-s3-worker

You should see function invocations.

9.3 Confirm S3 move

S3 → bucket → processed/ should now contain t1.txt


10) Cutover plan (no downtime)

If you currently still run EC2 worker, do this clean cutover:

  1. Stop EC2 worker process (so it doesn’t double-process)
  2. Ensure Lambda trigger is enabled
  3. Upload new files and confirm only Lambda processes them
  4. After stable, delete old worker ASG (optional)

Because SQS buffers messages, there is no downtime risk.

Your current pipeline:

User → CloudFront → S3 → SNS → SQS → Worker

That is only the file-processing subsystem, not the entire system.

In real production, that subsystem sits inside a much bigger architecture.

1️⃣ Full Production Architecture (Realistic)

Typical modern web system looks like this:

User (Browser / Mobile App)
        ↓
CloudFront (CDN)
        ↓
ALB (Load Balancer)
        ↓
Web / API Layer (EC2 / ECS / EKS)
        ↓
Application Logic
        ↓
Database (RDS / Aurora / DynamoDB)

Now your S3/SQS worker pipeline becomes a background processing layer attached to this.


2️⃣ Where Your File Processing Fits

Here’s the realistic architecture:

                    ┌───────────────────────┐
                    │        CloudFront     │
                    └─────────────┬─────────┘
                                  ↓
                          ┌───────────────┐
                          │  ALB          │
                          └──────┬────────┘
                                 ↓
                      ┌────────────────────┐
                      │  Web/API Backend   │
                      │  (EC2/ECS/EKS)     │
                      └──────┬─────────────┘
                             ↓
                         Database
                      (RDS / Aurora)

         ───────────── File Upload Path ─────────────

User → Web/API → S3 (private, uploads/)
                         ↓
                        SNS
                         ↓
                        SQS
                         ↓
                    Worker (Lambda / ASG)
                         ↓
                S3 (processed/)
                         ↓
                    CloudFront
                         ↓
                       User

Now it makes sense.


3️⃣ Let’s Break Each Layer

Frontend

  • React / Angular / Next.js app
  • Static files stored in S3
  • Served via CloudFront

CloudFront handles:

  • CDN
  • TLS
  • Caching
  • WAF
  • Rate limiting

Backend (API Layer)

This is where:

  • Authentication happens
  • Business logic runs
  • API endpoints exist
  • DB queries happen

Usually hosted on:

  • EC2 + ASG
  • ECS
  • EKS (Kubernetes)
  • Sometimes Lambda (serverless API)

Database Layer

Common production DBs:

  • RDS (Postgres, MySQL)
  • Aurora
  • DynamoDB
  • ElastiCache (Redis)

This stores:

  • Users
  • Orders
  • Metadata about files
  • Status of processing

Important:
S3 does NOT replace the database.
S3 stores files, not relational data.


Background Processing Layer (Your System)

This is where your S3 → SQS → Worker fits.

Used for:

  • Image resizing
  • Video transcoding
  • PDF parsing
  • Email sending
  • Report generation
  • AI processing
  • Data pipelines

This is called:

"Asynchronous processing tier"


4️⃣ Why Separate Background From Backend?

Because:

If image processing takes 5 seconds:

Without queue:
User waits 5 seconds → bad UX.

With queue:
User upload returns immediately.
Processing happens in background.

This improves:

  • Performance
  • Scalability
  • Stability

5️⃣ Real Company Example (E-commerce)

User uploads product image:

  1. Browser uploads to S3 (via signed URL)
  2. Backend stores metadata in DB
  3. S3 triggers SNS/SQS
  4. Worker resizes image
  5. Processed image stored
  6. CloudFront serves image globally

That is real architecture.
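Step 1 of that flow relies on a presigned URL issued by the backend. A boto3 sketch (the bucket and key-naming scheme are assumptions for illustration):

```python
# Sketch: backend issues a presigned PUT URL so the browser uploads
# file bytes straight to S3 (uploads/ prefix), skipping the web tier.
import uuid

SRC_PREFIX = "uploads/"

def build_upload_key(filename: str) -> str:
    # Random prefix avoids collisions when users upload the same filename.
    return f"{SRC_PREFIX}{uuid.uuid4().hex}-{filename}"

def presign_upload(bucket: str, key: str, expires: int = 300) -> str:
    # Requires AWS credentials; not executed here.
    import boto3
    s3 = boto3.client("s3", region_name="us-east-2")
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```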


6️⃣ Where Does Frontend Sit?

Two common patterns:

Pattern A – Static Frontend

CloudFront → S3 (static React build)

API calls go to:
CloudFront → ALB → Backend

Pattern B – Dynamic App

CloudFront → ALB → Backend → HTML generated


7️⃣ Where Does Database Sit?

Backend talks to DB.

Workers may also talk to DB to update:

  • File processing status
  • Job status
  • Analytics

But S3/SQS does NOT replace DB.


8️⃣ Full Enterprise Diagram

Users
  ↓
CloudFront
  ↓
ALB
  ↓
API Servers (ASG / ECS / EKS)
  ↓
RDS Database
  ↓
S3 (uploads)
  ↓
SNS
  ↓
SQS
  ↓
Lambda / Worker ASG
  ↓
S3 (processed)
  ↓
CloudFront (deliver files)

That is production-ready architecture.


9️⃣ What You Built So Far

You built:

✔ Background processing tier
✔ Decoupled system
✔ Auto Scaling
✔ Self-healing
✔ Event-driven architecture

Now you only need:

  • API layer
  • Database layer
  • CloudFront for frontend

10️⃣ This Is How Real Companies Organize Teams

  • Frontend team
  • Backend/API team
  • Data/DB team
  • Platform/DevOps team
  • Background jobs team

Your worker system belongs to the platform / async processing team.

🎬 High-Level Netflix-Style Architecture


🧠 Full Logical Flow (Simplified Enterprise Model)

Users (TV / Mobile / Browser)
        ↓
Route 53 (DNS)
        ↓
CloudFront (CDN)
        ↓
Global Load Balancer (ALB / NLB)
        ↓
Microservices Layer (EKS / ECS / EC2 ASG)
        ↓
Databases + Caches
        ↓
Async Systems (Kafka / SQS / EventBridge)
        ↓
Background Processing Workers
        ↓
S3 (Media, thumbnails, artifacts)

Now let’s break it down layer by layer.


1️⃣ User Layer

Users connect from:

  • Smart TVs
  • Phones
  • Browsers
  • Tablets

Requests hit:

Route 53 (DNS)

AWS's global DNS routing service.

Netflix uses:

  • Geo routing
  • Health checks
  • Multi-region failover

2️⃣ Edge Layer – CloudFront

CloudFront:

  • Caches video metadata
  • Caches thumbnails
  • Reduces backend load
  • Protects infrastructure

For video streaming, Netflix also uses its own CDN (Open Connect), but conceptually it’s like CloudFront.


3️⃣ Load Balancing Layer

ALB distributes traffic to:

  • Microservices
  • API services
  • Authentication services

This ensures:

  • High availability
  • Health checks
  • Zero downtime deployments

4️⃣ Microservices Layer (Core Backend)

Netflix runs thousands of microservices.

In AWS terms this would be:

  • EKS (Kubernetes)
  • ECS
  • Or EC2 with ASG


Each service handles:

  • Authentication
  • Recommendations
  • Playback service
  • Billing
  • User profiles
  • Content metadata

This layer is horizontally scalable.


5️⃣ Database Layer

Used for:

  • User data
  • Viewing history
  • Content metadata
  • Recommendations

Examples:

  • Amazon Aurora
  • Amazon DynamoDB
  • Amazon ElastiCache

Netflix heavily uses:

  • Cassandra (NoSQL)
  • Redis
  • Aurora-like databases

6️⃣ Asynchronous Messaging Layer

This is where your SQS/SNS idea lives.

Large systems use:

  • Kafka
  • SQS
  • EventBridge
  • Streaming pipelines


Used for:

  • Viewing analytics
  • Logging events
  • Recommendation updates
  • Background tasks

This prevents synchronous overload.


7️⃣ Background Workers

Workers process:

  • Video encoding
  • Thumbnail generation
  • Recommendation training
  • Analytics pipelines

Could run on:

  • Lambda
  • Kubernetes Jobs
  • EC2 ASG workers

Exactly like your system.


8️⃣ Storage Layer

Amazon Simple Storage Service

Stores:

  • Video files
  • Static assets
  • Processed media
  • Logs

S3 is the core storage layer.


9️⃣ Observability Layer (Critical in Enterprise)

Netflix-level systems require:

  • Metrics
  • Logs
  • Tracing
  • Alerts

AWS equivalents:

  • CloudWatch
  • X-Ray
  • Prometheus
  • Grafana

Without this, large systems collapse.


🔟 Multi-Region Architecture

Enterprise systems do NOT run in one region.

They use:

Region A (Primary)
Region B (Failover)
Global Traffic Management

If us-east-1 fails → traffic shifts to us-west-2.

This is called:
Active-Active or Active-Passive architecture.


🧩 Where Your Current Project Fits

Your system represents:

✔ Background processing layer
✔ Event-driven pipeline
✔ Scalable worker tier
✔ Auto-healing infrastructure

It is one subsystem inside enterprise design.


🎯 Key Enterprise Concepts You Should Understand

| Layer | Purpose |
| --- | --- |
| CloudFront | Edge caching & security |
| ALB | Traffic distribution |
| EKS/ECS | Microservices |
| RDS/Dynamo | Data persistence |
| SQS/Kafka | Decoupling |
| Workers | Async compute |
| S3 | Object storage |
| Multi-Region | Disaster recovery |

🧠 Why This Matters for You

When interviewer asks:

"Explain large-scale architecture for streaming platform"

You can describe:

  • Edge layer
  • Compute layer
  • Data layer
  • Async layer
  • Scaling strategies
  • Failover strategies

That’s senior-level thinking.

🔴 1️⃣ Coupled Architecture (Tightly Coupled)


What This Means

User → Web Server → Worker → Database

Web server calls worker directly.

If worker is:

  • Slow ❌
  • Crashed ❌
  • Overloaded ❌

Then:

  • Web server fails
  • User request fails
  • System becomes unstable

Problems in Coupled Systems

  • No buffering
  • No independent scaling
  • Cascading failures
  • Hard deployments
  • Tight dependencies

Example:
If image processing takes 5 seconds, the user waits 5 seconds.

If the worker crashes, the web tier crashes too.


🟢 2️⃣ Decoupled Architecture (Loosely Coupled)


What This Means

User → Web → S3
            ↓
           SQS
            ↓
         Worker

Web does NOT call worker directly.

Instead:

  • Web drops a message in queue.
  • Worker processes independently.

Benefits

✔ Web responds immediately
✔ Worker can crash — system still works
✔ Queue buffers traffic spikes
✔ Independent scaling
✔ Fault isolation

If worker crashes:

  • Messages wait safely in SQS
  • ASG or Lambda recreates worker
  • Processing resumes

No user impact.


🔥 Real Production Example

Coupled:

Upload → Resize image immediately → Return response

Decoupled:

Upload → Return 200 OK
         ↓
      Resize async in background

That is enterprise-grade design.


⚖️ Simple Comparison Table

| Feature | Coupled | Decoupled |
| --- | --- | --- |
| Direct dependency | Yes | No |
| Queue buffering | No | Yes |
| Failure isolation | No | Yes |
| Horizontal scaling | Hard | Easy |
| Production safe | No | Yes |

🧠 In Interviews

If asked:

"What is decoupling?"

Strong answer:

"Decoupling means separating services using asynchronous communication like queues so that failure or scaling of one service does not directly impact others."


🎯 Your Current Project

You built:

S3 → SNS → SQS → Worker

That is a decoupled system.

You moved from:

Web → Worker (direct)

to:

Web → Queue → Worker (indirect)

That is architectural maturity.
