Aisalkyn Aidarova

Part 3: Decoupled Architecture

  1. Prepare the existing EC2 (source instance)
  • Confirm the instance is fully configured (packages, files, worker.py, configs).
  • Verify the file you care about exists (example):

    • ls /home/ubuntu
    • head /home/ubuntu/worker.py
  2. Create a custom AMI from the existing EC2
  • EC2 → Instances → select the configured instance (e.g., worker-server)
  • Actions → Image and templates → Create image
  • Name it (example: image-sqs-template)
  • Keep Reboot instance enabled (recommended)
  • Create image
  • EC2 → AMIs → wait until Status = Available
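The console steps above can also be scripted. Here is a boto3 sketch (the instance ID is a placeholder; the image name is the article's example):

```python
# Sketch (boto3): create the AMI and wait until it is Available.
# The instance ID is a placeholder -- substitute your own.

def build_create_image_params(instance_id: str, name: str = "image-sqs-template") -> dict:
    # NoReboot=False keeps "Reboot instance" enabled, as recommended above.
    return {"InstanceId": instance_id, "Name": name, "NoReboot": False}

def create_worker_ami(instance_id: str) -> str:
    # Requires AWS credentials; not executed here.
    import boto3
    ec2 = boto3.client("ec2", region_name="us-east-2")
    image_id = ec2.create_image(**build_create_image_params(instance_id))["ImageId"]
    # Block until Status = Available before using the AMI in a launch template.
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```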
  3. Create a Launch Template using that AMI
  • EC2 → Launch templates → Create launch template
  • Choose the AMI from My AMIs: image-sqs-template (example AMI: ami-00be6270b4bced645)
  • Set instance type (e.g., t2.small / t2.micro)
  • Select key pair (e.g., key)
  • Select security group (same one you used before)
  • Advanced details → IAM instance profile: select your role (e.g., worker-ec2-role)
  • Leave User data empty (because config is already baked into AMI)
  • Click Create launch template
  4. Create the Auto Scaling Group (ASG)
  • EC2 → Auto Scaling Groups → Create Auto Scaling group
  • Name the ASG (example: asg-ana)
  • Select your launch template (example: my-lc-template)
  • Version: Latest (important)
  • Choose VPC (default VPC is fine)
  • Select subnets in 2–3 AZs (e.g., us-east-2a, 2b, 2c)
  • Load balancer: No load balancer (for worker pattern)
  • Health checks: EC2 only
  • Grace period: 60 seconds
  5. Set capacity
  • Desired = 1
  • Min = 1
  • Max = 2
  • Scaling policies: None (for lab)
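The ASG settings above map directly onto a boto3 call. A sketch (names are the article's examples; subnet IDs are placeholders):

```python
# Sketch (boto3): create the ASG with the capacity and health-check
# settings described above. Subnet IDs are placeholders.

def build_asg_params(name: str, template_name: str, subnet_ids: list) -> dict:
    return {
        "AutoScalingGroupName": name,
        # "Version": "$Latest" matters: new template versions apply automatically.
        "LaunchTemplate": {"LaunchTemplateName": template_name, "Version": "$Latest"},
        "MinSize": 1,
        "MaxSize": 2,
        "DesiredCapacity": 1,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # subnets across 2-3 AZs
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 60,
    }

def create_asg(params: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("autoscaling", region_name="us-east-2").create_auto_scaling_group(**params)
```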
  6. Create ASG and verify
  • Click Create Auto Scaling group
  • EC2 → Instances → open the new instance created by ASG
  • Confirm:

    • Auto Scaling group name shows your ASG
    • AMI ID matches your custom AMI
    • IAM role attached correctly
  • SSH to new instance and verify config/files:

    • ls /home/ubuntu → confirm worker.py exists
  7. Optional (production best practice)
  • If you want worker.py to start automatically on every new ASG instance, create:

    • a systemd service, or
    • user-data bootstrap script.
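For the systemd option, a minimal unit file sketch (the path, user, and Python interpreter are assumptions based on the article's examples; adjust to your AMI):

```ini
# /etc/systemd/system/worker.service  (assumed path; adjust as needed)
[Unit]
Description=SQS worker
After=network-online.target
Wants=network-online.target

[Service]
User=ubuntu
ExecStart=/usr/bin/python3 /home/ubuntu/worker.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now worker.service`, then bake a fresh AMI so every instance the ASG launches starts the worker on boot.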

You currently have:

S3 → SNS → SQS → EC2 Worker (ASG)

Why Replace EC2 Worker With Lambda?

Your worker does this:

  • Wait for SQS message
  • Process file
  • Move file in S3

That is event-driven processing.

Lambda is designed exactly for:

Event-driven workloads.


EC2 Worker vs Lambda

EC2 Worker (What You Built)

Pros:

  • Full control
  • Long-running tasks possible
  • Custom OS-level configuration

Cons:

  • You manage servers
  • You patch OS
  • You manage scaling rules
  • You pay even when idle
  • Idle workers sit around doing nothing

Lambda

Pros:

  • No server management
  • Auto-scales automatically
  • Pay only per execution
  • Built-in SQS integration
  • No ASG needed

Cons:

  • 15-minute execution limit
  • Limited control over OS
  • Cold starts possible

When Production Teams Switch to Lambda

They switch when:

  • Workload is event-driven
  • Tasks are short-lived
  • Traffic is unpredictable
  • They want lower operational overhead

Your worker is a perfect Lambda candidate.


How Lambda Autoscaling Works

With SQS trigger:

Lambda automatically:

  1. Polls SQS
  2. Creates parallel executions
  3. Scales based on queue depth

If 1 message → 1 Lambda invocation
If 1000 messages → many concurrent Lambda executions

No ASG needed.

No instance management.


Real Technical Explanation

Lambda + SQS integration:

AWS manages:

  • Polling SQS
  • Batching messages
  • Concurrency control
  • Scaling workers automatically

Scaling happens based on:

ApproximateNumberOfMessagesVisible

It increases concurrency automatically.
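You can inspect that backlog yourself. A boto3 sketch (the queue URL is a placeholder; note the queue attribute is named `ApproximateNumberOfMessages`, while CloudWatch reports it as `ApproximateNumberOfMessagesVisible`):

```python
# Sketch: read the queue backlog that drives Lambda's scaling decision.

BACKLOG_ATTR = "ApproximateNumberOfMessages"

def get_backlog(queue_url: str) -> int:
    # Requires AWS credentials; queue_url is a placeholder.
    import boto3
    sqs = boto3.client("sqs", region_name="us-east-2")
    resp = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=[BACKLOG_ATTR])
    return int(resp["Attributes"][BACKLOG_ATTR])
```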


Example Scaling Comparison

EC2 ASG:

  • You configure scaling policy
  • Based on CPU or SQS metric
  • ASG launches new VM (takes 1–2 minutes)

Lambda:

  • No VM
  • No boot time
  • New execution in milliseconds
  • Fully managed scaling

Why Big Companies Use Lambda for Workers

Because:

  • No infrastructure management
  • No patching
  • No OS vulnerabilities
  • No idle cost
  • Perfect for background processing

For web servers → EC2 or ECS or Kubernetes
For background jobs → often Lambda


Your Architecture With Lambda

Instead of:

SQS → EC2 Worker

It becomes:

SQS → Lambda → S3

Much simpler.


Does Lambda Replace EC2 Completely?

No.

Lambda is best for:

  • Background jobs
  • Image processing
  • Notifications
  • API microservices (small)

Not good for:

  • Long-running processes
  • High-memory heavy tasks
  • Stateful services

Cost Example

If worker idle 23 hours/day:

EC2:
You pay 24 hours.

Lambda:
You pay only for execution time.
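Rough numbers make the gap concrete. The prices below are illustrative assumptions (approximately a t2.small on-demand rate and published Lambda rates; check current AWS pricing before relying on them):

```python
# Illustrative cost comparison -- prices are assumptions, not quotes.
EC2_HOURLY = 0.023                         # assumed t2.small on-demand $/hour
HOURS_PER_MONTH = 730

LAMBDA_PER_REQUEST = 0.20 / 1_000_000      # assumed $ per request
LAMBDA_PER_GB_SECOND = 0.0000166667        # assumed $ per GB-second

def ec2_monthly() -> float:
    # EC2 bills for every hour the instance runs, busy or idle.
    return EC2_HOURLY * HOURS_PER_MONTH

def lambda_monthly(invocations: int, seconds_each: float, memory_gb: float) -> float:
    # Lambda bills per request plus compute time actually used.
    gb_seconds = invocations * seconds_each * memory_gb
    return invocations * LAMBDA_PER_REQUEST + gb_seconds * LAMBDA_PER_GB_SECOND
```

With these assumptions, 100,000 one-second invocations at 256 MB cost well under a dollar a month, while the always-on instance costs the same whether it processes one file or a million.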


So When Should You Switch?

If:

  • Processing is short (<15 minutes)
  • Stateless
  • Event-driven
  • Spiky workload

Then Lambda is better.


In Production Architecture

Very common pattern:

Web Tier → EC2 / ALB
Background Jobs → Lambda
Streaming → Kinesis
Storage → S3


Important DevOps Skill

You should know both:

  • How to scale EC2 with ASG
  • How to build serverless with Lambda

Interviewers love when you compare both.

0) Confirm what you have now

You already have:

  • S3 bucket with folders/prefixes: uploads/, processed/
  • SNS topic wired to SQS (S3 event → SNS → SQS), or S3 events sent directly to SQS
  • SQS queue receiving messages
  • Worker logic that:
  1. reads SQS message
  2. gets S3 object key from event
  3. copies/moves object from uploads/ to processed/

Lambda will do the same, but AWS will poll SQS automatically.


1) Make sure your SQS message format is what Lambda will read

Your messages are likely SNS-wrapped S3 events (common pattern):

  • Lambda receives SQS record
  • record["body"] might be:

    • an SNS envelope JSON, with "Message" containing the actual S3 event JSON
    • or directly the S3 event JSON

We’ll support both formats in the Lambda code.


2) Create IAM Role for Lambda

2.1 Create a role

AWS Console → IAM → Roles → Create role

  • Trusted entity: AWS service
  • Use case: Lambda
  • Role name: lambda-sqs-s3-worker-role

2.2 Attach permissions (minimum required)

Attach (or create a custom policy) with:

A) CloudWatch logs

  • AWSLambdaBasicExecutionRole (managed policy)

B) Read from SQS

  • Managed policy: AWSLambdaSQSQueueExecutionRole (recommended), or custom SQS permissions:

    • sqs:ReceiveMessage
    • sqs:DeleteMessage
    • sqs:GetQueueAttributes
    • sqs:ChangeMessageVisibility

C) Access S3 objects
Custom policy example (replace bucket name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWriteSpecificBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
    },
    {
      "Sid": "S3ListBucketForPrefixOps",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME"]
    }
  ]
}

3) Create the Lambda function

AWS Console → Lambda → Create function

  • Author from scratch
  • Name: sqs-s3-worker
  • Runtime: Python 3.12 (or 3.11)
  • Execution role: Use existing role → lambda-sqs-s3-worker-role

4) Add environment variables

Lambda → Configuration → Environment variables:

  • REGION = us-east-2
  • BUCKET = YOUR_BUCKET_NAME
  • SRC_PREFIX = uploads/
  • DST_PREFIX = processed/

(You can hardcode, but env vars are cleaner.)


5) Paste Lambda code (supports SNS-wrapped + direct S3 events)

Lambda → Code → lambda_function.py:

import json
import os
import boto3
from urllib.parse import unquote_plus

REGION = os.environ.get("REGION", "us-east-2")
BUCKET = os.environ.get("BUCKET", "")
SRC_PREFIX = os.environ.get("SRC_PREFIX", "uploads/")
DST_PREFIX = os.environ.get("DST_PREFIX", "processed/")

s3 = boto3.client("s3", region_name=REGION)

def _extract_s3_records_from_sqs_body(body_str: str):
    """
    Returns list of S3 event records (dicts).
    Handles:
      - Direct S3 event JSON in SQS body
      - SNS envelope JSON in SQS body where body["Message"] is S3 event JSON string
    """
    # Try parse JSON
    try:
        body = json.loads(body_str)
    except Exception:
        return []

    # SNS envelope?
    if isinstance(body, dict) and "Message" in body:
        try:
            body = json.loads(body["Message"])
        except Exception:
            return []

    # Now expect S3 event-like structure
    if isinstance(body, dict) and "Records" in body and isinstance(body["Records"], list):
        return body["Records"]

    return []

def _move_object(bucket: str, src_key: str, dst_key: str):
    # Copy then delete to simulate "move"
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key
    )
    s3.delete_object(Bucket=bucket, Key=src_key)

def lambda_handler(event, context):
    """
    event is from SQS trigger and contains event["Records"] (SQS records)
    """
    if not BUCKET:
        raise ValueError("BUCKET env var is required")

    processed = 0

    # For SQS-triggered Lambda, AWS deletes messages automatically ONLY if handler succeeds.
    # If we raise an exception, batch will be retried.
    for sqs_record in event.get("Records", []):
        body_str = sqs_record.get("body", "")
        s3_records = _extract_s3_records_from_sqs_body(body_str)

        if not s3_records:
            # If message isn't in expected format, just skip processing.
            # If you want to drop it, you can "succeed" (do nothing).
            # If you want retries/DLQ, raise.
            continue

        for r in s3_records:
            if r.get("eventSource") != "aws:s3":
                continue

            b = r.get("s3", {}).get("bucket", {}).get("name")
            if b and b != BUCKET:
                # If your queue may contain events from other buckets, skip them
                continue

            raw_key = r.get("s3", {}).get("object", {}).get("key", "")
            src_key = unquote_plus(raw_key)

            # Only move files from SRC_PREFIX
            if not src_key.startswith(SRC_PREFIX):
                continue

            filename = src_key[len(SRC_PREFIX):]
            if not filename:
                continue

            dst_key = f"{DST_PREFIX}{filename}"

            _move_object(BUCKET, src_key, dst_key)
            processed += 1

    return {"status": "ok", "moved": processed}

Notes:

  • This is “copy + delete” to move objects (standard S3 pattern).
  • If you want strictness (bad message should go to DLQ), change continue to raise.
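Before wiring the trigger, you can sanity-check the envelope handling locally. This sketch inlines the same parsing logic as the handler's `_extract_s3_records_from_sqs_body` and feeds it both message shapes:

```python
import json

def extract_s3_records(body_str: str):
    # Same logic as the handler: unwrap an SNS envelope if present,
    # then return the S3 event's Records list (or [] if unrecognized).
    try:
        body = json.loads(body_str)
    except Exception:
        return []
    if isinstance(body, dict) and "Message" in body:
        try:
            body = json.loads(body["Message"])
        except Exception:
            return []
    if isinstance(body, dict) and isinstance(body.get("Records"), list):
        return body["Records"]
    return []

s3_event = {"Records": [{"eventSource": "aws:s3",
                         "s3": {"bucket": {"name": "my-bucket"},
                                "object": {"key": "uploads/t1.txt"}}}]}

direct = json.dumps(s3_event)                            # S3 -> SQS directly
wrapped = json.dumps({"Type": "Notification",
                      "Message": json.dumps(s3_event)})  # S3 -> SNS -> SQS

assert extract_s3_records(direct) == s3_event["Records"]
assert extract_s3_records(wrapped) == s3_event["Records"]
assert extract_s3_records("not json") == []
```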

6) Add SQS trigger to Lambda

Lambda → Configuration → Triggers → Add trigger

  • Trigger: SQS
  • Choose your queue (the same queue SNS sends to)
  • Settings:

    • Batch size: start with 1–5 (simple debugging)
    • Activate trigger: Enabled

Important: Lambda will now poll SQS for you.
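The same trigger can be created programmatically with `create_event_source_mapping`. A boto3 sketch (the queue ARN is a placeholder; the function name is the article's example):

```python
# Sketch (boto3): wire the SQS queue to the Lambda function.
# The queue ARN is a placeholder.

def build_mapping_params(queue_arn: str, function_name: str = "sqs-s3-worker") -> dict:
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": 5,        # small batches keep debugging simple
        "Enabled": True,
    }

def create_trigger(params: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("lambda", region_name="us-east-2").create_event_source_mapping(**params)
```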


7) Set SQS queue settings for Lambda (important production settings)

Go to SQS → your queue → Edit:

7.1 Visibility timeout

Set Visibility timeout > Lambda timeout
Example:

  • Lambda timeout = 30 seconds
  • SQS visibility timeout = 2 minutes (120 seconds)

Rule:

  • AWS recommends setting the visibility timeout to at least 6× the Lambda function timeout.

7.2 Dead-letter queue (DLQ)

Create a DLQ (recommended):

  • Create another SQS queue: your-queue-dlq
  • Set DLQ redrive policy on main queue:

    • Max receive count: 3 (or 5). If Lambda keeps failing a message, it ends up in the DLQ for investigation.
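The redrive policy is set as a JSON-string queue attribute. A boto3 sketch (the DLQ ARN and queue URL are placeholders):

```python
# Sketch (boto3): attach a redrive policy to the main queue.
import json

def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    # RedrivePolicy must be a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })}

def apply_redrive(queue_url: str, attrs: dict) -> None:
    # Requires AWS credentials; not executed here.
    import boto3
    boto3.client("sqs", region_name="us-east-2").set_queue_attributes(
        QueueUrl=queue_url, Attributes=attrs)
```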

8) Configure Lambda timeout and memory

Lambda → Configuration → General configuration → Edit:

  • Timeout: 30 seconds (start)
  • Memory: 256 MB (start)

If you process big files or do heavy work, increase memory.


9) Testing (end-to-end)

9.1 Upload a file into uploads/

From web server or locally:

echo "hello" > t1.txt
aws s3 cp t1.txt s3://YOUR_BUCKET_NAME/uploads/t1.txt --region us-east-2

9.2 Watch Lambda logs

CloudWatch → Logs → Log groups → /aws/lambda/sqs-s3-worker

You should see function invocations.

9.3 Confirm S3 move

S3 → bucket → processed/ should now contain t1.txt


10) Cutover plan (no downtime)

If you currently still run EC2 worker, do this clean cutover:

  1. Stop EC2 worker process (so it doesn’t double-process)
  2. Ensure Lambda trigger is enabled
  3. Upload new files and confirm only Lambda processes them
  4. After stable, delete old worker ASG (optional)

Because SQS buffers messages, there is no downtime risk.

Your current pipeline:

User → CloudFront → S3 → SNS → SQS → Worker

That is only the file-processing subsystem, not the entire system.

In real production, that subsystem sits inside a much bigger architecture.

1️⃣ Full Production Architecture (Realistic)

Typical modern web system looks like this:

User (Browser / Mobile App)
        ↓
CloudFront (CDN)
        ↓
ALB (Load Balancer)
        ↓
Web / API Layer (EC2 / ECS / EKS)
        ↓
Application Logic
        ↓
Database (RDS / Aurora / DynamoDB)

Now your S3/SQS worker pipeline becomes a background processing layer attached to this.


2️⃣ Where Your File Processing Fits

Here’s the realistic architecture:

                    ┌───────────────────────┐
                    │        CloudFront     │
                    └─────────────┬─────────┘
                                  ↓
                          ┌───────────────┐
                          │  ALB          │
                          └──────┬────────┘
                                 ↓
                      ┌────────────────────┐
                      │  Web/API Backend   │
                      │  (EC2/ECS/EKS)     │
                      └──────┬─────────────┘
                             ↓
                         Database
                      (RDS / Aurora)

         ───────────── File Upload Path ─────────────

User → Web/API → S3 (private, uploads/)
                         ↓
                        SNS
                         ↓
                        SQS
                         ↓
                    Worker (Lambda / ASG)
                         ↓
                S3 (processed/)
                         ↓
                    CloudFront
                         ↓
                       User

Now it makes sense.


3️⃣ Let’s Break Each Layer

Frontend

  • React / Angular / Next.js app
  • Static files stored in S3
  • Served via CloudFront

CloudFront handles:

  • CDN
  • TLS
  • Caching
  • WAF
  • Rate limiting

Backend (API Layer)

This is where:

  • Authentication happens
  • Business logic runs
  • API endpoints exist
  • DB queries happen

Usually hosted on:

  • EC2 + ASG
  • ECS
  • EKS (Kubernetes)
  • Sometimes Lambda (serverless API)

Database Layer

Common production DBs:

  • RDS (Postgres, MySQL)
  • Aurora
  • DynamoDB
  • ElastiCache (Redis)

This stores:

  • Users
  • Orders
  • Metadata about files
  • Status of processing

Important:
S3 does NOT replace the database.
S3 stores files, not relational data.


Background Processing Layer (Your System)

This is where your S3 → SQS → Worker fits.

Used for:

  • Image resizing
  • Video transcoding
  • PDF parsing
  • Email sending
  • Report generation
  • AI processing
  • Data pipelines

This is called:

"Asynchronous processing tier"


4️⃣ Why Separate Background From Backend?

Because:

If image processing takes 5 seconds:

Without queue:
User waits 5 seconds → bad UX.

With queue:
User upload returns immediately.
Processing happens in background.

This improves:

  • Performance
  • Scalability
  • Stability

5️⃣ Real Company Example (E-commerce)

User uploads product image:

  1. Browser uploads to S3 (via signed URL)
  2. Backend stores metadata in DB
  3. S3 triggers SNS/SQS
  4. Worker resizes image
  5. Processed image stored
  6. CloudFront serves image globally

That is real architecture.
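Step 1 of that flow relies on a presigned URL issued by the backend. A boto3 sketch (the bucket and key-naming scheme are assumptions for illustration):

```python
# Sketch: backend issues a presigned PUT URL so the browser uploads
# file bytes straight to S3 (uploads/ prefix), skipping the web tier.
import uuid

SRC_PREFIX = "uploads/"

def build_upload_key(filename: str) -> str:
    # Random prefix avoids collisions when users upload the same filename.
    return f"{SRC_PREFIX}{uuid.uuid4().hex}-{filename}"

def presign_upload(bucket: str, key: str, expires: int = 300) -> str:
    # Requires AWS credentials; not executed here.
    import boto3
    s3 = boto3.client("s3", region_name="us-east-2")
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```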


6️⃣ Where Does Frontend Sit?

Two common patterns:

Pattern A – Static Frontend

CloudFront → S3 (static React build)

API calls go to:
CloudFront → ALB → Backend

Pattern B – Dynamic App

CloudFront → ALB → Backend → HTML generated


7️⃣ Where Does Database Sit?

Backend talks to DB.

Workers may also talk to DB to update:

  • File processing status
  • Job status
  • Analytics

But S3/SQS does NOT replace DB.


8️⃣ Full Enterprise Diagram

Users
  ↓
CloudFront
  ↓
ALB
  ↓
API Servers (ASG / ECS / EKS)
  ↓
RDS Database
  ↓
S3 (uploads)
  ↓
SNS
  ↓
SQS
  ↓
Lambda / Worker ASG
  ↓
S3 (processed)
  ↓
CloudFront (deliver files)

That is production-ready architecture.


9️⃣ What You Built So Far

You built:

✔ Background processing tier
✔ Decoupled system
✔ Auto Scaling
✔ Self-healing
✔ Event-driven architecture

Now you only need:

  • API layer
  • Database layer
  • CloudFront for frontend

10️⃣ This Is How Real Companies Organize Teams

  • Frontend team
  • Backend/API team
  • Data/DB team
  • Platform/DevOps team
  • Background jobs team

Your worker system belongs to the platform / async processing team.

🎬 High-Level Netflix-Style Architecture


🧠 Full Logical Flow (Simplified Enterprise Model)

Users (TV / Mobile / Browser)
        ↓
Route 53 (DNS)
        ↓
CloudFront (CDN)
        ↓
Global Load Balancer (ALB / NLB)
        ↓
Microservices Layer (EKS / ECS / EC2 ASG)
        ↓
Databases + Caches
        ↓
Async Systems (Kafka / SQS / EventBridge)
        ↓
Background Processing Workers
        ↓
S3 (Media, thumbnails, artifacts)

Now let’s break it down layer by layer.


1️⃣ User Layer

Users connect from:

  • Smart TVs
  • Phones
  • Browsers
  • Tablets

Requests hit:

Route 53 (DNS)

AWS's global DNS routing service.

Netflix uses:

  • Geo routing
  • Health checks
  • Multi-region failover

2️⃣ Edge Layer – CloudFront

CloudFront:

  • Caches video metadata
  • Caches thumbnails
  • Reduces backend load
  • Protects infrastructure

For video streaming, Netflix also uses its own CDN (Open Connect), but conceptually it’s like CloudFront.


3️⃣ Load Balancing Layer

ALB distributes traffic to:

  • Microservices
  • API services
  • Authentication services

This ensures:

  • High availability
  • Health checks
  • Zero downtime deployments

4️⃣ Microservices Layer (Core Backend)

Netflix runs thousands of microservices.

In AWS terms this would be:

  • EKS (Kubernetes)
  • ECS
  • Or EC2 with ASG


Each service handles:

  • Authentication
  • Recommendations
  • Playback service
  • Billing
  • User profiles
  • Content metadata

This layer is horizontally scalable.


5️⃣ Database Layer

Used for:

  • User data
  • Viewing history
  • Content metadata
  • Recommendations

Examples:

  • Amazon Aurora
  • Amazon DynamoDB
  • Amazon ElastiCache

Netflix heavily uses:

  • Cassandra (NoSQL)
  • Redis
  • Aurora-like databases

6️⃣ Asynchronous Messaging Layer

This is where your SQS/SNS idea lives.

Large systems use:

  • Kafka
  • SQS
  • EventBridge
  • Streaming pipelines


Used for:

  • Viewing analytics
  • Logging events
  • Recommendation updates
  • Background tasks

This prevents synchronous overload.


7️⃣ Background Workers

Workers process:

  • Video encoding
  • Thumbnail generation
  • Recommendation training
  • Analytics pipelines

Could run on:

  • Lambda
  • Kubernetes Jobs
  • EC2 ASG workers

Exactly like your system.


8️⃣ Storage Layer

Amazon Simple Storage Service

Stores:

  • Video files
  • Static assets
  • Processed media
  • Logs

S3 is the core storage layer.


9️⃣ Observability Layer (Critical in Enterprise)

Netflix-level systems require:

  • Metrics
  • Logs
  • Tracing
  • Alerts

AWS equivalents:

  • CloudWatch
  • X-Ray
  • Prometheus
  • Grafana

Without this, large systems collapse.


🔟 Multi-Region Architecture

Enterprise systems do NOT run in one region.

They use:

Region A (Primary)
Region B (Failover)
Global Traffic Management

If us-east-1 fails → traffic shifts to us-west-2.

This is called:
Active-Active or Active-Passive architecture.


🧩 Where Your Current Project Fits

Your system represents:

✔ Background processing layer
✔ Event-driven pipeline
✔ Scalable worker tier
✔ Auto-healing infrastructure

It is one subsystem inside enterprise design.


🎯 Key Enterprise Concepts You Should Understand

| Layer | Purpose |
| --- | --- |
| CloudFront | Edge caching & security |
| ALB | Traffic distribution |
| EKS/ECS | Microservices |
| RDS/Dynamo | Data persistence |
| SQS/Kafka | Decoupling |
| Workers | Async compute |
| S3 | Object storage |
| Multi-Region | Disaster recovery |

🧠 Why This Matters for You

When interviewer asks:

"Explain large-scale architecture for streaming platform"

You can describe:

  • Edge layer
  • Compute layer
  • Data layer
  • Async layer
  • Scaling strategies
  • Failover strategies

That’s senior-level thinking.

🔴 1️⃣ Coupled Architecture (Tightly Coupled)


What This Means

User → Web Server → Worker → Database

Web server calls worker directly.

If worker is:

  • Slow ❌
  • Crashed ❌
  • Overloaded ❌

Then:

  • Web server fails
  • User request fails
  • System becomes unstable

Problems in Coupled Systems

  • No buffering
  • No independent scaling
  • Cascading failures
  • Hard deployments
  • Tight dependencies

Example:
If image processing takes 5 seconds, the user waits 5 seconds.

If the worker crashes, the web tier crashes too.


🟢 2️⃣ Decoupled Architecture (Loosely Coupled)


What This Means

User → Web → S3
            ↓
           SQS
            ↓
         Worker

Web does NOT call worker directly.

Instead:

  • Web drops a message in queue.
  • Worker processes independently.

Benefits

✔ Web responds immediately
✔ Worker can crash — system still works
✔ Queue buffers traffic spikes
✔ Independent scaling
✔ Fault isolation

If worker crashes:

  • Messages wait safely in SQS
  • ASG or Lambda recreates worker
  • Processing resumes

No user impact.


🔥 Real Production Example

Coupled:

Upload → Resize image immediately → Return response

Decoupled:

Upload → Return 200 OK
         ↓
      Resize async in background

That is enterprise-grade design.


⚖️ Simple Comparison Table

| Feature | Coupled | Decoupled |
| --- | --- | --- |
| Direct dependency | Yes | No |
| Queue buffering | No | Yes |
| Failure isolation | No | Yes |
| Horizontal scaling | Hard | Easy |
| Production safe | No | Yes |

🧠 In Interviews

If asked:

"What is decoupling?"

Strong answer:

"Decoupling means separating services using asynchronous communication like queues so that failure or scaling of one service does not directly impact others."


🎯 Your Current Project

You built:

S3 → SNS → SQS → Worker

That is a decoupled system.

You moved from:

Web → Worker (direct)

to:

Web → Queue → Worker (indirect)

That is architectural maturity.
