- Prepare the existing EC2 (source instance)
- Confirm the instance is fully configured (packages, files, worker.py, configs).
- Verify the file you care about exists (example):

```bash
ls /home/ubuntu
cat /home/ubuntu/worker.py | head
```
- Create a custom AMI from the existing EC2
- EC2 → Instances → select the configured instance (e.g., worker-server)
- Actions → Image and templates → Create image
- Name it (example: image-sqs-template)
- Keep Reboot instance enabled (recommended)
- Create image
- EC2 → AMIs → wait until Status = Available
- Create a Launch Template using that AMI
- EC2 → Launch templates → Create launch template
- Choose the AMI from My AMIs: image-sqs-template (example AMI: ami-00be6270b4bced645)
- Set instance type (e.g., t2.small/t2.micro)
- Select key pair (e.g., key)
- Select security group (same one you used before)
- Advanced details → IAM instance profile: select your role (e.g., worker-ec2-role)
- Leave User data empty (because config is already baked into the AMI)
- Click Create launch template
- Create the Auto Scaling Group (ASG)
- EC2 → Auto Scaling Groups → Create Auto Scaling group
- Name the ASG (example: asg-ana)
- Select launch template: my-lc-template
- Version: Latest (important)
- Choose VPC (default VPC is fine)
- Select subnets in 2–3 AZs (e.g., us-east-2a, 2b, 2c)
- Load balancer: No load balancer (for worker pattern)
- Health checks: EC2 only
- Grace period: 60 seconds
- Set capacity
- Desired = 1
- Min = 1
- Max = 2
- Scaling policies: None (for lab)
- Create ASG and verify
- Click Create Auto Scaling group
- EC2 → Instances → open the new instance created by ASG
- Confirm:
  - Auto Scaling group name shows your ASG
  - AMI ID matches your custom AMI
  - IAM role attached correctly
- SSH to the new instance and verify config/files:
  - ls /home/ubuntu → confirm worker.py exists
- Optional (production best practice)
  - If you want worker.py to start automatically on every new ASG instance, create:
    - a systemd service, or
    - a user-data bootstrap script.
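As a sketch of the systemd option (paths and names here are assumptions — adjust to wherever your worker.py and Python interpreter actually live), a minimal unit file might look like:

```ini
# /etc/systemd/system/worker.service  (hypothetical path and names)
[Unit]
Description=SQS worker
After=network-online.target
Wants=network-online.target

[Service]
User=ubuntu
ExecStart=/usr/bin/python3 /home/ubuntu/worker.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now worker.service`; because the unit is baked into the AMI, every new ASG instance starts the worker automatically.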
You currently have:
S3 → SNS → SQS → EC2 Worker (ASG)
Why Replace EC2 Worker With Lambda?
Your worker does this:
- Wait for SQS message
- Process file
- Move file in S3
That is event-driven processing.
Lambda is designed exactly for:
Event-driven workloads.
EC2 Worker vs Lambda
EC2 Worker (What You Built)
Pros:
- Full control
- Long-running tasks possible
- Custom OS-level configuration
Cons:
- You manage servers
- You patch OS
- You manage scaling rules
- You pay even when idle
- Cold worker may exist doing nothing
Lambda
Pros:
- No server management
- Auto-scales automatically
- Pay only per execution
- Built-in SQS integration
- No ASG needed
Cons:
- 15-minute execution limit
- Limited control over OS
- Cold starts possible
When Production Teams Switch to Lambda
They switch when:
- Workload is event-driven
- Tasks are short-lived
- Traffic is unpredictable
- They want lower operational overhead
Your worker is a perfect Lambda candidate.
How Lambda Autoscaling Works
With SQS trigger:
Lambda automatically:
- Polls SQS
- Creates parallel executions
- Scales based on queue depth
If 1 message → 1 Lambda invocation
If 1000 messages → many concurrent Lambda executions
No ASG needed.
No instance management.
Real Technical Explanation
Lambda + SQS integration:
AWS manages:
- Polling SQS
- Batching messages
- Concurrency control
- Scaling workers automatically
Scaling happens based on:
ApproximateNumberOfMessagesVisible
It increases concurrency automatically.
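As a rough intuition (this is a toy model, not AWS's actual scaling algorithm, which ramps concurrency gradually and respects account limits), the concurrency Lambda works toward is driven by the visible backlog divided by batch size:

```python
import math

def estimated_concurrency(visible_messages: int, batch_size: int,
                          max_concurrency: int = 1000) -> int:
    """Toy estimate: enough parallel invocations to drain the backlog,
    one batch per invocation, capped by the concurrency limit.
    Illustration only — AWS's real scaling behavior is more gradual."""
    if visible_messages <= 0:
        return 0
    return min(math.ceil(visible_messages / batch_size), max_concurrency)

print(estimated_concurrency(1, 5))     # 1 message  → 1 invocation
print(estimated_concurrency(1000, 5))  # deep backlog → many parallel invocations
```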
Example Scaling Comparison
EC2 ASG:
- You configure scaling policy
- Based on CPU or SQS metric
- ASG launches new VM (takes 1–2 minutes)
Lambda:
- No VM
- No boot time
- New execution in milliseconds
- Fully managed scaling
Why Big Companies Use Lambda for Workers
Because:
- No infrastructure management
- No patching
- No OS vulnerabilities
- No idle cost
- Perfect for background processing
For web servers → EC2 or ECS or Kubernetes
For background jobs → often Lambda
Your Architecture With Lambda
Instead of:
SQS → EC2 Worker
It becomes:
SQS → Lambda → S3
Much simpler.
Does Lambda Replace EC2 Completely?
No.
Lambda is best for:
- Background jobs
- Image processing
- Notifications
- API microservices (small)
Not good for:
- Long-running processes
- High-memory heavy tasks
- Stateful services
Cost Example
If worker idle 23 hours/day:
EC2:
You pay 24 hours.
Lambda:
You pay only for execution time.
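To make the comparison concrete, here is a back-of-the-envelope calculation. All prices below are assumptions for illustration — check current AWS pricing for your region before relying on them:

```python
# Assumed prices (illustrative only — verify against current AWS pricing)
EC2_T2_SMALL_HOURLY = 0.023        # assumed on-demand $/hour
LAMBDA_GB_SECOND = 0.0000166667    # assumed $/GB-second of compute
LAMBDA_PER_REQUEST = 0.0000002     # assumed $/request

def ec2_daily_cost(hours: float = 24.0) -> float:
    # EC2 bills for every hour the instance runs, idle or not
    return EC2_T2_SMALL_HOURLY * hours

def lambda_daily_cost(invocations: int, avg_seconds: float, memory_gb: float) -> float:
    # Lambda bills only for execution time plus a per-request fee
    compute = invocations * avg_seconds * memory_gb * LAMBDA_GB_SECOND
    requests = invocations * LAMBDA_PER_REQUEST
    return compute + requests

print(round(ec2_daily_cost(), 4))                                        # always-on cost
print(round(lambda_daily_cost(invocations=1000, avg_seconds=2,
                              memory_gb=0.25), 4))                       # pay-per-use cost
```

With these assumed numbers, 1,000 two-second invocations a day at 256 MB cost well under a cent, versus roughly 55 cents for an always-on t2.small.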
So When Should You Switch?
If:
- Processing is short (<15 minutes)
- Stateless
- Event-driven
- Spiky workload
Then Lambda is better.
In Production Architecture
Very common pattern:
Web Tier → EC2 / ALB
Background Jobs → Lambda
Streaming → Kinesis
Storage → S3
Important DevOps Skill
You should know both:
- How to scale EC2 with ASG
- How to build serverless with Lambda
Interviewers love when you compare both.
0) Confirm what you have now
You already have:
- S3 bucket with folders/prefixes: uploads/, processed/
- SNS topic wired to SQS (S3 event → SNS → SQS), or S3 event direct to SNS
- SQS queue receiving messages
- Worker logic that:
  - reads SQS message
  - gets S3 object key from event
  - copies/moves object from uploads/ to processed/
Lambda will do the same, but AWS will poll SQS automatically.
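The core of that "move" is just a key rewrite (the actual move is a copy plus delete, shown later). A minimal sketch of the mapping, using the same prefixes as this guide's env vars — note that S3 event keys arrive URL-encoded, so spaces come through as `+`:

```python
from urllib.parse import unquote_plus

SRC_PREFIX = "uploads/"
DST_PREFIX = "processed/"

def destination_key(raw_key: str):
    """Map an S3 event key from uploads/ to processed/.
    Decodes URL-encoded keys first; returns None for keys
    outside SRC_PREFIX or for the bare prefix itself."""
    key = unquote_plus(raw_key)
    if not key.startswith(SRC_PREFIX):
        return None
    filename = key[len(SRC_PREFIX):]
    return f"{DST_PREFIX}{filename}" if filename else None

print(destination_key("uploads/t1.txt"))        # processed/t1.txt
print(destination_key("uploads/my+photo.jpg"))  # processed/my photo.jpg
```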
1) Make sure your SQS message format is what Lambda will read
Your messages are likely SNS-wrapped S3 events (common pattern):
- Lambda receives an SQS record
- record["body"] might be:
  - an SNS envelope JSON, with "Message" containing the actual S3 event JSON
  - or directly the S3 event JSON
We’ll support both formats in the Lambda code.
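For reference, an SNS-wrapped SQS body looks roughly like this (fields trimmed, topic ARN and account ID are placeholders) — note the inner "Message" is itself a JSON string holding the S3 event:

```json
{
  "Type": "Notification",
  "TopicArn": "arn:aws:sns:us-east-2:123456789012:your-topic",
  "Message": "{\"Records\":[{\"eventSource\":\"aws:s3\",\"s3\":{\"bucket\":{\"name\":\"YOUR_BUCKET_NAME\"},\"object\":{\"key\":\"uploads/t1.txt\"}}}]}"
}
```

In the direct-delivery case, the body is just the inner Message string without the envelope.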
2) Create IAM Role for Lambda
2.1 Create a role
AWS Console → IAM → Roles → Create role
- Trusted entity: AWS service
- Use case: Lambda
- Role name: lambda-sqs-s3-worker-role
2.2 Attach permissions (minimum required)
Attach (or create a custom policy) with:
A) CloudWatch logs
- AWSLambdaBasicExecutionRole (managed policy)
B) Read from SQS
- Either the managed policy AWSLambdaSQSQueueExecutionRole (recommended)
- or custom SQS permissions:
  - sqs:ReceiveMessage
  - sqs:DeleteMessage
  - sqs:GetQueueAttributes
  - sqs:ChangeMessageVisibility
C) Access S3 objects
Custom policy example (replace bucket name):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWriteSpecificBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
    },
    {
      "Sid": "S3ListBucketForPrefixOps",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME"]
    }
  ]
}
```
3) Create the Lambda function
AWS Console → Lambda → Create function
- Author from scratch
- Name: sqs-s3-worker
- Runtime: Python 3.12 (or 3.11)
- Execution role: Use existing role → lambda-sqs-s3-worker-role
4) Add environment variables
Lambda → Configuration → Environment variables:
- REGION=us-east-2
- BUCKET=YOUR_BUCKET_NAME
- SRC_PREFIX=uploads/
- DST_PREFIX=processed/
(You can hardcode, but env vars are cleaner.)
5) Paste Lambda code (supports SNS-wrapped + direct S3 events)
Lambda → Code → lambda_function.py:
```python
import json
import os
import boto3
from urllib.parse import unquote_plus

REGION = os.environ.get("REGION", "us-east-2")
BUCKET = os.environ.get("BUCKET", "")
SRC_PREFIX = os.environ.get("SRC_PREFIX", "uploads/")
DST_PREFIX = os.environ.get("DST_PREFIX", "processed/")

s3 = boto3.client("s3", region_name=REGION)


def _extract_s3_records_from_sqs_body(body_str: str):
    """
    Returns a list of S3 event records (dicts).
    Handles:
    - Direct S3 event JSON in the SQS body
    - SNS envelope JSON in the SQS body, where body["Message"] is the S3 event JSON string
    """
    # Try to parse JSON
    try:
        body = json.loads(body_str)
    except Exception:
        return []

    # SNS envelope?
    if isinstance(body, dict) and "Message" in body:
        try:
            body = json.loads(body["Message"])
        except Exception:
            return []

    # Now expect an S3 event-like structure
    if isinstance(body, dict) and isinstance(body.get("Records"), list):
        return body["Records"]
    return []


def _move_object(bucket: str, src_key: str, dst_key: str):
    # Copy then delete to simulate "move"
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key
    )
    s3.delete_object(Bucket=bucket, Key=src_key)


def lambda_handler(event, context):
    """
    event comes from the SQS trigger and contains event["Records"] (SQS records)
    """
    if not BUCKET:
        raise ValueError("BUCKET env var is required")

    processed = 0

    # For SQS-triggered Lambda, AWS deletes messages automatically ONLY if the handler succeeds.
    # If we raise an exception, the batch will be retried.
    for sqs_record in event.get("Records", []):
        body_str = sqs_record.get("body", "")
        s3_records = _extract_s3_records_from_sqs_body(body_str)

        if not s3_records:
            # If the message isn't in the expected format, just skip processing.
            # If you want to drop it, you can "succeed" (do nothing).
            # If you want retries/DLQ, raise.
            continue

        for r in s3_records:
            if r.get("eventSource") != "aws:s3":
                continue

            b = r.get("s3", {}).get("bucket", {}).get("name")
            if b and b != BUCKET:
                # If your queue may contain events from other buckets, skip them
                continue

            raw_key = r.get("s3", {}).get("object", {}).get("key", "")
            src_key = unquote_plus(raw_key)

            # Only move files from SRC_PREFIX
            if not src_key.startswith(SRC_PREFIX):
                continue

            filename = src_key[len(SRC_PREFIX):]
            if not filename:
                continue

            dst_key = f"{DST_PREFIX}{filename}"
            _move_object(BUCKET, src_key, dst_key)
            processed += 1

    return {"status": "ok", "moved": processed}
```
Notes:
- This is “copy + delete” to move objects (standard S3 pattern).
- If you want strictness (a bad message should go to the DLQ), change continue to raise.
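To exercise the handler, you can paste a direct-format test event into the Lambda console's Test tab (bucket name and key are placeholders — and note the handler will attempt a real S3 copy, so the object must exist):

```json
{
  "Records": [
    {
      "messageId": "test-1",
      "body": "{\"Records\":[{\"eventSource\":\"aws:s3\",\"s3\":{\"bucket\":{\"name\":\"YOUR_BUCKET_NAME\"},\"object\":{\"key\":\"uploads/t1.txt\"}}}]}"
    }
  ]
}
```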
6) Add SQS trigger to Lambda
Lambda → Configuration → Triggers → Add trigger
- Trigger: SQS
- Choose your queue (the same queue SNS sends to)
- Settings:
  - Batch size: start with 1–5 (simple debugging)
  - Activate trigger: Enabled
Important: Lambda will now poll SQS for you.
7) Set SQS queue settings for Lambda (important production settings)
Go to SQS → your queue → Edit:
7.1 Visibility timeout
Set Visibility timeout > Lambda timeout
Example:
- Lambda timeout = 30 seconds
- SQS visibility timeout = 2 minutes (120 seconds)
Rule:
- Visibility timeout should be ~6x Lambda timeout in many real setups.
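That rule of thumb is easy to encode (it is a convention, not an AWS hard requirement — the hard requirement is simply that visibility timeout exceeds the function timeout so an in-flight batch can't reappear and be processed twice):

```python
def recommended_visibility_timeout(lambda_timeout_s: int, multiplier: int = 6) -> int:
    """Common rule of thumb: make the queue's visibility timeout
    several times the Lambda function timeout."""
    return lambda_timeout_s * multiplier

print(recommended_visibility_timeout(30))  # 180 seconds with the ~6x rule
```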
7.2 Dead-letter queue (DLQ)
Create a DLQ (recommended):
- Create another SQS queue: your-queue-dlq
- Set a DLQ redrive policy on the main queue:
  - Max receive count: 3 (or 5)
If Lambda keeps failing a message, it will end up in the DLQ for investigation.
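The redrive policy itself is a small JSON document attached to the main queue (the ARN and account ID below are placeholders); note maxReceiveCount is a string in this attribute:

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-2:123456789012:your-queue-dlq",
  "maxReceiveCount": "3"
}
```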
8) Configure Lambda timeout and memory
Lambda → Configuration → General configuration → Edit:
- Timeout: 30 seconds (start)
- Memory: 256 MB (start)
If you process big files or do heavy work, increase memory.
9) Testing (end-to-end)
9.1 Upload a file into uploads/
From web server or locally:
```bash
echo "hello" > t1.txt
aws s3 cp t1.txt s3://YOUR_BUCKET_NAME/uploads/t1.txt --region us-east-2
```
9.2 Watch Lambda logs
CloudWatch → Logs → Log groups → /aws/lambda/sqs-s3-worker
You should see function invocations.
9.3 Confirm S3 move
S3 → bucket → processed/ should now contain t1.txt
10) Cutover plan (no downtime)
If you currently still run EC2 worker, do this clean cutover:
- Stop EC2 worker process (so it doesn’t double-process)
- Ensure Lambda trigger is enabled
- Upload new files and confirm only Lambda processes them
- After stable, delete old worker ASG (optional)
Because SQS buffers messages, there is no downtime risk.
Your current pipeline:
User → CloudFront → S3 → SNS → SQS → Worker
That is only the file-processing subsystem, not the entire system.
In real production, that subsystem sits inside a much bigger architecture.
1️⃣ Full Production Architecture (Realistic)
Typical modern web system looks like this:
User (Browser / Mobile App)
↓
CloudFront (CDN)
↓
ALB (Load Balancer)
↓
Web / API Layer (EC2 / ECS / EKS)
↓
Application Logic
↓
Database (RDS / Aurora / DynamoDB)
Now your S3/SQS worker pipeline becomes a background processing layer attached to this.
2️⃣ Where Your File Processing Fits
Here’s the realistic architecture:
┌───────────────────────┐
│ CloudFront │
└─────────────┬─────────┘
↓
┌───────────────┐
│ ALB │
└──────┬────────┘
↓
┌────────────────────┐
│ Web/API Backend │
│ (EC2/ECS/EKS) │
└──────┬─────────────┘
↓
Database
(RDS / Aurora)
───────────── File Upload Path ─────────────
User → Web/API → S3 (private, uploads/)
↓
SNS
↓
SQS
↓
Worker (Lambda / ASG)
↓
S3 (processed/)
↓
CloudFront
↓
User
Now it makes sense.
3️⃣ Let’s Break Each Layer
Frontend
- React / Angular / Next.js app
- Static files stored in S3
- Served via CloudFront
CloudFront handles:
- CDN
- TLS
- Caching
- WAF
- Rate limiting
Backend (API Layer)
This is where:
- Authentication happens
- Business logic runs
- API endpoints exist
- DB queries happen
Usually hosted on:
- EC2 + ASG
- ECS
- EKS (Kubernetes)
- Sometimes Lambda (serverless API)
Database Layer
Common production DBs:
- RDS (Postgres, MySQL)
- Aurora
- DynamoDB
- ElastiCache (Redis)
This stores:
- Users
- Orders
- Metadata about files
- Status of processing
Important:
S3 does NOT replace a database.
S3 stores files, not relational data.
Background Processing Layer (Your System)
This is where your S3 → SQS → Worker fits.
Used for:
- Image resizing
- Video transcoding
- PDF parsing
- Email sending
- Report generation
- AI processing
- Data pipelines
This is called:
"Asynchronous processing tier"
4️⃣ Why Separate Background From Backend?
Because:
If image processing takes 5 seconds:
Without queue:
User waits 5 seconds → bad UX.
With queue:
User upload returns immediately.
Processing happens in background.
This improves:
- Performance
- Scalability
- Stability
5️⃣ Real Company Example (E-commerce)
User uploads product image:
- Browser uploads to S3 (via signed URL)
- Backend stores metadata in DB
- S3 triggers SNS/SQS
- Worker resizes image
- Processed image stored
- CloudFront serves image globally
That is real architecture.
6️⃣ Where Does Frontend Sit?
Two common patterns:
Pattern A – Static Frontend
CloudFront → S3 (static React build)
API calls go to:
CloudFront → ALB → Backend
Pattern B – Dynamic App
CloudFront → ALB → Backend → HTML generated
7️⃣ Where Does Database Sit?
Backend talks to DB.
Workers may also talk to DB to update:
- File processing status
- Job status
- Analytics
But S3/SQS does NOT replace DB.
8️⃣ Full Enterprise Diagram
Users
↓
CloudFront
↓
ALB
↓
API Servers (ASG / ECS / EKS)
↓
RDS Database
↓
S3 (uploads)
↓
SNS
↓
SQS
↓
Lambda / Worker ASG
↓
S3 (processed)
↓
CloudFront (deliver files)
That is production-ready architecture.
9️⃣ What You Built So Far
You built:
✔ Background processing tier
✔ Decoupled system
✔ Auto Scaling
✔ Self-healing
✔ Event-driven architecture
Now you only need:
- API layer
- Database layer
- CloudFront for frontend
10️⃣ This Is How Real Companies Organize Teams
- Frontend team
- Backend/API team
- Data/DB team
- Platform/DevOps team
- Background jobs team
Your worker system belongs to platform / async processing team.
🎬 High-Level Netflix-Style Architecture
🧠 Full Logical Flow (Simplified Enterprise Model)
Users (TV / Mobile / Browser)
↓
Route 53 (DNS)
↓
CloudFront (CDN)
↓
Global Load Balancer (ALB / NLB)
↓
Microservices Layer (EKS / ECS / EC2 ASG)
↓
Databases + Caches
↓
Async Systems (Kafka / SQS / EventBridge)
↓
Background Processing Workers
↓
S3 (Media, thumbnails, artifacts)
Now let’s break it down layer by layer.
1️⃣ User Layer
Users connect from:
- Smart TVs
- Phones
- Browsers
- Tablets
Requests hit:
Route 53 (DNS)
AWS's global DNS routing service.
Netflix uses:
- Geo routing
- Health checks
- Multi-region failover
2️⃣ Edge Layer – CloudFront
CloudFront:
- Caches video metadata
- Caches thumbnails
- Reduces backend load
- Protects infrastructure
For video streaming, Netflix also uses its own CDN (Open Connect), but conceptually it’s like CloudFront.
3️⃣ Load Balancing Layer
ALB distributes traffic to:
- Microservices
- API services
- Authentication services
This ensures:
- High availability
- Health checks
- Zero downtime deployments
4️⃣ Microservices Layer (Core Backend)
Netflix runs thousands of microservices.
In AWS terms this would be:
- EKS (Kubernetes)
- ECS
- Or EC2 with ASG
Each service handles:
- Authentication
- Recommendations
- Playback service
- Billing
- User profiles
- Content metadata
This layer is horizontally scalable.
5️⃣ Database Layer
Used for:
- User data
- Viewing history
- Content metadata
- Recommendations
Examples:
Amazon Aurora
Amazon DynamoDB
Amazon ElastiCache
Netflix heavily uses:
- Cassandra (NoSQL)
- Redis
- Aurora-like databases
6️⃣ Asynchronous Messaging Layer
This is where your SQS/SNS idea lives.
Large systems use:
- Kafka
- SQS
- EventBridge
- Streaming pipelines
Used for:
- Viewing analytics
- Logging events
- Recommendation updates
- Background tasks
This prevents synchronous overload.
7️⃣ Background Workers
Workers process:
- Video encoding
- Thumbnail generation
- Recommendation training
- Analytics pipelines
Could run on:
- Lambda
- Kubernetes Jobs
- EC2 ASG workers
Exactly like your system.
8️⃣ Storage Layer
Amazon Simple Storage Service
Stores:
- Video files
- Static assets
- Processed media
- Logs
S3 is the core storage layer.
9️⃣ Observability Layer (Critical in Enterprise)
Netflix-level systems require:
- Metrics
- Logs
- Tracing
- Alerts
AWS equivalents:
- CloudWatch
- X-Ray
- Prometheus
- Grafana
Without this, large systems collapse.
🔟 Multi-Region Architecture
Enterprise systems do NOT run in one region.
They use:
Region A (Primary)
Region B (Failover)
Global Traffic Management
If us-east-1 fails → traffic shifts to us-west-2.
This is called:
Active-Active or Active-Passive architecture.
🧩 Where Your Current Project Fits
Your system represents:
✔ Background processing layer
✔ Event-driven pipeline
✔ Scalable worker tier
✔ Auto-healing infrastructure
It is one subsystem inside enterprise design.
🎯 Key Enterprise Concepts You Should Understand
| Layer | Purpose |
|---|---|
| CloudFront | Edge caching & security |
| ALB | Traffic distribution |
| EKS/ECS | Microservices |
| RDS/Dynamo | Data persistence |
| SQS/Kafka | Decoupling |
| Workers | Async compute |
| S3 | Object storage |
| Multi-Region | Disaster recovery |
🧠 Why This Matters for You
When interviewer asks:
"Explain large-scale architecture for streaming platform"
You can describe:
- Edge layer
- Compute layer
- Data layer
- Async layer
- Scaling strategies
- Failover strategies
That’s senior-level thinking.
🔴 1️⃣ Coupled Architecture (Tightly Coupled)
What This Means
User → Web Server → Worker → Database
Web server calls worker directly.
If worker is:
- Slow ❌
- Crashed ❌
- Overloaded ❌
Then:
- Web server fails
- User request fails
- System becomes unstable
Problems in Coupled Systems
- No buffering
- No independent scaling
- Cascading failures
- Hard deployments
- Tight dependencies
Example:
If image processing takes 5 seconds,
User waits 5 seconds.
If worker crashes,
Web crashes too.
🟢 2️⃣ Decoupled Architecture (Loosely Coupled)
What This Means
User → Web → S3
↓
SQS
↓
Worker
Web does NOT call worker directly.
Instead:
- Web drops a message in queue.
- Worker processes independently.
Benefits
✔ Web responds immediately
✔ Worker can crash — system still works
✔ Queue buffers traffic spikes
✔ Independent scaling
✔ Fault isolation
If worker crashes:
- Messages wait safely in SQS
- ASG or Lambda recreates worker
- Processing resumes
No user impact.
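The whole pattern fits in a few lines. A local sketch using Python's standard library, with queue.Queue standing in for SQS and a thread standing in for the worker tier (names are illustrative, not from the original project):

```python
import queue
import threading

jobs = queue.Queue()   # stands in for SQS
processed = []

def worker():
    # The worker runs independently; the web tier never calls it directly.
    while True:
        job = jobs.get()
        if job is None:            # shutdown signal
            break
        processed.append(f"resized:{job}")
        jobs.task_done()

def handle_upload(filename: str) -> str:
    jobs.put(filename)             # enqueue and return immediately
    return "200 OK"

threading.Thread(target=worker, daemon=True).start()

print(handle_upload("photo.jpg"))  # 200 OK — the user never waits for resizing
jobs.join()                        # (demo only) wait for the background work
print(processed)                   # ['resized:photo.jpg']
```

If the worker thread died here, uploads would still return 200 OK and the jobs would simply sit in the queue — exactly the failure isolation described above.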
🔥 Real Production Example
Coupled:
Upload → Resize image immediately → Return response
Decoupled:
Upload → Return 200 OK
↓
Resize async in background
That is enterprise-grade design.
⚖️ Simple Comparison Table
| Feature | Coupled | Decoupled |
|---|---|---|
| Direct dependency | Yes | No |
| Queue buffering | No | Yes |
| Failure isolation | No | Yes |
| Horizontal scaling | Hard | Easy |
| Production safe | No | Yes |
🧠 In Interviews
If asked:
"What is decoupling?"
Strong answer:
"Decoupling means separating services using asynchronous communication like queues so that failure or scaling of one service does not directly impact others."
🎯 Your Current Project
You built:
S3 → SNS → SQS → Worker
That is a decoupled system.
You moved from:
Web → Worker (direct)
to:
Web → Queue → Worker (indirect)
That is architectural maturity.