Part 1 — Introduction
This lecture explains:
- What cloud infrastructure really is
- The difference between EC2, ECS, Fargate, Kubernetes, and a load balancer
- Why companies use cloud
- Why SRE and DevOps engineers exist
- What observability means
- Difference between metrics and logs
- Why we use Prometheus, Grafana, Loki, Alloy, and Node Exporter
- Real production architecture
- Real troubleshooting scenarios
- ECS/Fargate deployment flow
- Why sidecars exist
- Why modern systems use centralized observability
This lecture is based on real troubleshooting and deployment scenarios.
Part 2 — What Is Really Happening In The Cloud?
Many beginners think:
"AWS is just hosting my website."
But in reality AWS provides:
- Physical data centers
- Servers
- Networking
- Internet routing
- Storage
- Virtualization
- Hypervisors
- Security
- Scaling infrastructure
- High availability
- Global infrastructure
When you deploy an application to AWS:
You are renting compute resources from AWS.
Part 3 — Server vs Virtual Machine
Physical Server
A physical server is:
- Real hardware
- CPU
- RAM
- Storage
- Network cards
- Power supply
Inside AWS data centers there are thousands of physical servers.
Virtual Machine (VM)
A virtual machine is:
- Software-defined computer
- Runs on top of physical server
- Has virtual CPU
- Virtual RAM
- Virtual storage
One physical server can run many virtual machines.
Part 4 — Why Cloud Exists
Without cloud:
You would need:
- Your own server room
- Cooling
- Electricity
- Networking
- Internet provider
- Firewalls
- Routers
- Hardware replacement
- OS patching
- Security
- Scaling
Cloud providers solve this.
Part 5 — Docker
Docker solves:
"How do we package applications consistently?"
A Docker image packages everything the container needs:
- Application
- Libraries
- Dependencies
- Runtime
- Configuration
The same image runs consistently on:
- a laptop
- EC2
- ECS
- Kubernetes
- other clouds
Part 6 — Why Docker Alone Is Not Enough
If you run only one container:
Docker alone may be enough.
But large systems need:
- scaling
- failover
- networking
- deployment automation
- service discovery
- self-healing
- orchestration
This is why Kubernetes and ECS exist.
Part 7 — ECS vs Kubernetes
ECS
AWS-native container orchestrator.
Simpler.
Good integration with AWS.
Kubernetes
Industry-standard orchestration platform.
More powerful.
More complex.
Used heavily by large enterprises.
Part 8 — What Is ECS?
ECS = Elastic Container Service.
ECS does:
- Runs containers
- Restarts failed containers
- Deploys applications
- Scales containers
- Handles networking
- Manages task lifecycle
ECS is NOT:
- Load balancer
- Database
- Monitoring system
Part 9 — What Is Fargate?
Fargate is serverless container infrastructure.
AWS manages:
- servers
- patching
- hypervisor
- scaling infrastructure
- hardware
- OS maintenance
You manage only:
- containers
- task definitions
- services
Part 10 — ECS Cluster Does NOT Mean One Machine
Important concept.
ECS cluster is:
Logical grouping for tasks/services.
One ECS cluster can run:
- many tasks
- many services
- many applications
Example:
- app-service
- grafana-service
- prometheus-service
- loki-service
- alloy-service
all inside ONE ECS cluster.
Part 11 — ECS Task Definition
Task definition describes:
- container image
- CPU
- memory
- ports
- environment variables
- IAM roles
- commands
- networking
Task definition is like:
Blueprint/template for containers.
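As a rough illustration of the fields listed above, a Fargate task definition can be written as a plain Python dictionary that mirrors the JSON ECS accepts. This is a sketch only: the image name, account ID, role ARNs, and environment variable are hypothetical placeholders, not values from the lab.

```python
import json

# Hypothetical values: image, account ID, and ARNs are placeholders.
task_definition = {
    "family": "app-service",
    "networkMode": "awsvpc",                  # the network mode Fargate requires
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",                             # task-level CPU units (0.25 vCPU)
    "memory": "512",                          # task-level memory in MiB
    "executionRoleArn": "arn:aws:iam::123456789012:role/exampleExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/exampleTaskRole",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "example/app:1.0",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "production"}],
            "command": ["python", "app.py"],
        }
    ],
}

# Serialize to the JSON form that would be registered with ECS.
print(json.dumps(task_definition, indent=2))
```

Each deployment of a changed definition creates a new revision, which is what a service later rolls out.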
Part 12 — ECS Service
Service keeps the desired number of tasks running.
If a container crashes and its task stops:
ECS service launches a replacement task automatically.
This is real production behavior.
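The behavior described above can be pictured as a reconciliation loop: compare what is running with what is desired, and start replacements for the difference. The toy model below is an assumption-laden sketch, not how ECS is implemented; `Task` and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple task ID generator for the simulation


@dataclass
class Task:
    task_id: int = field(default_factory=lambda: next(_ids))
    status: str = "RUNNING"


def reconcile(tasks, desired_count):
    """Drop stopped tasks and start replacements until desired_count are running."""
    running = [t for t in tasks if t.status == "RUNNING"]
    while len(running) < desired_count:
        running.append(Task())
    return running


tasks = [Task(), Task()]
tasks[0].status = "STOPPED"              # simulate a crashed container
tasks = reconcile(tasks, desired_count=2)
print([(t.task_id, t.status) for t in tasks])
```

The stopped task is not repaired in place; it is dropped and a new task takes its slot, which matches how replacement tasks get new IDs in ECS.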
Part 13 — Load Balancer vs ECS
Many beginners confuse this.
Load Balancer (ALB)
ALB ONLY distributes traffic.
ALB does NOT:
- run applications
- restart containers
- deploy apps
ALB routes traffic.
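To make "routes traffic" concrete, here is a minimal sketch of round-robin routing across healthy targets, which is the default algorithm an ALB uses. The target names and health states are hypothetical, and a real ALB also does health checks, listener rules, and connection handling that this toy ignores.

```python
from itertools import cycle


def healthy_targets(targets):
    """Return the names of targets currently passing health checks."""
    return [name for name, healthy in targets.items() if healthy]


# Hypothetical targets: task-b is failing its health check.
targets = {"task-a": True, "task-b": False, "task-c": True}

rotation = cycle(healthy_targets(targets))
routed = [next(rotation) for _ in range(4)]
print(routed)  # ['task-a', 'task-c', 'task-a', 'task-c']
```

Note that nothing here starts or restarts `task-b`; the load balancer only stops sending it traffic.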
ECS
ECS actually:
- launches containers
- restarts containers
- scales tasks
- deploys revisions
Part 14 — Real Production Architecture
Simple Architecture
Users
↓
Public IP
↓
EC2
↓
Docker container
Good for small applications.
Better Production Architecture
Users
↓
CloudFront CDN
↓
ALB
↓
ECS Fargate
↓
Containers
Part 15 — Why CloudFront Exists
CloudFront is a CDN (content delivery network).
It distributes content globally.
Without CDN:
All users hit one region.
Example:
Only us-east-1.
Users in Asia experience higher latency.
CloudFront caches content at edge locations closer to users.
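The caching idea can be sketched as a small edge cache in front of an origin: the first request for a path goes to the origin, repeat requests are served from the edge. `EdgeCache` and `origin` are hypothetical names, and the sketch leaves out TTLs, invalidation, and cache keys that a real CDN relies on.

```python
class EdgeCache:
    """Toy edge location: serves cached content, falls back to the origin."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_requests = 0

    def get(self, path):
        if path not in self._store:          # cache miss: go to the origin once
            self.origin_requests += 1
            self._store[path] = self._fetch(path)
        return self._store[path]             # cache hit: served from the edge


def origin(path):
    # Stand-in for the application behind the ALB.
    return f"content of {path}"


edge = EdgeCache(origin)
for _ in range(3):
    body = edge.get("/index.html")

print(edge.origin_requests)  # 1
```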
Part 16 — What Is SRE?
SRE = Site Reliability Engineering.
SRE focuses on:
- uptime
- monitoring
- reliability
- scaling
- observability
- alerting
- automation
- troubleshooting
Part 17 — What Is Observability?
Observability means:
Understanding system behavior.
Three major pillars:
- Metrics
- Logs
- Traces
Part 18 — Metrics vs Logs
Metrics
Metrics answer:
"WHAT is wrong?"
Examples:
- CPU 90%
- Memory 85%
- Request latency
- Error rate
Metrics are numerical time-series data.
Logs
Logs answer:
"WHY is it wrong?"
Examples:
- database timeout
- authentication failure
- stack trace
- nginx 500 error
Logs are text.
Part 19 — Prometheus
Prometheus stores metrics.
Prometheus is:
Time-series database.
Prometheus stores:
- CPU history
- Memory history
- Request history
- Error history
- Latency history
Prometheus is queried with:
PromQL.
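A common PromQL pattern is taking the per-second rate of a counter, for example `rate(http_requests_total[1m])` (the metric name here is only illustrative). The sketch below shows the core arithmetic on hypothetical samples. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window boundaries.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("need samples at two different timestamps")
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples: (unix_seconds, total_requests_so_far)
samples = [(1000, 500), (1015, 530), (1030, 590), (1060, 680)]

print(per_second_rate(samples))  # 3.0 requests per second over the window
```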
Part 20 — Grafana
Grafana visualizes telemetry.
Grafana itself does NOT store:
- metrics
- logs
Grafana reads from:
- Prometheus
- Loki
- Tempo
- CloudWatch
- Elasticsearch
Grafana creates:
- dashboards
- alerts
- graphs
- log search
Part 21 — Loki
Loki stores logs centrally.
Instead of:
logging into every machine,
Loki centralizes logs.
All systems send logs into Loki.
Grafana can search them.
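For a sense of what "sending logs into Loki" looks like on the wire, the sketch below builds the JSON body that Loki's push endpoint (`/loki/api/v1/push`) accepts: labeled streams, each with nanosecond timestamps and log lines. In this lab Alloy does the sending; `build_loki_payload`, the labels, and the log line are hypothetical.

```python
import json
import time


def build_loki_payload(labels, lines, now_ns=None):
    """Build a Loki push body: one stream with [timestamp_ns, line] pairs."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    return {
        "streams": [
            {
                "stream": labels,  # labels identify the log stream
                "values": [[str(now_ns + i), line] for i, line in enumerate(lines)],
            }
        ]
    }


payload = build_loki_payload(
    {"service": "app-service", "level": "error"},
    ["database timeout after 30s"],
    now_ns=1700000000000000000,  # fixed timestamp so the output is reproducible
)
print(json.dumps(payload, indent=2))
```

Because every stream carries labels such as `service`, Grafana can filter logs from all systems in one place instead of per machine.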
Part 22 — Alloy
Grafana Alloy is a telemetry pipeline agent.
Alloy can:
- collect metrics
- collect logs
- collect traces
- forward telemetry
Alloy sends data to:
- Prometheus
- Loki
- Tempo
- Grafana Cloud
Important:
Alloy is NOT main storage.
Alloy transports telemetry.
Part 23 — Node Exporter
Node Exporter exposes Linux host metrics.
Examples:
- CPU usage
- RAM usage
- Disk usage
- Filesystem metrics
- Network metrics
Node Exporter exposes an HTTP metrics endpoint (port 9100 by default):
/metrics
Prometheus scrapes it at a regular interval.
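What Prometheus reads from that endpoint is plain text, one sample per line. The sketch below parses a small, made-up excerpt in that format. The metric names are real Node Exporter metrics, but the values are invented, and the parser is deliberately simplified (no timestamps, no escaped quotes in labels).

```python
import re

# Made-up excerpt of a /metrics response; values are illustrative only.
SAMPLE = '''\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
node_memory_MemAvailable_bytes 2.147483648e+09
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```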
Part 24 — Why Node Exporter Is Different In Fargate
In EC2:
You control Linux host.
You can install Node Exporter.
In Fargate:
AWS hides:
- host OS
- kernel
- hardware
- hypervisor
So you cannot install Node Exporter on the Fargate host.
Instead we use:
- ECS telemetry
- Alloy
- OpenTelemetry
- CloudWatch metrics
Part 25 — Sidecar Containers
Sidecar means:
A second container that runs inside the same task (ECS) or pod (Kubernetes) as the application.
Example:
- application container
- alloy sidecar
Why?
Sidecar can:
- collect local logs
- collect metrics
- forward telemetry
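In a task definition, a sidecar is simply a second entry in `containerDefinitions`. The sketch below shows that shape as a Python list. The application image and port are hypothetical, and marking the sidecar as non-essential is an assumption about how you would want failures handled, not something the lab prescribes.

```python
container_definitions = [
    {
        "name": "app",
        "image": "example/app:1.0",           # hypothetical application image
        "essential": True,
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    },
    {
        "name": "alloy",
        "image": "grafana/alloy:latest",
        "essential": False,                    # assumption: app keeps running if the sidecar stops
        "command": ["run", "/etc/alloy/fargate.alloy"],
    },
]

names = [c["name"] for c in container_definitions]
print(names)  # ['app', 'alloy']
```

On Fargate, containers in the same task share a network namespace, so the sidecar can reach the application on `localhost`.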
Part 26 — Why We Separate Services
Real production architecture separates:
- application
- Prometheus
- Grafana
- Loki
- Alloy
Why?
Different scaling requirements.
Different CPU usage.
Different memory usage.
Avoid single point of failure.
Part 27 — Final Production Architecture
Internet
↓
CloudFront
↓
ALB
↓
ECS Fargate Application
↓
Metrics → Prometheus
Logs → Alloy → Loki
↓
Grafana dashboards
Part 28 — IAM Roles In ECS
Two important roles.
Task Execution Role
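Used by ECS infrastructure on your behalf, before the application starts.
Allows:
- pulling container images (for example from ECR)
- writing container logs to CloudWatch Logs
- ECS startup actions, such as fetching secrets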
Task Role
Used by application container.
Allows application access to:
- S3
- DynamoDB
- Secrets Manager
- AWS APIs
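Both roles are ordinary IAM roles that the ECS tasks service is allowed to assume; what differs is the permissions attached. The sketch below writes the trust policy and one example task role permission as Python dictionaries. The bucket name is a hypothetical placeholder, and the single `s3:GetObject` permission is only an illustration of least privilege.

```python
import json

# Trust policy: lets ECS tasks assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Example task role permissions: read-only access to one hypothetical bucket.
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(task_role_policy, indent=2))
```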
Part 29 — Security Groups
Security Groups are virtual firewalls.
They control:
- inbound traffic
- outbound traffic
Example inbound rules, allow:
- HTTP 80
- HTTPS 443
- Grafana 3000
- Prometheus 9090
- Loki 3100
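Security groups only contain allow rules; inbound traffic that matches no rule is dropped. The toy check below models that default-deny behavior for the ports listed above. It is a simplification: real rules also match on protocol and source (a CIDR range or another security group), and `is_inbound_allowed` is a hypothetical helper.

```python
# Ports from the example above, mapped to the service they expose.
ALLOWED_INBOUND = {
    80: "HTTP",
    443: "HTTPS",
    3000: "Grafana",
    9090: "Prometheus",
    3100: "Loki",
}


def is_inbound_allowed(port):
    """Default deny: only ports with an explicit allow rule get through."""
    return port in ALLOWED_INBOUND


for port in (443, 3100, 22):
    print(port, is_inbound_allowed(port))
```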
Part 30 — ECS Troubleshooting Learned In Lab
Real troubleshooting scenarios encountered:
Image Pull Failure
CannotPullContainerError
403 Forbidden
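Cause:
The task had no permission to pull from the private registry.
Fix used in the lab:
Use public container image.
In production, the usual fix is to grant the task execution role access to the private registry instead.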
Deployment Rollback
ECS deployment rollback failed.
Cause:
Containers failing during deployment.
Alloy Command Parsing Issue
Wrong command (treated as one single argument):
run /etc/alloy/fargate.alloy
Correct comma-separated syntax, which ECS stores as an array of arguments:
run,/etc/alloy/fargate.alloy
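The difference between the two forms is easier to see once the field is split into an argument list. The sketch below assumes the command field is split on commas, which matches the behavior observed in the lab; `console_command_to_array` is a hypothetical helper, not an AWS function.

```python
def console_command_to_array(text):
    """Split a comma-delimited command field into an argument list."""
    return [part.strip() for part in text.split(",") if part.strip()]


wrong = console_command_to_array("run /etc/alloy/fargate.alloy")
right = console_command_to_array("run,/etc/alloy/fargate.alloy")

print(wrong)  # ['run /etc/alloy/fargate.alloy']   one argument, not a valid subcommand
print(right)  # ['run', '/etc/alloy/fargate.alloy'] subcommand plus config path
```

In task definition JSON the correct form is written directly as `["run", "/etc/alloy/fargate.alloy"]`.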
Missing Config File
Alloy failed because:
/etc/alloy/fargate.alloy
was not mounted.
Part 31 — Real SRE Workflow
Real workflow:
Deploy
↓
Observe failure
↓
Read logs
↓
Find root cause
↓
Fix configuration
↓
Redeploy
↓
Validate telemetry
This is real production engineering.
Part 32 — Why Metrics And Logs Together Matter
Metrics tell:
WHAT is wrong.
Logs tell:
WHY it is wrong.
Example:
Metrics:
CPU 95%
Logs:
Database timeout causing retries.
Together they explain outages.
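One way to picture using both signals together is to find when a metric crossed a threshold and then pull the log lines from around that time. The sketch below does this on hypothetical data; the threshold, window, timestamps, and log messages are all invented for illustration.

```python
# Hypothetical data: timestamps are seconds since the start of the incident window.
cpu_samples = [(0, 40.0), (60, 55.0), (120, 95.0), (180, 96.0)]
log_lines = [
    (30, "GET /health 200"),
    (115, "database timeout, retrying query"),
    (170, "database timeout, retrying query"),
]


def logs_near_spikes(samples, logs, threshold=90.0, window=30):
    """Return log lines within `window` seconds of any sample above threshold."""
    spikes = [t for t, value in samples if value >= threshold]
    return [
        (t, msg)
        for t, msg in logs
        if any(abs(t - s) <= window for s in spikes)
    ]


for t, msg in logs_near_spikes(cpu_samples, log_lines):
    print(t, msg)
```

The metric says when and what (CPU above 90%), and the matching log lines suggest why (database timeouts causing retries).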
Part 33 — Why Modern Systems Need Observability
Modern systems are distributed.
Many:
- containers
- services
- APIs
- databases
- networks
Without observability:
troubleshooting becomes guesswork.
Part 34 — What Students Learned In This Lab
Students learned:
- ECS
- Fargate
- Task Definitions
- Services
- IAM Roles
- Security Groups
- CloudWatch Logs
- Deployment failures
- Rollbacks
- Container troubleshooting
- Sidecars
- Metrics vs logs
- Observability
- Prometheus
- Grafana
- Loki
- Alloy
- Distributed systems thinking
Part 35 — Final Important Concepts
Docker
Packages application.
ECS/Kubernetes
Runs applications reliably.
ALB
Routes traffic.
CloudFront
Distributes globally.
Prometheus
Stores metrics.
Loki
Stores logs.
Grafana
Visualizes telemetry.
Alloy
Collects/transports telemetry.
Node Exporter
Produces Linux host metrics.
Part 36 — Enterprise SRE Mindset
Modern SRE engineers think about:
- scalability
- observability
- automation
- reliability
- distributed systems
- telemetry
- infrastructure
- deployment safety
- failure recovery
- centralized monitoring
This is the foundation of modern cloud-native engineering.