Aisalkyn Aidarova

Posted on May 25

Lecture: ECS Fargate, Prometheus, Grafana, Loki, Alloy, and Node Exporter

Part 1 — Introduction

This lecture explains:

What cloud infrastructure really is
The difference between EC2, ECS, Fargate, Kubernetes, and a load balancer
Why companies use cloud
Why SRE and DevOps engineers exist
What observability means
Difference between metrics and logs
Why we use Prometheus, Grafana, Loki, Alloy, and Node Exporter
Real production architecture
Real troubleshooting scenarios
ECS/Fargate deployment flow
Why sidecars exist
Why modern systems use centralized observability

This lecture is based on real troubleshooting and deployment scenarios.

Part 2 — What Is Really Happening In The Cloud?

Many beginners think:

"AWS is just hosting my website."

But in reality AWS provides:

Physical data centers
Servers
Networking
Internet routing
Storage
Virtualization
Hypervisors
Security
Scaling infrastructure
High availability
Global infrastructure

When you deploy an application to AWS:

You are renting compute resources from AWS.

Part 3 — Server vs Virtual Machine

Physical Server

A physical server is:

Real hardware
CPU
RAM
Storage
Network cards
Power supply

Inside AWS data centers there are thousands of physical servers.

Virtual Machine (VM)

A virtual machine is:

Software-defined computer
Runs on top of physical server
Has virtual CPU
Virtual RAM
Virtual storage

One physical server can run many virtual machines.

Part 4 — Why Cloud Exists

Without cloud:

You would need:

Your own server room
Cooling
Electricity
Networking
Internet provider
Firewalls
Routers
Hardware replacement
OS patching
Security
Scaling

Cloud providers solve this.

Part 5 — Docker

Docker solves:

"How do we package applications consistently?"

A Docker container image includes:

Application
Libraries
Dependencies
Runtime
Configuration

A container can run consistently on:

laptop
EC2
ECS
Kubernetes
cloud

Part 6 — Why Docker Alone Is Not Enough

If you run only one container:

Docker alone may be enough.

But large systems need:

scaling
failover
networking
deployment automation
service discovery
self-healing
orchestration

This is why Kubernetes and ECS exist.

Part 7 — ECS vs Kubernetes

ECS

AWS-native container orchestrator.

Simpler.

Good integration with AWS.

Kubernetes

Industry-standard orchestration platform.

More powerful.

More complex.

Used heavily by large enterprises.

Part 8 — What Is ECS?

ECS = Elastic Container Service.

ECS does:

Runs containers
Restarts failed containers
Deploys applications
Scales containers
Handles networking
Manages task lifecycle

ECS is NOT:

Load balancer
Database
Monitoring system

Part 9 — What Is Fargate?

Fargate is a serverless compute engine for containers.

AWS manages:

servers
patching
hypervisor
scaling infrastructure
hardware
OS maintenance

You manage only:

containers
task definitions
services

Part 10 — ECS Cluster Does NOT Mean One Machine

Important concept.

An ECS cluster is:

A logical grouping of tasks and services.

One ECS cluster can run:

many tasks
many services
many applications

Example:

app-service
grafana-service
prometheus-service
loki-service
alloy-service

all inside ONE ECS cluster.

Part 11 — ECS Task Definition

Task definition describes:

container image
CPU
memory
ports
environment variables
IAM roles
commands
networking

A task definition is like:

A blueprint/template for running containers.
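To make the blueprint idea concrete, here is a minimal sketch of a Fargate task definition written as a Python dictionary and printed as JSON. The field names follow the ECS task definition format. The family name, image, account ID, role names, and environment variable are hypothetical placeholders, not values from the lab.

```python
import json

# Hypothetical values throughout: family, image, role ARNs, and env vars
# are placeholders for illustration only.
task_definition = {
    "family": "demo-app",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",   # the network mode Fargate tasks use
    "cpu": "256",              # task-level CPU units (0.25 vCPU)
    "memory": "512",           # task-level memory in MiB
    "executionRoleArn": "arn:aws:iam::123456789012:role/demoExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/demoTaskRole",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "nginx:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "lab"}],
        }
    ],
}

task_definition_json = json.dumps(task_definition, indent=2)
print(task_definition_json)
```

Saved to a file, JSON of this shape could be registered with `aws ecs register-task-definition --cli-input-json file://taskdef.json`.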

Part 12 — ECS Service

A service keeps the desired number of tasks running.

If a container crashes and its task stops:

The ECS service launches a replacement task automatically.

This is real production behavior.
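The behavior described above is a reconciliation loop: compare the desired count with what is actually running and start replacements. The following is a toy simulation of that idea, not the real ECS scheduler. The class and task names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyService:
    """Toy model of an ECS service (illustrative only)."""
    desired_count: int
    running: List[str] = field(default_factory=list)
    launched: int = 0

    def reconcile(self) -> int:
        """Start tasks until running == desired. Returns how many were started."""
        started = 0
        while len(self.running) < self.desired_count:
            self.launched += 1
            self.running.append(f"task-{self.launched}")
            started += 1
        return started

    def crash(self, task_id: str) -> None:
        """Simulate a task stopping because its container crashed."""
        self.running.remove(task_id)


service = ToyService(desired_count=2)
service.reconcile()              # initial placement: task-1 and task-2
service.crash("task-1")          # one task dies
replaced = service.reconcile()   # the service starts task-3 to compensate
print(service.running, replaced)
```

Note that the crashed task is not revived. A new task with a new identity takes its place, which is why nothing important should be stored inside a task.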

Part 13 — Load Balancer vs ECS

Many beginners confuse this.

Load Balancer (ALB)

An ALB (Application Load Balancer) ONLY distributes traffic across healthy targets.

ALB does NOT:

run applications
restart containers
deploy apps

ALB routes traffic.
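As a rough illustration of what "distributes traffic" means, here is a sketch of round-robin routing, which is the default algorithm for ALB target groups. The target IP addresses are hypothetical, and a real ALB also performs health checks, TLS termination, and rule-based routing that this sketch ignores.

```python
from collections import Counter
from itertools import cycle

# Hypothetical private IPs of three running tasks registered as targets.
targets = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]


def route_requests(request_count, healthy_targets):
    """Assign each request to the next healthy target in turn."""
    rotation = cycle(healthy_targets)
    return [next(rotation) for _ in range(request_count)]


assignments = route_requests(6, targets)
per_target = Counter(assignments)
print(per_target)
```

The load balancer only chooses where each request goes. Starting, stopping, and replacing the targets is the job of ECS.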

ECS

ECS actually:

launches containers
restarts containers
scales tasks
deploys revisions

Part 14 — Real Production Architecture

Simple Architecture

Users
↓
Public IP
↓
EC2
↓
Docker container

Good for small applications.

Better Production Architecture

Users
↓
CloudFront CDN
↓
ALB
↓
ECS Fargate
↓
Containers

Part 15 — Why CloudFront Exists

CloudFront is a CDN (content delivery network).

It distributes content globally.

Without CDN:

All users hit one region.

Example:

Only us-east-1.

Users in Asia experience higher latency.

CloudFront caches content at edge locations closer to users.
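The caching idea can be sketched in a few lines. This is a toy model of a single edge location, assuming a dictionary stands in for the origin server. It ignores TTLs, invalidation, and cache keys, all of which a real CDN handles.

```python
class EdgeCache:
    """Toy edge location: serve from cache when possible, else go to origin."""

    def __init__(self, origin):
        self.origin = origin        # dict standing in for the origin server
        self.cache = {}
        self.origin_fetches = 0

    def get(self, path):
        if path not in self.cache:  # cache miss: one slow trip to the origin
            self.origin_fetches += 1
            self.cache[path] = self.origin[path]
        return self.cache[path]     # cache hit: answered near the user


origin = {"/index.html": "<h1>hello</h1>"}   # hypothetical content
edge = EdgeCache(origin)
for _ in range(3):
    body = edge.get("/index.html")
print(edge.origin_fetches)
```

Three requests reach the edge, but only the first one travels to the origin region.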

Part 16 — What Is SRE?

SRE = Site Reliability Engineering.

SRE focuses on:

uptime
monitoring
reliability
scaling
observability
alerting
automation
troubleshooting

Part 17 — What Is Observability?

Observability means:

Understanding system behavior.

Three major pillars:

Metrics
Logs
Traces

Part 18 — Metrics vs Logs

Metrics

Metrics answer:

"WHAT is wrong?"

Examples:

CPU 90%
Memory 85%
Request latency
Error rate

Metrics are numerical time-series data.

Logs

Logs answer:

"WHY is it wrong?"

Examples:

database timeout
authentication failure
stack trace
nginx 500 error

Logs are text.

Part 19 — Prometheus

Prometheus collects and stores metrics.

Prometheus is:

A time-series database that pulls (scrapes) metrics from targets over HTTP.

Prometheus stores:

CPU history
Memory history
Request history
Error history
Latency history

Prometheus is queried with:

PromQL (Prometheus Query Language).
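To show what "time-series" means in practice, here is a small sketch that stores samples per series and computes an average per-second increase. In PromQL the closest equivalent would be something like `rate(http_requests_total[1m])`, although the real `rate()` also handles counter resets and extrapolation. The metric name, label, and sample values are hypothetical.

```python
from collections import defaultdict

# (metric name, sorted label pairs) -> list of (unix_seconds, value)
series = defaultdict(list)


def record(name, labels, timestamp, value):
    """Append one scraped sample to its series."""
    key = (name, tuple(sorted(labels.items())))
    series[key].append((timestamp, value))


def simple_rate(name, labels):
    """Average per-second increase between the first and last sample."""
    samples = series[(name, tuple(sorted(labels.items())))]
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15 seconds.
for ts, value in [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]:
    record("http_requests_total", {"service": "app"}, ts, value)

requests_per_second = simple_rate("http_requests_total", {"service": "app"})
print(requests_per_second)
```

Each unique combination of metric name and labels is its own series, which is why adding high-cardinality labels makes storage grow quickly.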

Part 20 — Grafana

Grafana visualizes telemetry.

Grafana itself does NOT store:

metrics
logs

Grafana reads from:

Prometheus
Loki
Tempo
CloudWatch
Elasticsearch

Grafana provides:

dashboards
alerts
graphs
log search

Part 21 — Loki

Loki stores logs centrally.

Instead of:

logging into every machine,

Loki centralizes logs.

All systems send logs into Loki.

Grafana can search them.
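For a sense of what "sending logs into Loki" looks like on the wire, the sketch below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`). It only constructs the payload and does not send anything. The label names and log lines are hypothetical.

```python
import json
import time


def build_loki_push(labels, lines, now_ns=None):
    """Build the JSON body for Loki's push endpoint."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_loki_push(
    {"service": "app", "env": "lab"},   # hypothetical labels
    ["GET /health 200", "database timeout after 5s"],
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

In practice an agent such as Alloy builds and sends these requests for you. Loki indexes only the labels, not the log text, so labels should stay few and low-cardinality.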

Part 22 — Alloy

Alloy (Grafana Alloy) is a telemetry pipeline agent.

Alloy can:

collect metrics
collect logs
collect traces
forward telemetry

Alloy sends data to:

Prometheus (metrics, via remote write)
Loki (logs)
Tempo (traces)
Grafana Cloud

Important:

Alloy is NOT a storage backend.

Alloy transports telemetry.

Part 23 — Node Exporter

Node Exporter exposes Linux host metrics.

Examples:

CPU usage
RAM usage
Disk usage
Filesystem metrics
Network metrics

Node Exporter exposes a metrics endpoint (default port 9100):

/metrics

Prometheus scrapes it.
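The `/metrics` endpoint returns plain text in the Prometheus exposition format. The sketch below parses a small sample of that format. The metric names are real Node Exporter metrics, but the values are invented, and the parser is deliberately simplified: it ignores optional timestamps and label escaping.

```python
SAMPLE = """\
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 1.2e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
"""


def parse_metrics(text):
    """Parse simple exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP, and TYPE
            continue
        series_name, value = line.rsplit(" ", 1)
        samples[series_name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics)
```

Prometheus performs this kind of fetch-and-parse on every scrape interval and attaches the scrape timestamp to each sample.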

Part 24 — Why Node Exporter Is Different In Fargate

In EC2:

You control Linux host.

You can install Node Exporter.

In Fargate:

AWS hides:

host OS
kernel
hardware
hypervisor

So you cannot install Node Exporter on the Fargate host.

Instead we use:

ECS task metadata endpoint (task-level CPU, memory, and network stats)
Alloy
OpenTelemetry
CloudWatch metrics
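One of the substitutes for host metrics is the ECS task metadata endpoint. On Fargate platform version 1.4.0 and later, ECS injects the `ECS_CONTAINER_METADATA_URI_V4` environment variable into each container, and the `/task/stats` path under it returns per-container resource stats. The sketch below is an assumption about how a collector might read it, not the lab's actual setup. The example URL passed at the end is made up.

```python
import json
import os
import urllib.request


def task_stats_url(env=os.environ):
    """Return the task stats URL, or None when not running on ECS."""
    base = env.get("ECS_CONTAINER_METADATA_URI_V4")
    return f"{base}/task/stats" if base else None


def fetch_task_stats(timeout=2.0):
    """Fetch per-container stats for the current task (only works on ECS)."""
    url = task_stats_url()
    if url is None:
        return None                       # e.g. running on a laptop
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Hypothetical value, just to show how the URL is built.
example_url = task_stats_url(
    {"ECS_CONTAINER_METADATA_URI_V4": "http://169.254.170.2/v4/example"}
)
print(example_url)
```

These stats describe the task's own containers only. The underlying host remains invisible, which is the whole point of Fargate.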

Part 25 — Sidecar Containers

Sidecar means:

A second container running inside the same task (ECS) or pod (Kubernetes).

Example:

application container
alloy sidecar

Why?

Sidecar can:

collect local logs
collect metrics
forward telemetry
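In ECS terms, a sidecar is simply a second entry in the task definition's `containerDefinitions`. The fragment below sketches an application container and an Alloy sidecar that share a task-scoped volume so the sidecar can read the application's log files. The application image, volume name, and paths under `/var/log` are hypothetical. The Alloy command and config path are the ones mentioned in this lecture.

```python
import json

# Both containers live in ONE task and share a volume.
container_definitions = [
    {
        "name": "app",
        "image": "example/app:latest",   # hypothetical application image
        "essential": True,
        "mountPoints": [
            {"sourceVolume": "app-logs", "containerPath": "/var/log/app"}
        ],
    },
    {
        "name": "alloy-sidecar",
        "image": "grafana/alloy:latest",
        "essential": False,   # design choice: a sidecar crash does not stop the app
        # The config file must exist at this path inside the container.
        "command": ["run", "/etc/alloy/fargate.alloy"],
        "mountPoints": [
            {
                "sourceVolume": "app-logs",
                "containerPath": "/var/log/app",
                "readOnly": True,
            }
        ],
    },
]
volumes = [{"name": "app-logs"}]   # ephemeral storage shared inside the task

sidecar_fragment = {"containerDefinitions": container_definitions, "volumes": volumes}
print(json.dumps(sidecar_fragment, indent=2))
```

Because both containers are in the same task, they are scheduled, started, and stopped together and can also reach each other over localhost.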

Part 26 — Why We Separate Services

Real production architecture separates:

application
Prometheus
Grafana
Loki
Alloy

Why?

Different scaling requirements.

Different CPU usage.

Different memory usage.

Avoid single point of failure.

Part 27 — Final Production Architecture

Internet
↓
CloudFront
↓
ALB
↓
ECS Fargate Application

Telemetry leaves the application on two separate paths:

Metrics → Prometheus → Grafana dashboards
Logs → Alloy → Loki → Grafana dashboards

Part 28 — IAM Roles In ECS

Two important roles.

Task Execution Role

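Used by the ECS agent and Fargate infrastructure, not by your application code.

Allows:

pulling container images
sending container logs to CloudWatch Logs
other startup actions, such as fetching secrets referenced in the task definition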

Task Role

Used by the application code running inside the container.

Allows application access to:

S3
DynamoDB
Secrets Manager
AWS APIs
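The two roles differ in their permissions, but both need the same trust policy so that ECS tasks can assume them. The sketch below shows that trust policy plus an example permissions policy for a task role. The bucket name is a hypothetical placeholder.

```python
import json

# Trust policy: lets ECS tasks assume the role. Both the task execution
# role and the task role use this same trust relationship.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Example permissions for a TASK ROLE: the application may read objects
# from one bucket. The bucket name is hypothetical.
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::demo-app-bucket/*",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(task_role_policy, indent=2))
```

For the task execution role, the AWS managed policy `AmazonECSTaskExecutionRolePolicy` is the usual starting point, since it covers image pulls from ECR and writing to CloudWatch Logs.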

Part 29 — Security Groups

Security Groups are stateful virtual firewalls: return traffic for an allowed connection is permitted automatically.

They control:

inbound traffic
outbound traffic

Example:

Allow:

HTTP 80
HTTPS 443
Grafana 3000
Prometheus 9090
Loki 3100
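A security group contains only allow rules, and anything that matches no rule is denied. The sketch below models inbound rule evaluation for the ports listed above. Restricting the observability ports to a VPC range is an assumption added here as a common practice, and the CIDR ranges and client addresses are hypothetical.

```python
import ipaddress

# Hypothetical inbound rules: (port, allowed source CIDR).
INBOUND_RULES = [
    (80, "0.0.0.0/0"),       # HTTP from anywhere
    (443, "0.0.0.0/0"),      # HTTPS from anywhere
    (3000, "10.0.0.0/16"),   # Grafana only from inside the VPC
    (9090, "10.0.0.0/16"),   # Prometheus only from inside the VPC
    (3100, "10.0.0.0/16"),   # Loki only from inside the VPC
]


def is_allowed(port, source_ip, rules=INBOUND_RULES):
    """Return True if any rule allows this port from this source address."""
    address = ipaddress.ip_address(source_ip)
    return any(
        port == rule_port and address in ipaddress.ip_network(cidr)
        for rule_port, cidr in rules
    )


print(is_allowed(443, "203.0.113.7"))    # public client to HTTPS: True
print(is_allowed(9090, "203.0.113.7"))   # public client to Prometheus: False
print(is_allowed(9090, "10.0.2.15"))     # in-VPC client to Prometheus: True
```

Real security group rules can also reference another security group as the source, which is often cleaner than CIDR ranges for service-to-service traffic.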

Part 30 — ECS Troubleshooting Learned In Lab

Real troubleshooting scenarios encountered:

Image Pull Failure

CannotPullContainerError
403 Forbidden

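Cause:

The task could not authenticate to a private registry (missing pull permissions or credentials).

Fix used in the lab:

Use a public container image. In production, give the task execution role access to the private registry instead (ECR permissions, or registry credentials stored in Secrets Manager).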

Deployment Rollback

ECS deployment rollback failed.

Cause:

Containers failing during deployment.

Alloy Command Parsing Issue

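Wrong command (entered as one space-separated string, so ECS passes it as a single argument):

run /etc/alloy/fargate.alloy

Correct ECS syntax (comma-separated in the console, which becomes an array of separate arguments):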

run,/etc/alloy/fargate.alloy
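The difference is easier to see as the argument array each form produces. The sketch below assumes the console field is split on commas, which is what the fix in the lab suggests. The function name is made up for illustration.

```python
import json


def console_command_to_array(text):
    """Split a comma-separated command field into an argument list."""
    return [part.strip() for part in text.split(",")]


wrong = console_command_to_array("run /etc/alloy/fargate.alloy")
right = console_command_to_array("run,/etc/alloy/fargate.alloy")

print(json.dumps(wrong))   # ["run /etc/alloy/fargate.alloy"]
print(json.dumps(right))   # ["run", "/etc/alloy/fargate.alloy"]
```

With the wrong form, Alloy receives one argument containing a space and does not recognize it as the `run` subcommand followed by a config path. In a JSON task definition the correct form is written directly as `"command": ["run", "/etc/alloy/fargate.alloy"]`.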

Missing Config File

Alloy failed because:

/etc/alloy/fargate.alloy

was not mounted into the container, so the path given in the command did not exist.

Part 31 — Real SRE Workflow

Real workflow:

Deploy
↓
Observe failure
↓
Read logs
↓
Find root cause
↓
Fix configuration
↓
Redeploy
↓
Validate telemetry

This is real production engineering.

Part 32 — Why Metrics And Logs Together Matter

Metrics tell you:

WHAT is wrong.

Logs tell you:

WHY it is wrong.

Example:

Metrics:

CPU 95%

Logs:

Database timeout causing retries.

Together they explain outages.
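A minimal sketch of that correlation: find the moments where a metric crosses a threshold, then pull the log lines written around those moments. All timestamps, values, and log messages below are invented for illustration, and the threshold and window are arbitrary choices.

```python
from datetime import datetime, timedelta

# Hypothetical data for illustration only.
cpu_samples = [
    (datetime(2024, 5, 25, 10, 0), 40.0),
    (datetime(2024, 5, 25, 10, 1), 95.0),
    (datetime(2024, 5, 25, 10, 2), 45.0),
]
log_lines = [
    (datetime(2024, 5, 25, 9, 30, 0), "user login ok"),
    (datetime(2024, 5, 25, 10, 0, 40), "database timeout, retrying (attempt 3)"),
    (datetime(2024, 5, 25, 10, 1, 10), "database timeout, retrying (attempt 4)"),
]


def logs_near_spikes(samples, logs, threshold=90.0, window=timedelta(minutes=1)):
    """Return log lines within `window` of any sample above `threshold`."""
    spikes = [ts for ts, value in samples if value > threshold]
    return [
        line
        for ts, line in logs
        if any(abs(ts - spike) <= window for spike in spikes)
    ]


suspects = logs_near_spikes(cpu_samples, log_lines)
print(suspects)
```

This is what a Grafana dashboard does visually when a Prometheus panel and a Loki panel share the same time range: the metric shows when to look, and the logs show what happened.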

Part 33 — Why Modern Systems Need Observability

Modern systems are distributed.

Many:

containers
services
APIs
databases
networks

Without observability:

troubleshooting becomes guesswork.

Part 34 — What Students Learned In This Lab

Students learned:

ECS
Fargate
Task Definitions
Services
IAM Roles
Security Groups
CloudWatch Logs
Deployment failures
Rollbacks
Container troubleshooting
Sidecars
Metrics vs logs
Observability
Prometheus
Grafana
Loki
Alloy
Distributed systems thinking

Part 35 — Final Important Concepts

Docker

Packages application.

ECS/Kubernetes

Runs applications reliably.

ALB

Routes traffic.

CloudFront

Distributes content globally.

Prometheus

Stores metrics.

Loki

Stores logs.

Grafana

Visualizes telemetry.

Alloy

Collects/transports telemetry.

Node Exporter

Produces Linux host metrics.

Part 36 — Enterprise SRE Mindset

Modern SREs think about:

scalability
observability
automation
reliability
distributed systems
telemetry
infrastructure
deployment safety
failure recovery
centralized monitoring

This is the foundation of modern cloud-native engineering.