DEV Community

Aisalkyn Aidarova

Lecture: ECS Fargate, Prometheus, Grafana, Loki, Alloy, Node Exporter

Part 1 — Introduction

This lecture explains:

  • What cloud infrastructure really is
  • Differences between EC2, ECS, Fargate, Kubernetes, and load balancers
  • Why companies use cloud
  • Why SRE and DevOps engineers exist
  • What observability means
  • Difference between metrics and logs
  • Why we use Prometheus, Grafana, Loki, Alloy, and Node Exporter
  • Real production architecture
  • Real troubleshooting scenarios
  • ECS/Fargate deployment flow
  • Why sidecars exist
  • Why modern systems use centralized observability

This lecture is based on real troubleshooting and deployment scenarios.


Part 2 — What Is Really Happening In The Cloud?

Many beginners think:

"AWS is just hosting my website."

But in reality AWS provides:

  • Physical data centers
  • Servers
  • Networking
  • Internet routing
  • Storage
  • Virtualization
  • Hypervisors
  • Security
  • Scaling infrastructure
  • High availability
  • Global infrastructure

When you deploy an application to AWS:

You are renting compute resources from AWS.


Part 3 — Server vs Virtual Machine

Physical Server

A physical server is:

  • Real hardware
  • CPU
  • RAM
  • Storage
  • Network cards
  • Power supply

Inside AWS data centers there are thousands of physical servers.


Virtual Machine (VM)

A virtual machine is:

  • A software-defined computer
  • Runs on top of a physical server, through a hypervisor
  • Has virtual CPU
  • Has virtual RAM
  • Has virtual storage

One physical server can run many virtual machines.


Part 4 — Why Cloud Exists

Without cloud:

You would need:

  • Your own server room
  • Cooling
  • Electricity
  • Networking
  • Internet provider
  • Firewalls
  • Routers
  • Hardware replacement
  • OS patching
  • Security
  • Scaling

Cloud providers solve this.


Part 5 — Docker

Docker solves:

"How do we package applications consistently?"

A Docker container image includes:

  • Application
  • Libraries
  • Dependencies
  • Runtime
  • Configuration

The same container image can run consistently on:

  • a laptop
  • EC2
  • ECS
  • Kubernetes
  • any cloud

Part 6 — Why Docker Alone Is Not Enough

If you run only one container:

Docker alone may be enough.

But large systems need:

  • scaling
  • failover
  • networking
  • deployment automation
  • service discovery
  • self-healing
  • orchestration

This is why Kubernetes and ECS exist.


Part 7 — ECS vs Kubernetes

ECS

AWS-native container orchestrator.

Simpler.

Good integration with AWS.


Kubernetes

Industry-standard orchestration platform.

More powerful.

More complex.

Used heavily by large enterprises.


Part 8 — What Is ECS?

ECS = Elastic Container Service.

ECS does:

  • Runs containers
  • Restarts failed containers
  • Deploys applications
  • Scales containers
  • Handles networking
  • Manages task lifecycle

ECS is NOT:

  • Load balancer
  • Database
  • Monitoring system

Part 9 — What Is Fargate?

Fargate is a serverless compute engine for containers.

AWS manages:

  • servers
  • patching
  • hypervisor
  • scaling infrastructure
  • hardware
  • OS maintenance

You manage only:

  • containers
  • task definitions
  • services

Part 10 — ECS Cluster Does NOT Mean One Machine

Important concept.

ECS cluster is:

Logical grouping for tasks/services.

One ECS cluster can run:

  • many tasks
  • many services
  • many applications

Example:

  • app-service
  • grafana-service
  • prometheus-service
  • loki-service
  • alloy-service

all inside ONE ECS cluster.


Part 11 — ECS Task Definition

Task definition describes:

  • container image
  • CPU
  • memory
  • ports
  • environment variables
  • IAM roles
  • commands
  • networking

Task definition is like:

Blueprint/template for containers.
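
As a rough sketch of what such a blueprint looks like, the Python snippet below builds a Fargate-style task definition as a plain dictionary and prints it as JSON. The family name, image, port, environment variable, and role ARNs are placeholders invented for illustration; only the field names follow the ECS task definition JSON format.

```python
import json

# Hypothetical values throughout: names, image, and ARNs are placeholders.
task_definition = {
    "family": "demo-app",
    "networkMode": "awsvpc",                 # required for Fargate tasks
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",                            # task-level CPU units (0.25 vCPU)
    "memory": "512",                         # task-level memory in MiB
    "executionRoleArn": "arn:aws:iam::123456789012:role/demo-execution-role",
    "taskRoleArn": "arn:aws:iam::123456789012:role/demo-task-role",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "nginx:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "lab"}],
        }
    ],
}

# Serialize to the JSON form that a task definition is registered in.
task_definition_json = json.dumps(task_definition, indent=2)
print(task_definition_json)
```

Every item from the list above (image, CPU, memory, ports, environment variables, IAM roles, networking) maps to one field in this structure.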


Part 12 — ECS Service

A service keeps the desired number of tasks running.

If an essential container crashes:

The task stops, and the ECS service launches a replacement task automatically.

This is real production behavior.
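
The "keep tasks alive" behavior can be pictured as a reconciliation loop: compare the number of running tasks with the desired count and start replacements until they match. The toy model below is only an illustration of that idea, not how ECS is implemented; the `ToyService` class and the task IDs are made up.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyService:
    """A toy stand-in for a service scheduler (hypothetical, for illustration)."""
    desired_count: int
    running: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def reconcile(self):
        """Start replacement tasks until running matches desired_count."""
        started = []
        while len(self.running) < self.desired_count:
            task_id = f"task-{next(self._ids)}"
            self.running.add(task_id)
            started.append(task_id)
        return started

    def crash(self, task_id):
        """Simulate a task stopping unexpectedly."""
        self.running.discard(task_id)


service = ToyService(desired_count=2)
first = service.reconcile()      # initial tasks are started
service.crash("task-1")          # one task dies
replaced = service.reconcile()   # the scheduler starts a replacement
print(first, replaced, sorted(service.running))
```

Note that the crashed task is not revived. A new task with a new ID takes its place, which matches how replacement tasks appear in the ECS console.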


Part 13 — Load Balancer vs ECS

Many beginners confuse this.

Load Balancer (ALB)

ALB ONLY distributes traffic.

ALB does NOT:

  • run applications
  • restart containers
  • deploy apps

ALB routes traffic.
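
To make "distributes traffic" concrete, here is a minimal round-robin sketch in Python. Round robin is the default ALB routing algorithm, but this function is a simplification for illustration only; the request names and target IPs are invented.

```python
from itertools import cycle


def distribute(requests, targets):
    """Assign each request to a target in round-robin order."""
    rotation = cycle(targets)
    return [(request, next(rotation)) for request in requests]


# Hypothetical requests and task IP addresses.
assignments = distribute(
    ["req-1", "req-2", "req-3", "req-4"],
    ["10.0.1.10", "10.0.2.10"],
)
for request, target in assignments:
    print(request, "->", target)
```

Nothing in this logic starts, stops, or repairs a target. That work belongs to ECS.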


ECS

ECS actually:

  • launches containers
  • restarts containers
  • scales tasks
  • deploys revisions

Part 14 — Real Production Architecture

Simple Architecture

Users → Public IP → EC2 → Docker container

Good for small applications.


Better Production Architecture

Users → CloudFront CDN → ALB → ECS Fargate → Containers


Part 15 — Why CloudFront Exists

CloudFront is a CDN (content delivery network).

It distributes content globally.

Without CDN:

All users hit one region.

Example:

Only us-east-1.

Users in Asia experience latency.

CloudFront caches content at edge locations closer to users.


Part 16 — What Is SRE?

SRE = Site Reliability Engineering.

SRE focuses on:

  • uptime
  • monitoring
  • reliability
  • scaling
  • observability
  • alerting
  • automation
  • troubleshooting

Part 17 — What Is Observability?

Observability means:

Understanding system behavior.

Three major pillars:

  1. Metrics
  2. Logs
  3. Traces

Part 18 — Metrics vs Logs

Metrics

Metrics answer:

"WHAT is wrong?"

Examples:

  • CPU 90%
  • Memory 85%
  • Request latency
  • Error rate

Metrics are numerical time-series data.


Logs

Logs answer:

"WHY is it wrong?"

Examples:

  • database timeout
  • authentication failure
  • stack trace
  • nginx 500 error

Logs are timestamped text records.


Part 19 — Prometheus

Prometheus collects and stores metrics.

Prometheus is:

A time-series database that scrapes (pulls) metrics over HTTP.

Prometheus stores:

  • CPU history
  • Memory history
  • Request history
  • Error history
  • Latency history

Prometheus is queried with:

PromQL (Prometheus Query Language).
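
To show what "time-series data" means in practice, the sketch below computes a per-second rate from a handful of counter samples, in the spirit of the PromQL `rate()` function. It is a deliberate simplification: the real `rate()` also handles counter resets and extrapolates to the window boundaries. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("need samples at two different timestamps")
    return (v1 - v0) / (t1 - t0)


# Hypothetical request counter scraped every 15 seconds: (seconds, total requests).
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 220.0)]

rate = simple_rate(samples)
print(f"{rate} requests per second over the last minute")
```

The counter itself only ever goes up. The useful signal (requests per second) comes from comparing samples over time, which is why the history matters.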


Part 20 — Grafana

Grafana visualizes telemetry.

Grafana itself does NOT store:

  • metrics
  • logs

Grafana reads from:

  • Prometheus
  • Loki
  • Tempo
  • CloudWatch
  • Elasticsearch

Grafana provides:

  • dashboards
  • alerts
  • graphs
  • log search

Part 21 — Loki

Loki stores logs centrally.

Instead of:

logging into every machine,

Loki centralizes logs.

All systems send logs into Loki.

Grafana can search them.


Part 22 — Alloy

Alloy (Grafana Alloy) is a telemetry pipeline agent.

Alloy can:

  • collect metrics
  • collect logs
  • collect traces
  • forward telemetry

Alloy sends data to:

  • Prometheus
  • Loki
  • Tempo
  • Grafana Cloud

Important:

Alloy is NOT main storage.

Alloy transports telemetry.


Part 23 — Node Exporter

Node Exporter exposes Linux host metrics.

Examples:

  • CPU usage
  • RAM usage
  • Disk usage
  • Filesystem metrics
  • Network metrics

Node Exporter serves these metrics on an HTTP endpoint (port 9100 by default):

/metrics

Prometheus scrapes it at a regular interval.
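
The /metrics endpoint returns plain text in the Prometheus exposition format. The sketch below parses a small sample of that format in Python. The sample values are invented, and the parser is intentionally minimal: it assumes no trailing timestamps and does not split labels into separate fields.

```python
# Invented sample of what a /metrics response can look like.
SAMPLE = """\
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.147483648e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_filesystem_avail_bytes{device="/dev/xvda1",fstype="ext4",mountpoint="/"} 1.073741824e+10
"""


def parse_metrics(text):
    """Map each series (name plus labels) to its float value, skipping comments."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        series, value = line.rsplit(" ", 1)
        parsed[series] = float(value)
    return parsed


parsed = parse_metrics(SAMPLE)
for series, value in parsed.items():
    print(series, "=", value)
```

Each non-comment line is one sample: a metric name, optional labels in braces, and a numeric value.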


Part 24 — Why Node Exporter Is Different In Fargate

In EC2:

You control Linux host.

You can install Node Exporter.


In Fargate:

AWS hides:

  • host OS
  • kernel
  • hardware
  • hypervisor

So you cannot install Node Exporter on the Fargate host.

Instead we use:

  • ECS task-level telemetry (for example, the task metadata endpoint)
  • Alloy
  • OpenTelemetry
  • CloudWatch metrics

Part 25 — Sidecar Containers

Sidecar means:

A second container running alongside the main one inside the same task (ECS) or pod (Kubernetes).

Example:

  • application container
  • alloy sidecar

Why?

Sidecar can:

  • collect local logs
  • collect metrics
  • forward telemetry
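
In ECS terms, a sidecar is simply a second entry in the task definition's container list. The sketch below shows that shape as Python data. The container names and the image tags are placeholders, and marking the sidecar as non-essential is an assumption made for this example, not a rule.

```python
# Hypothetical two-container task: names and images are placeholders.
container_definitions = [
    {
        "name": "app",
        "image": "nginx:latest",
        "essential": True,
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
    },
    {
        "name": "alloy-sidecar",
        "image": "grafana/alloy:latest",
        "essential": False,  # assumption: a sidecar crash should not stop the app
        "command": ["run", "/etc/alloy/fargate.alloy"],
    },
]


def sidecars(definitions, main="app"):
    """Return the names of every container except the main application."""
    return [c["name"] for c in definitions if c["name"] != main]


print(sidecars(container_definitions))
```

Containers in the same Fargate task share a network namespace, so the sidecar can reach the application on localhost.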

Part 26 — Why We Separate Services

Real production architecture separates:

  • application
  • Prometheus
  • Grafana
  • Loki
  • Alloy

Why?

Different scaling requirements.

Different CPU usage.

Different memory usage.

Avoid single point of failure.


Part 27 — Final Production Architecture

Internet → CloudFront → ALB → ECS Fargate Application

Metrics → Prometheus

Logs → Alloy → Loki

Grafana dashboards


Part 28 — IAM Roles In ECS

Two important roles.

Task Execution Role

Used by the ECS agent and Fargate infrastructure, not by your application code.

Allows:

  • pulling images
  • CloudWatch logs
  • ECS startup actions

Task Role

Used by application container.

Allows application access to:

  • S3
  • DynamoDB
  • Secrets Manager
  • AWS APIs
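
The split between the two roles can be sketched as IAM policy documents built in Python. Both roles are assumed by the `ecs-tasks.amazonaws.com` service principal; what differs is the permissions attached. The action lists below are illustrative picks, not a complete or recommended policy, and `Resource: "*"` is used only to keep the example short.

```python
import json

# Trust policy: lets ECS tasks assume the role. Both roles use this principal.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Illustrative permission split; the exact actions depend on the workload.
execution_role_actions = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
]
task_role_actions = [
    "s3:GetObject",
    "dynamodb:GetItem",
    "secretsmanager:GetSecretValue",
]


def permissions_policy(actions, resource="*"):
    """Build a minimal IAM policy document allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": resource}],
    }


print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy(execution_role_actions), indent=2))
print(json.dumps(permissions_policy(task_role_actions), indent=2))
```

A real policy should narrow `Resource` to the specific buckets, tables, and secrets the application needs.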

Part 29 — Security Groups

Security Groups are virtual firewalls.

They control:

  • inbound traffic
  • outbound traffic

Example:

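Allow (for the observability ports, restrict the source to trusted IPs or security groups rather than the whole internet):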

HTTP 80
HTTPS 443
Grafana 3000
Prometheus 9090
Loki 3100
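
A security group's inbound rules behave like an allow-list: traffic is permitted only if some rule matches its port and source. The sketch below models that check in Python. The CIDR ranges are assumptions chosen for the example (public web ports, observability ports limited to a hypothetical 10.0.0.0/16 VPC range).

```python
import ipaddress

# Hypothetical inbound rules: (port, allowed source CIDR).
INBOUND_RULES = [
    (80, "0.0.0.0/0"),       # HTTP from anywhere
    (443, "0.0.0.0/0"),      # HTTPS from anywhere
    (3000, "10.0.0.0/16"),   # Grafana, internal only
    (9090, "10.0.0.0/16"),   # Prometheus, internal only
    (3100, "10.0.0.0/16"),   # Loki, internal only
]


def is_allowed(source_ip, port, rules=INBOUND_RULES):
    """Return True if any rule matches both the port and the source address."""
    address = ipaddress.ip_address(source_ip)
    return any(
        port == rule_port and address in ipaddress.ip_network(cidr)
        for rule_port, cidr in rules
    )


print(is_allowed("203.0.113.7", 443))   # public client, HTTPS
print(is_allowed("203.0.113.7", 3000))  # public client, Grafana
print(is_allowed("10.0.5.20", 9090))    # internal client, Prometheus
```

Anything that matches no rule is denied, which is why a missing rule is a common cause of "connection timed out" in labs like this one.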


Part 30 — ECS Troubleshooting Learned In Lab

Real troubleshooting scenarios encountered:

Image Pull Failure

CannotPullContainerError
403 Forbidden

Cause:

Private registry permissions.

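Fix:

In the lab, switch to a public container image. For a private registry, give the task execution role permission to pull (ECR) or configure registry credentials in the task definition.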


Deployment Rollback

ECS deployment rollback failed.

Cause:

Containers failing during deployment.


Alloy Command Parsing Issue

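Wrong command (entered as one space-separated string):

run /etc/alloy/fargate.alloy

Correct comma-separated form for the ECS console command field, which becomes the JSON array ["run", "/etc/alloy/fargate.alloy"]:

run,/etc/alloy/fargate.alloy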
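
The difference between the two forms is easier to see as arrays. The Python sketch below mimics how a comma-delimited command field maps onto the task definition's `command` array. The function is a hypothetical stand-in written for this example, not AWS code.

```python
import json


def console_command_to_array(field_value):
    """Split a comma-separated command field into a list of arguments."""
    return [part.strip() for part in field_value.split(",") if part.strip()]


wrong = console_command_to_array("run /etc/alloy/fargate.alloy")
right = console_command_to_array("run,/etc/alloy/fargate.alloy")

print(json.dumps(wrong))  # one argument containing a space
print(json.dumps(right))  # two separate arguments
```

With the space-separated form, the container receives a single argument, "run /etc/alloy/fargate.alloy", which Alloy does not recognize as the `run` subcommand followed by a path.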


Missing Config File

Alloy failed because:

/etc/alloy/fargate.alloy

was not mounted.


Part 31 — Real SRE Workflow

Real workflow:

Deploy

Observe failure

Read logs

Find root cause

Fix configuration

Redeploy

Validate telemetry

This is real production engineering.


Part 32 — Why Metrics And Logs Together Matter

Metrics tell:

WHAT is wrong.

Logs tell:

WHY it is wrong.

Example:

Metrics:

CPU 95%

Logs:

Database timeout causing retries.

Together they explain outages.
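
A minimal sketch of that correlation in Python: find the moments where a metric crosses a threshold, then pull the log lines written close to those moments. All timestamps, values, and messages below are invented, and the one-minute window is an arbitrary choice for the example.

```python
from datetime import datetime, timedelta

# Invented data: (timestamp, cpu_percent) and (timestamp, message).
metrics = [
    (datetime(2024, 1, 1, 12, 0), 40.0),
    (datetime(2024, 1, 1, 12, 1), 95.0),
    (datetime(2024, 1, 1, 12, 2), 45.0),
]
logs = [
    (datetime(2024, 1, 1, 11, 30), "user login succeeded"),
    (datetime(2024, 1, 1, 12, 0, 40), "database timeout, retrying query"),
    (datetime(2024, 1, 1, 12, 1, 10), "database timeout, retrying query"),
]


def logs_near_spikes(metrics, logs, threshold=90.0, window=timedelta(minutes=1)):
    """Return log entries written within `window` of any metric spike."""
    spikes = [ts for ts, value in metrics if value >= threshold]
    return [
        (ts, message)
        for ts, message in logs
        if any(abs(ts - spike) <= window for spike in spikes)
    ]


related = logs_near_spikes(metrics, logs)
for ts, message in related:
    print(ts.isoformat(), message)
```

Grafana does the same thing visually when a Prometheus panel and a Loki panel share one time range on a dashboard.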


Part 33 — Why Modern Systems Need Observability

Modern systems are distributed.

Many:

  • containers
  • services
  • APIs
  • databases
  • networks

Without observability:

troubleshooting becomes impossible.


Part 34 — What Students Learned In This Lab

Students learned:

  • ECS
  • Fargate
  • Task Definitions
  • Services
  • IAM Roles
  • Security Groups
  • CloudWatch Logs
  • Deployment failures
  • Rollbacks
  • Container troubleshooting
  • Sidecars
  • Metrics vs logs
  • Observability
  • Prometheus
  • Grafana
  • Loki
  • Alloy
  • Distributed systems thinking

Part 35 — Final Important Concepts

Docker

Packages application.


ECS/Kubernetes

Runs applications reliably.


ALB

Routes traffic.


CloudFront

Distributes globally.


Prometheus

Stores metrics.


Loki

Stores logs.


Grafana

Visualizes telemetry.


Alloy

Collects/transports telemetry.


Node Exporter

Produces Linux host metrics.


Part 36 — Enterprise SRE Mindset

Modern SREs think about:

  • scalability
  • observability
  • automation
  • reliability
  • distributed systems
  • telemetry
  • infrastructure
  • deployment safety
  • failure recovery
  • centralized monitoring

This is the foundation of modern cloud-native engineering.
