Aisalkyn Aidarova

Posted on May 28

ECS + FARGATE + CONTAINERIZATION + OBSERVABILITY + PRODUCTION ARCHITECTURE

#architecture #aws #docker #monitoring

Today we are going to understand how real production systems work inside AWS using:

* Docker
* ECS
* Fargate
* ALB
* Prometheus
* Grafana
* Loki
* Alloy
* Node Exporter

This is not just theory.

This is how real companies build scalable cloud applications.

# SECTION 1 — HOW APPLICATIONS WERE RUN BEFORE CLOUD

Years ago, applications were installed directly on servers.

For example, one server contained:

* Java application
* MySQL
* Apache
* Redis
* Monitoring tools

Everything was installed directly into the Linux OS.

Example:

```bash
# Package names as found on current Debian/Ubuntu releases
sudo apt install mysql-server
sudo apt install apache2
sudo apt install python3
```

---

# Problems With Traditional Servers

Imagine a company has:

* 100 developers
* 20 applications
* thousands of users

Now problems start.

---

# Problem 1 — Dependency Conflicts

One application requires:

```text
Python 2.7
```

Another application requires:

```text
Python 3.11
```

Both are installed system-wide on the same server, so their interpreters and libraries conflict. Upgrading for one application can break the other.

---

# Problem 2 — Hard Scaling

Traffic increases.

Company must:

* buy new servers
* install software again
* configure manually

Very slow.

---

# Problem 3 — “Works on My Machine”

Developer says:

```text
It works on my laptop.
```

But production fails.

Why?

Different:

* OS versions
* libraries
* dependencies
* configurations

---

# Problem 4 — Resource Waste

Application uses:

```text
10% CPU
```

But the whole server is reserved for it.

Huge waste.

---

# SECTION 2 — VIRTUAL MACHINES

Then virtualization came.

Tools:

* VMware
* VirtualBox
* Hyper-V

One physical server could run multiple VMs.

---

# What Is a Virtual Machine?

A VM is a software-based computer.

Each VM has:

* full OS
* kernel
* RAM
* storage

Architecture:

```text
Physical Server
      ↓
Hypervisor
      ↓
VM1
VM2
VM3
```

---

# VM Advantages

* isolation
* multiple OS
* better resource usage

---

# VM Problems

Still heavy.

Each VM requires:

* full operating system
* boot process
* kernel memory

Startup is slow, and each VM consumes a large amount of resources.

---

# SECTION 3 — CONTAINERIZATION REVOLUTION

Then containers changed the IT industry.

Most famous tool:

* Docker

---

# What Is a Container?

A container is an isolated, lightweight process started from an image that packages:

* application code
* dependencies
* runtime
* libraries
* binaries

BUT unlike a VM:

* containers share the host kernel

This makes containers lightweight.

---

# Container vs VM

| Virtual Machine | Container     |
| --------------- | ------------- |
| Full OS         | Shared kernel |
| Heavy           | Lightweight   |
| Slow startup    | Fast startup  |
| Large size      | Small size    |
| More resources  | Fewer resources |

---

# Example Without Docker

Developer installs:

```bash
sudo apt install nodejs
npm install
node app.js
```

The production server may have a different Node.js version or different libraries, so the same steps may fail there.

---

# Example With Docker

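Create a Dockerfile:

```dockerfile
# Pin the Node.js version so every environment uses the same runtime
FROM node:18

WORKDIR /app

# Copy the application source into the image
COPY . .

# Install dependencies inside the image
RUN npm install

# Requires a "start" script in package.json
CMD ["npm","start"]
```

Build image:

```bash
docker build -t myapp .
```

Run container:

```bash
# Map host port 3000 to container port 3000 (assumes the app listens on 3000)
docker run -p 3000:3000 myapp
```

Now the application behaves the same wherever the image runs, because its runtime and dependencies travel with it.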

---

# Main Idea of Containerization

Package the application once.

Run it anywhere.

---

# SECTION 4 — WHY CONTAINERS BECAME ESSENTIAL

Modern companies require:

* scalability
* automation
* fast deployment
* microservices
* CI/CD
* portability

Containers help with all of these.

---

# Real Example

Netflix-like architecture:

* frontend
* backend
* payments
* recommendation engine
* authentication

Each service runs independently.

Each service is containerized.

---

# SECTION 5 — WHY DOCKER ALONE IS NOT ENOUGH

Docker can:

* build images
* run containers

BUT a production environment has:

* hundreds of containers
* scaling
* failures
* networking
* deployments
* monitoring

Docker alone cannot manage all of this.

An orchestration platform is needed.

---

# SECTION 6 — WHAT IS ECS?

Amazon Elastic Container Service (ECS) is the AWS container orchestration service.

Main purpose:

Manage containers automatically in production.

---

# ECS Responsibilities

ECS:

* runs containers
* restarts failed containers
* scales applications
* deploys new versions
* manages networking
* integrates with ALB
* monitors health

---

# Important Concept

Docker CREATES containers.

ECS MANAGES containers.

---

# SECTION 7 — ECS CORE COMPONENTS

---

# 1. ECS Cluster

A logical grouping of the services and tasks you run (and, for the EC2 launch type, the instances they run on).

Example:

```text
production-cluster
```

Inside cluster:

* services
* tasks
* containers

---

# 2. Task Definition

Blueprint/template of application.

Contains:

* image
* CPU
* memory
* ports
* logging
* environment variables

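Example (simplified for illustration; a real task definition nests the image and ports under `containerDefinitions`):

```json
{
  "image": "nginx",
  "cpu": 256,
  "memory": 512,
  "port": 80
}
```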
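
For comparison, here is a minimal Python sketch of what a fuller Fargate task definition might look like. The field names follow the ECS task definition format. The family name `demo-app` and the helper `build_task_definition` are assumptions made for illustration only.

```python
import json


def build_task_definition(family: str, image: str, port: int) -> dict:
    """Return a dict shaped like an ECS task definition for Fargate."""
    return {
        "family": family,
        "networkMode": "awsvpc",                 # Fargate tasks use awsvpc networking
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",                            # task-level CPU units (0.25 vCPU)
        "memory": "512",                         # task-level memory in MiB
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "essential": True,
                "portMappings": [
                    {"containerPort": port, "protocol": "tcp"}
                ],
            }
        ],
    }


# "demo-app" is a hypothetical name used only for this example
task_def = build_task_definition("demo-app", "nginx", 80)
print(json.dumps(task_def, indent=2))
```

The printed JSON is the kind of document you would register with ECS. Logging, environment variables and IAM roles are left out to keep the sketch short.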

---

# 3. Task

A running copy of a task definition.

Think:

| Component       | Meaning          |
| --------------- | ---------------- |
| Task Definition | Recipe           |
| Task            | Running instance |

---

# 4. ECS Service

Maintains the desired number of tasks.

Example:

```text
Desired Count = 3
```

If a task crashes, ECS automatically starts a new one.

Self-healing.
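
The self-healing idea can be sketched as a small reconciliation loop. This is a toy model, not how ECS is implemented. The `reconcile` function and the `task-N` names are made up for illustration.

```python
import itertools

# Hypothetical ID generator standing in for ECS task IDs
_task_ids = itertools.count(1)


def reconcile(running: list[str], desired_count: int) -> list[str]:
    """Return the task list after one reconciliation pass."""
    tasks = list(running)
    while len(tasks) < desired_count:
        tasks.append(f"task-{next(_task_ids)}")   # start a replacement task
    while len(tasks) > desired_count:
        tasks.pop()                                # stop a surplus task
    return tasks


tasks = reconcile([], desired_count=3)     # initial deployment: 3 tasks
tasks.remove(tasks[0])                     # simulate one task crashing
tasks = reconcile(tasks, desired_count=3)  # the service heals back to 3
print(tasks)  # ['task-2', 'task-3', 'task-4']
```

The service never repairs the crashed task. It only compares the running count with the desired count and starts a fresh task to close the gap.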

---

# SECTION 8 — ECS LAUNCH TYPES

ECS supports TWO main ways to run containers in AWS.

---

# OPTION 1 — EC2 Launch Type

Architecture:

```text
ECS
   ↓
EC2 Instances
   ↓
Docker Containers
```

You manage:

* EC2 servers
* OS patching
* AMIs
* autoscaling groups
* capacity planning

---

# OPTION 2 — FARGATE Launch Type

AWS Fargate provides serverless compute for containers.

Architecture:

```text
ECS
   ↓
Fargate
   ↓
Containers
```

No EC2 management.

---

# SECTION 9 — WHAT EXACTLY IS FARGATE?

Important interview question.

Fargate is NOT orchestration.

ECS orchestrates.

Fargate provides infrastructure.

---

# Analogy

Imagine a restaurant.

ECS = restaurant manager
Fargate = kitchen infrastructure

Manager organizes work.

Kitchen provides place to cook.

---

# Why Need Fargate If ECS Exists?

Containers need:

* CPU
* RAM
* networking
* storage

Question:
Where will containers physically run?

Answer:

* EC2, or
* Fargate

---

# Without Fargate

YOU manage:

* EC2
* patching
* security
* autoscaling
* AMIs

---

# With Fargate

AWS manages infrastructure.

You only provide:

* image
* CPU
* memory

---

# SECTION 10 — ECS VS ALB

Application Load Balancer distributes incoming traffic.

---

# ALB Responsibilities

ALB:

* receives requests
* routes traffic
* SSL termination
* health checks
* load balancing

---

# ECS Responsibilities

ECS:

* runs containers
* scales containers
* restarts containers

---

# Difference

| ECS                     | ALB                |
| ----------------------- | ------------------ |
| Runs applications       | Routes requests    |
| Manages tasks           | Balances traffic   |
| Orchestrates containers | Handles HTTP/HTTPS |

---

# SECTION 11 — COMPLETE APPLICATION FLOW

Production architecture:

```text
User
 ↓
Route53
 ↓
ALB
 ↓
ECS Service
 ↓
Fargate Tasks
 ↓
Containers
 ↓
Database
```

---

# Flow Explanation

## Step 1

User opens:

```text
app.company.com
```

---

## Step 2

Route53 (DNS) resolves the domain to the ALB.

---

## Step 3

ALB receives request.

---

## Step 4

ALB forwards request to healthy ECS tasks.

---

## Step 5

Container processes request.

---

## Step 6

Response returned.
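
Step 4 can be sketched in a few lines of Python. This assumes round-robin routing, which is one of the algorithms an ALB can use. The IP addresses and the `healthy_targets` helper are hypothetical.

```python
from itertools import cycle


def healthy_targets(targets: dict[str, bool]) -> list[str]:
    """Keep only targets whose last health check passed."""
    return [addr for addr, healthy in targets.items() if healthy]


# Hypothetical private IPs of three Fargate tasks and their health status
targets = {
    "10.0.1.10:3000": True,
    "10.0.2.11:3000": False,   # failed its health check, receives no traffic
    "10.0.3.12:3000": True,
}

rotation = cycle(healthy_targets(targets))
routed = [next(rotation) for _ in range(4)]
print(routed)
# ['10.0.1.10:3000', '10.0.3.12:3000', '10.0.1.10:3000', '10.0.3.12:3000']
```

The unhealthy task is skipped entirely, and requests alternate between the two healthy ones.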

---

# SECTION 12 — WHY PRIVATE SUBNETS?

Best practice:

| Component | Location       |
| --------- | -------------- |
| ALB       | Public subnet  |
| ECS Tasks | Private subnet |
| Database  | Private subnet |

---

# Why?

Security.

Users should NEVER directly access containers.

Only the ALB is exposed publicly.

---

# SECTION 13 — ECR

Amazon Elastic Container Registry stores Docker images.

Like Docker Hub, but inside AWS.

---

# Deployment Flow



```text
Developer
   ↓
GitHub
   ↓
CI/CD Pipeline
   ↓
Docker Build
   ↓
Push to ECR
   ↓
Update ECS Service
   ↓
Deploy New Tasks
```
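
The "Push to ECR" and "Update ECS Service" steps both refer to the image by its registry URI. Here is a small sketch of how a pipeline might build that URI, assuming the standard ECR hostname format. The account ID, region, repository and tag are placeholders, and `ecr_image_uri` is a hypothetical helper.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build the image URI used for docker push and in the task definition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder values; a pipeline would typically tag with the commit SHA
uri = ecr_image_uri("123456789012", "us-east-1", "myapp", "3f2a9c1")
print(uri)  # 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:3f2a9c1
```

Tagging each build with a unique value (instead of reusing `latest`) makes it clear which image version each task is running.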

---

# SECTION 14 — HEALTH CHECKS

ALB checks:

```text
/health
```

If unhealthy:

* ALB stops routing traffic
* ECS replaces task

Self-healing.
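
On the application side, the health check only needs an endpoint that returns HTTP 200 when the app is fine. Below is a minimal sketch using Python's standard library. The handler name and the `ok` response body are assumptions, and the demo binds to a random local port instead of a real container port.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)      # 200 tells the load balancer "healthy"
        else:
            body = b"not found"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port; a real task would use its container port
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Play the role of the ALB and probe the endpoint once
url = f"http://127.0.0.1:{server.server_port}/health"
with urllib.request.urlopen(url) as resp:
    status = resp.status
print(status)

server.shutdown()
server.server_close()
```

A real health endpoint would usually also verify its critical dependencies before answering 200.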

---

# SECTION 15 — WHY OBSERVABILITY IS CRITICAL

Production systems fail.

Need answers:

* Why did the app crash?
* Why is it slow?
* Which task failed?
* Is CPU high?
* Are there errors in the logs?
* Is there a memory leak?

This is observability.

---

# Three Pillars of Observability

| Pillar  | Meaning            |
| ------- | ------------------ |
| Metrics | Numerical data     |
| Logs    | Application events |
| Traces  | Request journey    |

---

# SECTION 16 — OBSERVABILITY TOOLS

| Tool          | Purpose                |
| ------------- | ---------------------- |
| Prometheus    | Metrics collection     |
| Grafana       | Visualization          |
| Loki          | Logs storage           |
| Alloy         | Log collection         |
| Node Exporter | Infrastructure metrics |

---

# SECTION 17 — DEMO APPLICATION SERVICE

This is the actual business application.

Examples:

* ecommerce app
* API
* frontend
* backend

---

# Why Separate ECS Service?

Microservices architecture.

Example:



```text
frontend-service
backend-service
auth-service
payment-service
```

Each service is independently scalable.

---

# SECTION 18 — PROMETHEUS

Prometheus collects metrics.

---

# What Are Metrics?

Numbers representing system state.

Examples:

* CPU usage
* memory usage
* requests/sec
* latency
* error rates

---

# Prometheus Pull Model

Prometheus SCRAPES endpoints:

```text
/metrics
```

---

# Flow

```text
Prometheus
     ↓
Application /metrics endpoint
```
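
What Prometheus reads from `/metrics` is plain text. The sketch below renders two metrics in the Prometheus text exposition format. The metric names and the `render_metrics` helper are assumptions for illustration. Real applications usually rely on a Prometheus client library instead of formatting by hand.

```python
def render_metrics(requests_total: dict[str, int], in_flight: int) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    # One sample per status code, distinguished by the "status" label
    for status, count in sorted(requests_total.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    lines += [
        "# HELP http_requests_in_flight Requests currently being served.",
        "# TYPE http_requests_in_flight gauge",
        f"http_requests_in_flight {in_flight}",
    ]
    return "\n".join(lines) + "\n"


output = render_metrics({"200": 1024, "500": 3}, in_flight=2)
print(output)
```

Each scrape returns the current values. Prometheus stores them with a timestamp and builds the time series itself.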

---

# What DevOps Checks

## Targets UP/DOWN

Healthy?

---

## Scrape Errors

Metrics accessible?

---

## Query Performance

PromQL functioning?

---

## Storage Usage

Prometheus database healthy?

---

# SECTION 19 — NODE EXPORTER

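Prometheus Node Exporter exports OS-level metrics. On the EC2 launch type it reports on the container instance. On Fargate you have no access to the underlying host, so it can only report what is visible from inside its own task.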

---

# Metrics Collected

* CPU
* RAM
* disk
* filesystem
* network

---

# Flow

```text
Node Exporter
     ↓
Prometheus
     ↓
Grafana
```

---

# Why Important?

Application failures are often caused by:

* disk full
* CPU saturation
* memory exhaustion

Need infrastructure visibility.

---

# SECTION 20 — GRAFANA

Grafana visualizes metrics and logs.

---

# Important

Grafana DOES NOT collect data.

Prometheus collects.

Grafana visualizes.

---

# Grafana Reads

* Prometheus
* Loki

---

# What DevOps Checks

## Dashboards

CPU?
Memory?
Traffic?

---

## Alerts

* high latency
* high CPU
* application down

---

## Traffic Trends

Traffic increase patterns.

---

# SECTION 21 — LOKI

Grafana Loki stores logs centrally.

---

# Log Examples



```text
ERROR database timeout
INFO login successful
FATAL application crashed
```

---

# Why Loki?

Without centralized logging, you would need to connect to each container individually to read its logs.

Very bad in production.

---

# Loki Architecture

```text
Containers
     ↓
Alloy
     ↓
Loki
     ↓
Grafana
```

---

# SECTION 22 — ALLOY

Grafana Alloy collects and forwards logs/metrics.

---

# Alloy Responsibilities

Acts like:

* collector
* shipper
* forwarder

---

# Why Needed?

Containers generate logs continuously.

Loki does not collect logs by itself. It only stores what is pushed to it.

A collection agent is needed.

---

# Flow



```text
Container Logs
     ↓
Alloy
     ↓
Loki
```
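
Conceptually, a shipper sends Loki batches of log lines grouped into labeled streams. The sketch below builds a body in the JSON shape accepted by Loki's push endpoint (`/loki/api/v1/push`). The labels and the `loki_push_payload` helper are hypothetical, and the wire format Alloy actually uses may differ.

```python
import json
import time


def loki_push_payload(labels: dict[str, str], lines: list[str]) -> dict:
    """Build a body in the shape accepted by Loki's push endpoint."""
    now_ns = str(time.time_ns())  # nanosecond timestamps, sent as strings
    return {
        "streams": [
            {
                "stream": labels,                              # identifies the log stream
                "values": [[now_ns, line] for line in lines],  # [timestamp, line] pairs
            }
        ]
    }


payload = loki_push_payload(
    {"service": "demo-app", "level": "error"},   # hypothetical labels
    ["ERROR database timeout"],
)
print(json.dumps(payload, indent=2))
```

Labels such as `service` are what you later filter on in Grafana, so a small and stable set of labels is usually a sensible choice.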

---

# SECTION 23 — WHY SEPARATE ECS SERVICES?

Production cluster:

```text
production-cluster
├── demo-app-service
├── prometheus-service
├── grafana-service
├── loki-service
├── alloy-service
└── node-exporter-service
```

---

# Why Separate?

## Independent Scaling

Prometheus may need more CPU.

Loki may need more storage.

---

## Isolation

A Grafana crash should not affect the application.

---

## Easier Upgrades

Upgrade Grafana independently.

---

## Better Security

Different IAM roles.

---

## Better Resource Allocation

Different memory/CPU.

---

# SECTION 24 — WHAT DEVOPS ENGINEER SETS UP

---

# Networking

* VPC
* subnets
* route tables
* NAT Gateway
* Internet Gateway
* security groups

---

# ECS Infrastructure

* ECS cluster
* task definitions
* ECS services

---

# Load Balancing

* ALB
* target groups
* listeners
* SSL certificates

---

# CI/CD

* Jenkins
* GitHub Actions
* GitLab CI

---

# Containerization

* Dockerfiles
* ECR repositories

---

# Observability

* Prometheus
* Grafana
* Loki
* Alloy
* alerts
* dashboards

---

# Security

* IAM roles
* secrets
* TLS certificates

---

# Scaling

* ECS autoscaling
* CPU policies
* memory policies

---

# SECTION 25 — DAILY DEVOPS MONITORING

---

# ECS

Check:

* task restarts
* unhealthy tasks
* deployment failures

---

# ALB

Check:

* 5XX errors
* latency
* unhealthy targets

---

# Prometheus

Check:

* targets DOWN
* scrape failures

---

# Grafana

Check:

* alerts
* spikes
* trends

---

# Loki

Check:

* errors
* crashes
* timeouts

---

# Alloy

Check:

* forwarding errors
* dropped logs

---

# Node Exporter

Check:

* CPU
* memory
* disk usage
* network traffic

---

# SECTION 26 — REAL INCIDENT TROUBLESHOOTING

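Users complain:

```text
Website slow
```

---

# Investigation

## Step 1 — Grafana

CPU spike detected.

---

## Step 2 — Prometheus

Latency:

```text
5 seconds
```

---

## Step 3 — Loki

Logs show:

```text
Database timeout
```

---

## Step 4 — ECS

Tasks restarting repeatedly.

---

# Root Cause

Database connection pool exhausted.

Requests were waiting for a free connection, which explains the latency and the timeouts in the logs. Failing health checks then caused ECS to restart the tasks.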
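
Pool exhaustion is easy to reproduce in miniature. The sketch below is a toy fixed-size pool, not a real database driver. The `ConnectionPool` class and the `conn-N` names are made up for illustration.

```python
import queue


class ConnectionPool:
    """A fixed-size pool; acquire() fails once every connection is in use."""

    def __init__(self, size: int):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float) -> str:
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("database timeout: no free connection") from None

    def release(self, conn: str) -> None:
        self._free.put(conn)


pool = ConnectionPool(size=2)
held = [pool.acquire(timeout=0.1) for _ in range(2)]  # slow queries hold both

try:
    pool.acquire(timeout=0.1)   # a third request finds nothing free
    outcome = "ok"
except TimeoutError as exc:
    outcome = str(exc)
print(outcome)  # database timeout: no free connection
```

Once the held connections are released, new requests succeed again. This is why such incidents often appear under load and disappear when traffic drops.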

---

# SECTION 27 — ECS ADVANTAGES

## Easy AWS Integration

Integrates natively with:

* IAM
* Route53
* CloudWatch
* ALB
* ECR

---

## Easier Than Kubernetes

Simpler learning curve.

---

## Self-Healing

Automatic task replacement.

---

## High Availability

Multi-AZ deployments.

---

## Fast Deployments

Rolling updates.

---

# SECTION 28 — ECS DISADVANTAGES

## AWS Vendor Lock-In

Mostly tied to the AWS ecosystem.

---

## Less Flexible Than Kubernetes

Kubernetes is more portable across clouds.

---

## Fargate Cost

Can become expensive at scale.

---

# SECTION 29 — WHEN TO USE ECS FARGATE

Best for:

* APIs
* microservices
* startups
* AWS-native systems
* small DevOps teams

---

# SECTION 30 — COMPLETE FINAL ARCHITECTURE

```text
Users
↓
Route53
↓
ALB
↓
ECS Cluster
↓
Fargate Tasks
↓
Demo App Containers
↓
─────────────────────────────
Metrics → Prometheus
Logs → Alloy → Loki
Visualization → Grafana
Infrastructure → Node Exporter
─────────────────────────────
↓
Database
```

---

# FINAL BIG PICTURE

## Docker

Creates containers.

## ECS

Manages containers.

## Fargate

Provides serverless infrastructure.

## ALB

Routes traffic.

## ECR

Stores images.

## Prometheus

Collects metrics.

## Grafana

Visualizes data.

## Loki

Stores logs.

## Alloy

Collects/ships logs.

## Node Exporter

Infrastructure metrics.

---

# MAIN PURPOSE OF THIS ENTIRE SYSTEM

To build:

* scalable systems
* self-healing systems
* highly available systems
* automated deployments
* centralized monitoring
* centralized logging
* reliable production infrastructure

This is how modern DevOps and SRE teams operate production applications in AWS cloud environments.