DEV Community

Aisalkyn Aidarova

ECS + FARGATE + CONTAINERIZATION + OBSERVABILITY + PRODUCTION ARCHITECTURE

Today we are going to understand how real production systems work inside AWS using:

* Docker
* ECS
* Fargate
* ALB
* Prometheus
* Grafana
* Loki
* Alloy
* Node Exporter

This is not just theory.

This is how real companies build scalable cloud applications.


---

# SECTION 1 — HOW APPLICATIONS WERE RUN BEFORE CLOUD

Years ago, applications were installed directly on servers.

For example, one server contained:

* Java application
* MySQL
* Apache
* Redis
* Monitoring tools

Everything was installed directly into the Linux OS.

Example:

```bash
# Package names as used on Debian/Ubuntu
sudo apt install mysql-server
sudo apt install apache2
sudo apt install python3
```

---

# Problems With Traditional Servers

Imagine company has:

* 100 developers
* 20 applications
* thousands of users

Now problems start.

---

# Problem 1 — Dependency Conflicts

One application requires:



```text
Python 2.7
```

Another application requires:

```text
Python 3.11
```

On a shared server the two applications can break each other, because both compete for the same system-wide interpreter and library versions.

---

# Problem 2 — Hard Scaling

Traffic increases.

Company must:

* buy new servers
* install software again
* configure manually

Very slow.

---

# Problem 3 — “Works on My Machine”

Developer says:



```text
It works on my laptop.
```

But production fails.

Why?

Different:

* OS versions
* libraries
* dependencies
* configurations

---

# Problem 4 — Resource Waste

Application uses:

```text
10% CPU
```

But the whole server is reserved for it.

Huge waste.

---

# SECTION 2 — VIRTUAL MACHINES

Then virtualization came.

Tools:

* VMware
* VirtualBox
* Hyper-V

One physical server could run multiple VMs.

---

# What Is a Virtual Machine?

A VM is a software-based computer.

Each VM has:

* full OS
* kernel
* RAM
* storage

Architecture:



```text
Physical Server
      ↓
Hypervisor
      ↓
VM1
VM2
VM3
```

---

# VM Advantages

* isolation
* multiple OS
* better resource usage

---

# VM Problems

Still heavy.

Each VM requires:

* full operating system
* boot process
* kernel memory

Startup slow.

Consumes large resources.


---

# SECTION 3 — CONTAINERIZATION REVOLUTION

Then containers changed the IT industry.

Most famous tool:

Docker

---

# What Is a Container?

A container is an isolated, lightweight package containing:

* application code
* dependencies
* runtime
* libraries
* binaries

BUT unlike a VM:

* containers share the host kernel

This makes containers lightweight.

---

# Container vs VM

| Virtual Machine | Container      |
| --------------- | -------------- |
| Full OS         | Shared kernel  |
| Heavy           | Lightweight    |
| Slow startup    | Fast startup   |
| Large size      | Small size     |
| More resources  | Less resources |

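---

# Example Without Docker

Developer installs:

```bash
sudo apt install nodejs npm
npm install
node app.js
```

The production server may have a different Node.js version or different libraries installed, so the same commands may fail there.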

---

# Example With Docker

Create Dockerfile:



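```dockerfile
# Pin the base image so every build uses the same Node.js major version
FROM node:18

WORKDIR /app

# Copy dependency manifests first so this layer is cached between builds
COPY package*.json ./
RUN npm install

# Copy the application source
COPY . .

# The app is assumed to listen on port 3000 (matches the docker run example)
EXPOSE 3000

CMD ["npm","start"]
```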

Build image:

```bash
docker build -t myapp .
```

Run container:

```bash
docker run -p 3000:3000 myapp
```

Now the application behaves the SAME in every environment that runs this image.

---

# Main Idea of Containerization

Package application once.

Run anywhere.


---

# SECTION 4 — WHY CONTAINERS BECAME ESSENTIAL

Modern companies require:

* scalability
* automation
* fast deployment
* microservices
* CI/CD
* portability

Containers solve these problems.

---

# Real Example

Netflix-like architecture:

* frontend
* backend
* payments
* recommendation engine
* authentication

Each service runs independently.

Each service is containerized.


---

# SECTION 5 — WHY DOCKER ALONE IS NOT ENOUGH

Docker can:

* build images
* run containers

BUT a production environment has:

* hundreds of containers
* scaling
* failures
* networking
* deployments
* monitoring

Docker alone cannot manage all of this.

Need an orchestration platform.


---

# SECTION 6 — WHAT IS ECS?

Amazon Elastic Container Service (ECS) is the AWS container orchestration service.

Main purpose:

Manage containers automatically in production.

---

# ECS Responsibilities

ECS:

* runs containers
* restarts failed containers
* scales applications
* deploys new versions
* manages networking
* integrates with ALB
* monitors health

---

# Important Concept

Docker CREATES containers.

ECS MANAGES containers.


---

# SECTION 7 — ECS CORE COMPONENTS

---

# 1. ECS Cluster

Logical grouping of services and tasks, together with the capacity they run on.

Example:

```text
production-cluster
```

Inside cluster:

* services
* tasks
* containers

---

# 2. Task Definition

Blueprint/template of application.

Contains:

* image
* CPU
* memory
* ports
* logging
* environment variables

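Simplified example (a real task definition nests container settings under `containerDefinitions`; the `family` and `name` values here are placeholders):

```json
{
  "family": "demo-app",
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    { "name": "web", "image": "nginx", "portMappings": [{ "containerPort": 80 }] }
  ]
}
```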

---

# 3. Task

Running copy of task definition.

Think:

| Component       | Meaning          |
| --------------- | ---------------- |
| Task Definition | Recipe           |
| Task            | Running instance |

---

# 4. ECS Service

Maintains desired number of tasks.

Example:

```text
Desired Count = 3
```

If task crashes:
ECS automatically creates new one.

Self-healing.
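
The self-healing behaviour can be pictured as a reconciliation loop: compare the number of running tasks with the desired count and start replacements for the difference. The Python sketch below is a toy model under that assumption. `ServiceSketch`, `reconcile`, and `crash` are hypothetical names, not part of any AWS API, and the real ECS scheduler also deals with placement, health checks, and deployments.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ServiceSketch:
    """Toy model of an ECS service; not the real ECS scheduler."""
    desired_count: int
    running: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def reconcile(self) -> list:
        """Start tasks until the running count matches the desired count."""
        started = []
        while len(self.running) < self.desired_count:
            task_id = f"task-{next(self._ids)}"
            self.running.add(task_id)
            started.append(task_id)
        return started

    def crash(self, task_id: str) -> None:
        """Simulate a task exiting unexpectedly."""
        self.running.discard(task_id)


service = ServiceSketch(desired_count=3)
service.reconcile()                 # starts task-1, task-2, task-3
service.crash("task-2")             # one task dies
replacements = service.reconcile()  # the loop notices and starts task-4
```

The point of the design is that nobody tells the service "restart task-2"; it only ever compares actual state with desired state.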

---

# SECTION 8 — ECS LAUNCH TYPES

ECS supports TWO main ways to run containers on AWS.

---

# OPTION 1 — EC2 Launch Type

Architecture:



```text
ECS
   ↓
EC2 Instances
   ↓
Docker Containers
```

You manage:

* EC2 servers
* OS patching
* AMIs
* autoscaling groups
* capacity planning

---

# OPTION 2 — FARGATE Launch Type

AWS Fargate provides serverless compute for containers.

Architecture:

```text
ECS
   ↓
Fargate
   ↓
Containers
```

No EC2 management.

---

# SECTION 9 — WHAT EXACTLY IS FARGATE?

Important interview question.

Fargate is NOT orchestration.

ECS orchestrates.

Fargate provides infrastructure.

---

# Analogy

Imagine restaurant.

ECS = restaurant manager
Fargate = kitchen infrastructure

Manager organizes work.

Kitchen provides place to cook.

---

# Why Need Fargate If ECS Exists?

Containers need:

* CPU
* RAM
* networking
* storage

Question:
Where will containers physically run?

Answer:

* EC2
  OR
* Fargate

---

# Without Fargate

YOU manage:

* EC2
* patching
* security
* autoscaling
* AMIs

---

# With Fargate

AWS manages infrastructure.

You only provide:

* image
* CPU
* memory
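
Fargate does not accept arbitrary CPU and memory values; each CPU size allows only certain memory sizes. The Python sketch below checks a pair before it goes into a task definition. The table is written from memory of the AWS documentation and covers only the smaller CPU tiers, so treat the numbers as an assumption and confirm them against the current docs. `FARGATE_SIZES` and `is_valid_fargate_size` are hypothetical names.

```python
# CPU units -> allowed memory values in MiB (assumed subset; larger tiers exist)
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair appears in the table above."""
    return memory_mib in FARGATE_SIZES.get(cpu, [])


small_ok = is_valid_fargate_size(256, 512)     # 0.25 vCPU with 0.5 GB
too_much = is_valid_fargate_size(256, 4096)    # 0.25 vCPU cannot take 4 GB
```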

---

# SECTION 10 — ECS VS ALB

Application Load Balancer distributes incoming traffic.

---

# ALB Responsibilities

ALB:

* receives requests
* routes traffic
* SSL termination
* health checks
* load balancing

---

# ECS Responsibilities

ECS:

* runs containers
* scales containers
* restarts containers

---

# Difference

| ECS                     | ALB                |
| ----------------------- | ------------------ |
| Runs applications       | Routes requests    |
| Manages tasks           | Balances traffic   |
| Orchestrates containers | Handles HTTP/HTTPS |

---

# SECTION 11 — COMPLETE APPLICATION FLOW

Production architecture:



```text
User
 ↓
Route53
 ↓
ALB
 ↓
ECS Service
 ↓
Fargate Tasks
 ↓
Containers
 ↓
Database
```

---

# Flow Explanation

## Step 1

User opens:

```text
app.company.com
```

---

## Step 2

Route53 (DNS) resolves the domain to the ALB's address.

---

## Step 3

ALB receives request.

---

## Step 4

ALB forwards request to healthy ECS tasks.

---

## Step 5

Container processes request.

---

## Step 6

Response returned.
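
Step 4 can be sketched in a few lines of Python: keep only the targets whose last health check passed and rotate through them (round robin). The addresses and the `healthy_round_robin` helper are made up for illustration; a real ALB tracks health per target group and supports other routing algorithms as well.

```python
from itertools import cycle


def healthy_round_robin(targets: dict):
    """Return an iterator that rotates through the healthy target addresses.

    `targets` maps address -> True if the last health check passed.
    """
    healthy = [addr for addr, ok in targets.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy targets")
    return cycle(healthy)


# Hypothetical task addresses in three private subnets
targets = {
    "10.0.1.10:3000": True,
    "10.0.2.11:3000": False,   # failed its health check, receives no traffic
    "10.0.3.12:3000": True,
}
picker = healthy_round_robin(targets)
first_four = [next(picker) for _ in range(4)]
```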

---

# SECTION 12 — WHY PRIVATE SUBNETS?

Best practice:

| Component | Location       |
| --------- | -------------- |
| ALB       | Public subnet  |
| ECS Tasks | Private subnet |
| Database  | Private subnet |

---

# Why?

Security.

Users should NEVER directly access containers.

Only ALB exposed publicly.
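
To make the layout concrete, here is a small Python sketch that carves an assumed VPC range (`10.0.0.0/16`) into `/24` subnets and labels a few of them. The CIDR, the subnet names, and the `subnet_for` helper are illustrative assumptions. Keep in mind that what actually makes a subnet public is its route to an Internet Gateway, not its address range.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")    # assumed VPC CIDR
subnets = list(vpc.subnets(new_prefix=24))   # 256 possible /24 subnets

# Hypothetical plan: two public subnets for the ALB, two private for tasks and database
layout = {
    "public-a (ALB)": subnets[0],
    "public-b (ALB)": subnets[1],
    "private-a (ECS tasks, database)": subnets[10],
    "private-b (ECS tasks, database)": subnets[11],
}


def subnet_for(ip: str) -> str:
    """Return the name of the planned subnet that contains the address."""
    addr = ipaddress.ip_address(ip)
    for name, net in layout.items():
        if addr in net:
            return name
    return "outside planned subnets"


task_location = subnet_for("10.0.10.25")
```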

---

# SECTION 13 — ECR

Amazon Elastic Container Registry stores Docker images.

Like Docker Hub, but inside AWS.

---

# Deployment Flow



```text
Developer
   ↓
GitHub
   ↓
CI/CD Pipeline
   ↓
Docker Build
   ↓
Push to ECR
   ↓
Update ECS Service
   ↓
Deploy New Tasks
```
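
The "Push to ECR" and "Update ECS Service" steps mostly come down to building a new image URI and swapping it into the task definition. The Python sketch below shows that data manipulation only; it makes no AWS calls. The account ID, region, repository, and tag are placeholders, and `ecr_image_uri` and `with_new_image` are hypothetical helper names.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build an image URI in the ECR registry hostname format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def with_new_image(task_def: dict, container_name: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    containers = [
        {**c, "image": image} if c["name"] == container_name else dict(c)
        for c in task_def["containerDefinitions"]
    ]
    return {**task_def, "containerDefinitions": containers}


# Placeholder values; a pipeline would typically use the git commit SHA as the tag
uri = ecr_image_uri("123456789012", "us-east-1", "myapp", "3f2a9c1")
old_def = {"family": "demo-app", "containerDefinitions": [{"name": "web", "image": "nginx"}]}
new_def = with_new_image(old_def, "web", uri)
```

Tagging images with an immutable identifier instead of `latest` makes it clear which build each task is running and makes rollbacks straightforward.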

---

# SECTION 14 — HEALTH CHECKS

ALB periodically requests a health check path on every task, for example:

```text
/health
```

If unhealthy:

* ALB stops routing traffic
* ECS replaces task

Self-healing.
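
A load balancer does not flip a target's state on a single failed request; it counts consecutive results against configurable thresholds. The Python sketch below models that idea. `TargetHealth` is a hypothetical class and the threshold values are assumptions chosen for the example, not ALB defaults.

```python
class TargetHealth:
    """Tracks one target's state from consecutive health check results."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            limit = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


target = TargetHealth(healthy_threshold=2, unhealthy_threshold=2)
# pass, fail, fail (becomes unhealthy), pass, pass (becomes healthy again)
states = [target.record(ok) for ok in [True, False, False, True, True]]
```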

---

# SECTION 15 — WHY OBSERVABILITY IS CRITICAL

Production systems fail.

Need answers:

* Why app crashed?
* Why slow?
* Which task failed?
* CPU high?
* Logs errors?
* Memory leak?

This is observability.

---

# Three Pillars of Observability

| Pillar  | Meaning            |
| ------- | ------------------ |
| Metrics | Numerical data     |
| Logs    | Application events |
| Traces  | Request journey    |

---

# SECTION 16 — OBSERVABILITY TOOLS

| Tool          | Purpose                |
| ------------- | ---------------------- |
| Prometheus    | Metrics collection     |
| Grafana       | Visualization          |
| Loki          | Logs storage           |
| Alloy         | Log collection         |
| Node Exporter | Infrastructure metrics |

---

# SECTION 17 — DEMO APPLICATION SERVICE

This is actual business application.

Examples:

* ecommerce app
* API
* frontend
* backend

---

# Why Separate ECS Service?

Microservices architecture.

Example:



```text
frontend-service
backend-service
auth-service
payment-service
```

Each service independently scalable.

---

# SECTION 18 — PROMETHEUS

Prometheus collects metrics.

---

# What Are Metrics?

Numbers representing system state.

Examples:

* CPU usage
* memory usage
* requests/sec
* latency
* error rates

---

# Prometheus Pull Model

Prometheus SCRAPES endpoints:

```text
/metrics
```

---

# Flow



```text
Prometheus
     ↓
Application /metrics endpoint
```
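
What Prometheus receives from a scrape is plain text. The Python sketch below renders counters in the text exposition format and serves them at `/metrics` using only the standard library. The metric name, the `COUNTERS` dictionary, and the port in the comment are assumptions for illustration; real applications normally use an official Prometheus client library instead of hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory counters: name -> (help text, current value)
COUNTERS = {"http_requests_total": ("Total HTTP requests.", 42)}


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the rendered counters."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve it locally (blocks forever), one could run:
#   from http.server import HTTPServer
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
output = render_metrics(COUNTERS)
```

Because Prometheus pulls, the application only has to expose this endpoint; it never needs to know where Prometheus lives.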

---

# What DevOps Checks

## Targets UP/DOWN

Healthy?

---

## Scrape Errors

Metrics accessible?

---

## Query Performance

PromQL functioning?

---

## Storage Usage

Prometheus database healthy?

---

# SECTION 19 — NODE EXPORTER

Prometheus Node Exporter exports OS-level metrics.

---

# Metrics Collected

* CPU
* RAM
* disk
* filesystem
* network

---

# Flow

```text
Node Exporter
     ↓
Prometheus
     ↓
Grafana
```


---

# Why Important?

Application failures often caused by:

* disk full
* CPU saturation
* memory exhaustion

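Need infrastructure visibility. Note that on Fargate the host is managed by AWS, so Node Exporter can only report what is visible from inside its own task; full host metrics require the EC2 launch type.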

---

# SECTION 20 — GRAFANA

Grafana visualizes metrics and logs.

---

# Important

Grafana DOES NOT collect data.

Prometheus collects.

Grafana visualizes.

---

# Grafana Reads

* Prometheus
* Loki

---

# What DevOps Checks

## Dashboards

CPU?
Memory?
Traffic?

---

## Alerts

* high latency
* high CPU
* application down

---

## Traffic Trends

Traffic increase patterns.

---

# SECTION 21 — LOKI

Grafana Loki stores logs centrally.

---

# Log Examples



```text
ERROR database timeout
INFO login successful
FATAL application crashed
```

---

# Why Loki?

Without centralized logs, you must connect to individual containers to read them.

Very bad in production.

---

# Loki Architecture

```text
Containers
     ↓
Alloy
     ↓
Loki
     ↓
Grafana
```

---

# SECTION 22 — ALLOY

Grafana Alloy collects and forwards logs/metrics.

---

# Alloy Responsibilities

Acts like:

* collector
* shipper
* forwarder

---

# Why Needed?

Containers generate logs continuously.

Loki only stores and indexes the logs that are pushed to it; it does not collect them from containers itself.

Need collection agent.

---

# Flow



```text
Container Logs
     ↓
Alloy
     ↓
Loki
```
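
The last hop of this flow is an HTTP push. The Python sketch below builds the kind of JSON body that Loki's push endpoint (`/loki/api/v1/push`) accepts: labelled streams with nanosecond timestamps. Alloy does this for you; the sketch only illustrates the shape of the data, and the label values and `loki_push_payload` helper are assumptions.

```python
import json
import time


def loki_push_payload(labels: dict, lines: list, now_ns=None) -> str:
    """Build a JSON body with one stream of log lines sharing the same labels."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch
    ts = str(now_ns if now_ns is not None else time.time_ns())
    stream = {"stream": labels, "values": [[ts, line] for line in lines]}
    return json.dumps({"streams": [stream]})


# Hypothetical labels; a fixed timestamp keeps the example reproducible
body = loki_push_payload(
    {"service": "demo-app", "level": "error"},
    ["ERROR database timeout"],
    now_ns=1700000000000000000,
)
decoded = json.loads(body)
```

Labels are what Grafana later filters on, so a small, stable set (service, environment, level) tends to work better than putting unique values such as request IDs into labels.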

---

# SECTION 23 — WHY SEPARATE ECS SERVICES?

Production cluster:

```text
production-cluster
├── demo-app-service
├── prometheus-service
├── grafana-service
├── loki-service
├── alloy-service
└── node-exporter-service
```

---

# Why Separate?

## Independent Scaling

Prometheus may need more CPU.

Loki may need more storage.

---

## Isolation

Grafana crash should not affect application.

---

## Easier Upgrades

Upgrade Grafana independently.

---

## Better Security

Different IAM roles.

---

## Better Resource Allocation

Different memory/CPU.

---

# SECTION 24 — WHAT DEVOPS ENGINEER SETS UP

---

# Networking

* VPC
* subnets
* route tables
* NAT Gateway
* Internet Gateway
* security groups

---

# ECS Infrastructure

* ECS cluster
* task definitions
* ECS services

---

# Load Balancing

* ALB
* target groups
* listeners
* SSL certificates

---

# CI/CD

* Jenkins
* GitHub Actions
* GitLab CI

---

# Containerization

* Dockerfiles
* ECR repositories

---

# Observability

* Prometheus
* Grafana
* Loki
* Alloy
* alerts
* dashboards

---

# Security

* IAM roles
* secrets
* TLS certificates

---

# Scaling

* ECS autoscaling
* CPU policies
* memory policies
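
A CPU or memory policy boils down to "keep the average metric near a target by changing the task count". The Python sketch below uses a simple proportional rule to show the idea. It is a simplification and not the exact algorithm AWS uses; `desired_tasks` and all numbers are illustrative assumptions.

```python
import math


def desired_tasks(current: int, metric: float, target: float,
                  min_tasks: int, max_tasks: int) -> int:
    """Scale the task count in proportion to metric/target, within limits."""
    proposed = math.ceil(current * metric / target)
    return max(min_tasks, min(max_tasks, proposed))


# 4 tasks at 90% average CPU with a 60% target: scale out
scale_out = desired_tasks(current=4, metric=90.0, target=60.0, min_tasks=2, max_tasks=10)
# 4 tasks at 20% average CPU: scale in, but never below the minimum
scale_in = desired_tasks(current=4, metric=20.0, target=60.0, min_tasks=2, max_tasks=10)
```

The minimum and maximum matter as much as the target: the minimum protects availability during quiet periods and the maximum caps cost during spikes.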

---

# SECTION 25 — DAILY DEVOPS MONITORING

---

# ECS

Check:

* task restarts
* unhealthy tasks
* deployment failures

---

# ALB

Check:

* 5XX errors
* latency
* unhealthy targets

---

# Prometheus

Check:

* targets DOWN
* scrape failures

---

# Grafana

Check:

* alerts
* spikes
* trends

---

# Loki

Check:

* errors
* crashes
* timeouts

---

# Alloy

Check:

* forwarding errors
* dropped logs

---

# Node Exporter

Check:

* CPU
* memory
* disk usage
* network traffic

---

# SECTION 26 — REAL INCIDENT TROUBLESHOOTING

Users complain:



```text
Website slow
```

---

# Investigation

## Step 1 — Grafana

CPU spike detected.

---

## Step 2 — Prometheus

Latency:

```text
5 seconds
```

---

## Step 3 — Loki

Logs show:



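```text
Database timeout
```

---

## Step 4 — ECS

Tasks restarting repeatedly.

---

# Root Cause

Database connection pool exhausted. Requests queued waiting for a free connection until they timed out; the slow responses failed health checks, so ECS kept replacing tasks.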
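
Pool exhaustion is easy to reproduce in miniature. The Python sketch below models a fixed-size pool with the standard library: once every connection is checked out, the next caller waits and then gets a timeout, which is the kind of error that surfaces in the logs. `ConnectionPool` is a hypothetical class and the connections are just strings.

```python
import queue


class ConnectionPool:
    """Fixed-size pool; acquire() waits for a free connection or times out."""

    def __init__(self, size: int):
        self._free = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float = 0.1) -> str:
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("database timeout: no free connections") from None

    def release(self, conn: str) -> None:
        self._free.put(conn)


pool = ConnectionPool(size=2)
a = pool.acquire()
b = pool.acquire()          # the pool is now empty
try:
    pool.acquire(timeout=0.05)
    error = None
except TimeoutError as exc:
    error = str(exc)        # a third caller times out
pool.release(a)             # returning a connection unblocks later callers
c = pool.acquire()
```

Typical fixes are releasing connections reliably (for example in a `finally` block), sizing the pool to match the task count, or adding a connection proxy in front of the database.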


---

# SECTION 27 — ECS ADVANTAGES

## Easy AWS Integration

Integrates natively with:

* IAM
* Route53
* CloudWatch
* ALB
* ECR

---

## Easier Than Kubernetes

Simpler learning curve.

---

## Self-Healing

Automatic task replacement.

---

## High Availability

Multi-AZ deployments.

---

## Fast Deployments

Rolling updates.
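
During a rolling update, ECS keeps the number of running tasks between two bounds derived from the service's `minimumHealthyPercent` and `maximumPercent` settings. The Python sketch below computes those bounds, assuming the lower bound rounds up and the upper bound rounds down as described in the AWS documentation. The function name and the example values are illustrative.

```python
import math


def rolling_update_bounds(desired: int, min_healthy_pct: int = 100, max_pct: int = 200):
    """Return (fewest tasks that must stay running, most tasks allowed at once)."""
    lower = math.ceil(desired * min_healthy_pct / 100)
    upper = math.floor(desired * max_pct / 100)
    return lower, upper


# With 100% / 200%, new tasks start before any old task is stopped
safe = rolling_update_bounds(4)
# With 50% / 150%, ECS may stop some old tasks first, which needs less spare capacity
tight = rolling_update_bounds(4, min_healthy_pct=50, max_pct=150)
```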


---

# SECTION 28 — ECS DISADVANTAGES

## AWS Vendor Lock-In

ECS is tied to the AWS ecosystem.

---

## Less Flexible Than Kubernetes

Kubernetes is more portable across clouds and on-premises environments.

---

## Fargate Cost

Can become expensive at scale.

---

# SECTION 29 — WHEN TO USE ECS FARGATE

Best for:

* APIs
* microservices
* startups
* AWS-native systems
* small DevOps teams

---

# SECTION 30 — COMPLETE FINAL ARCHITECTURE

```text
Users
  ↓
Route53
  ↓
ALB
  ↓
ECS Cluster
  ↓
Fargate Tasks
  ↓
Demo App Containers

─────────────────────────────
Metrics → Prometheus
Logs → Alloy → Loki
Visualization → Grafana
Infrastructure → Node Exporter
─────────────────────────────
  ↓
Database
```

---

# FINAL BIG PICTURE

## Docker

Creates containers.

## ECS

Manages containers.

## Fargate

Provides serverless infrastructure.

## ALB

Routes traffic.

## ECR

Stores images.

## Prometheus

Collects metrics.

## Grafana

Visualizes data.

## Loki

Stores logs.

## Alloy

Collects/ships logs.

## Node Exporter

Infrastructure metrics.

---

# MAIN PURPOSE OF THIS ENTIRE SYSTEM

To build:

* scalable systems
* self-healing systems
* highly available systems
* automated deployments
* centralized monitoring
* centralized logging
* reliable production infrastructure

This is how modern DevOps and SRE teams operate production applications in AWS cloud environments.