Aisalkyn Aidarova

Posted on May 25

Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter

#monitoring #devops #aws #tutorial

Goal

You will build this architecture:

ECS Fargate Application
   |
   | metrics/logs
   v
Alloy sidecar
   |
   | remote_write metrics
   | push logs
   v
EC2 Monitoring Server
   - Prometheus :9090
   - Grafana    :3000
   - Loki       :3100
   - Alloy
   - Node Exporter

On ECS Fargate, the task execution role covers what ECS does on your behalf, such as pulling images and shipping logs, while the task role carries the AWS permissions your application needs. (AWS Documentation) Alloy can collect ECS/Fargate container metrics through the ECS Task Metadata Endpoint v4, and for that it should run as a sidecar inside the task. (Grafana Labs)

Part 1: What Each Tool Does

Tool	What it does	Why DevOps/SRE uses it
ECS	Runs containers on AWS	Deploy microservices
Fargate	Serverless container runtime	No EC2 patching/management
IAM Role	Gives permission securely	No hardcoded AWS keys
Prometheus	Stores metrics	CPU, memory, request rate, errors
Grafana	Visual dashboard	See health visually
Loki	Stores logs	Troubleshoot errors
Alloy	Collects metrics/logs/traces	Modern agent replacing many old agents
Node Exporter	Exposes EC2 Linux metrics	Monitor EC2 server health

Part 2: EC2 Monitoring Server Check

This lab assumes your EC2 monitoring server already has these installed:

Prometheus
Grafana
Node Exporter
Loki
Alloy

Step 1: Check all services

Run on EC2:

sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status loki
sudo systemctl status alloy
sudo systemctl status node_exporter

Expected:

active (running)

Why we check this

Before we connect ECS, the central monitoring server must be healthy.

SRE/DevOps checks

DevOps checks:

sudo ss -tulnp | grep -E '3000|9090|9100|3100'

Expected ports:

3000 Grafana
9090 Prometheus
9100 Node Exporter
3100 Loki

SRE checks:

curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:9100/metrics

Expected:

Prometheus ready
Loki ready
Node metrics visible
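
The service checks above can be rolled into one repeatable script. This is a minimal sketch, assuming a systemd host and the unit names used in this lab; `service_state` and `failed_services` are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# Unit names as used in this lab; adjust if your install names them differently.
SERVICES="prometheus grafana-server loki alloy node_exporter"

# Print the systemd state of one unit ("active", "inactive", "failed", ...).
service_state() {
  systemctl is-active "$1" 2>/dev/null || true
}

# Print every unit in $SERVICES that is not "active".
# Prints nothing when all units are healthy.
failed_services() (
  for svc in $SERVICES; do
    state="$(service_state "$svc")"
    [ "$state" = "active" ] || echo "$svc"
  done
)
```

Calling `failed_services` on the monitoring server and getting no output means all five units are active; any name it prints is the one to inspect with `systemctl status`.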

Part 3: Fix Prometheus for Remote Write

Fargate tasks are dynamic. Their private IP changes. So instead of Prometheus scraping every task IP, Alloy inside Fargate will push metrics to Prometheus.

Step 2: Enable Prometheus remote write receiver

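Create a systemd drop-in override for the Prometheus service. The binary, config, and data paths below assume a manual install; match them to your existing unit: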

sudo systemctl edit prometheus

Add:

[Service]
ExecStart=
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver

Restart:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl status prometheus

Test:

curl http://localhost:9090/-/ready

Why we do this

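A task gets a new private IP every time it is replaced, so a static scrape target goes stale. With the remote write receiver enabled, Prometheus accepts pushes at /api/v1/write, and the Alloy sidecar can send metrics from wherever the task lands.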
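
The readiness check only proves Prometheus is up, not that the receiver is on. One way to confirm the flag took effect is to read the runtime flags from the Prometheus HTTP API. This is a sketch under assumptions: `remote_write_enabled` is a hypothetical helper, and it matches the response with grep rather than a JSON parser.

```shell
#!/bin/sh
# Read the JSON body of /api/v1/status/flags on stdin and succeed only if
# the remote write receiver flag is reported as "true".
remote_write_enabled() {
  grep -Eq '"web\.enable-remote-write-receiver"[[:space:]]*:[[:space:]]*"true"'
}

# Example (run on the monitoring server):
#   curl -s http://localhost:9090/api/v1/status/flags | remote_write_enabled \
#     && echo "remote write receiver enabled" \
#     || echo "remote write receiver NOT enabled"
```

If the helper reports the receiver as not enabled, the override from Step 2 was probably not picked up; re-run `daemon-reload` and restart Prometheus.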

Part 4: EC2 Security Group

In AWS Console:

Go to:

EC2 → Instances → Select monitoring EC2 → Security → Security Group

Add inbound rules:

Port	Source	Purpose
3000	Your IP only	Grafana UI
9090	VPC CIDR only	Prometheus remote write
3100	VPC CIDR only	Loki logs
9100	Your IP or VPC only	Node Exporter test only

Example VPC CIDR:

10.0.0.0/16

Do not open 9090, 3100, 9100 to 0.0.0.0/0.

Why we do this

Prometheus and Loki ship without authentication enabled, so anyone who can reach these ports can read your data or push their own. Keep them private.
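
If you prefer the AWS CLI to the console, the same rules can be scripted. The sketch below only prints the commands so you can review them first. `ingress_cmd` is a hypothetical helper, and the security group ID in the example is a placeholder.

```shell
#!/bin/sh
# Print (not run) the AWS CLI command that opens one TCP port to one CIDR.
# Refuses 0.0.0.0/0 so a monitoring port is never opened to the whole internet.
ingress_cmd() (
  sg_id="$1"; port="$2"; cidr="$3"
  if [ "$cidr" = "0.0.0.0/0" ]; then
    echo "refusing to open port $port to 0.0.0.0/0" >&2
    exit 1
  fi
  echo "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol tcp --port $port --cidr $cidr"
)

# Example with placeholder values (sg-0123456789abcdef0 is not a real group):
#   ingress_cmd sg-0123456789abcdef0 9090 10.0.0.0/16
#   ingress_cmd sg-0123456789abcdef0 3100 10.0.0.0/16
```

Printing instead of executing is a deliberate choice here: a wrong CIDR on a monitoring port is easier to catch in a reviewed command than after the rule is live.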

Part 5: Configure Alloy on EC2

Open:

sudo nano /etc/alloy/config.alloy

Use this:

prometheus.exporter.unix "local_host" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
}

prometheus.scrape "local_host" {
  targets    = prometheus.exporter.unix.local_host.targets
  forward_to = [prometheus.remote_write.local_prom.receiver]
}

prometheus.remote_write "local_prom" {
  endpoint {
    url = "http://127.0.0.1:9090/api/v1/write"
  }
}

loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog"},
    {__path__ = "/var/log/auth.log", job = "auth"},
    {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
  ]
  forward_to = [loki.write.local_loki.receiver]
}

loki.write "local_loki" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}

Restart:

sudo alloy fmt --write /etc/alloy/config.alloy
sudo systemctl restart alloy
sudo systemctl status alloy
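
A running Alloy service does not prove logs are arriving. A quick check is to ask Loki which values it has seen for the job label. This is a rough sketch: `loki_has_job` is a hypothetical helper, and it uses a plain grep, so it would also match a job literally named "status", "success", or "data".

```shell
#!/bin/sh
# Read the JSON body of /loki/api/v1/label/job/values on stdin and succeed
# if the given job name appears as a quoted string in the response.
loki_has_job() {
  grep -q "\"$1\""
}

# Example (run on the monitoring server a minute or two after the restart):
#   curl -s http://localhost:3100/loki/api/v1/label/job/values | loki_has_job syslog \
#     && echo "syslog is reaching Loki" \
#     || echo "no syslog entries yet"
```

A job that never shows up usually means the file does not exist on this host (for example, nginx is not installed) or Alloy cannot read it.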

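Important correction

Both endpoint URLs must point at the loopback address:

127.0.0.1

A common typo is:

123.0.0.1

That is a public address, so Alloy would try to push to an outside host instead of the local Prometheus and Loki.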

Part 6: Create ECS IAM Roles

Role 1: ECS Task Execution Role

AWS Console:

IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task

Attach:

AmazonECSTaskExecutionRolePolicy

Name:

ecsTaskExecutionRole

Why

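This allows ECS/Fargate to:

Pull images from ECR
Send logs to CloudWatch (create log streams and put log events)

The managed policy does not cover everything this lab touches:

Reading Secrets Manager secrets needs an extra policy on this role
"awslogs-create-group": "true" in Part 9 needs logs:CreateLogGroup on this role; otherwise create the log group /ecs/fargate-observability-lab yourself before starting the task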

Role 2: ECS Task Role

Create another role:

IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task

Name:

ecsAppTaskRole

For this lab, start with no extra permissions.

If the app needs S3 access later, add only the specific S3 permissions it requires.

Why

Task role is for your application container, not ECS itself.
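
Both roles can also be created from the AWS CLI. The sketch below assumes your CLI credentials are allowed to manage IAM. `ecs_trust_policy` and `create_lab_roles` are hypothetical helper names, and /tmp/ecs-trust.json is an arbitrary scratch path.

```shell
#!/bin/sh
# Trust policy that lets ECS tasks assume a role. Both lab roles use it.
ecs_trust_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Create the execution role (with the managed policy) and the empty task role.
create_lab_roles() {
  ecs_trust_policy > /tmp/ecs-trust.json
  aws iam create-role --role-name ecsTaskExecutionRole \
    --assume-role-policy-document file:///tmp/ecs-trust.json
  aws iam attach-role-policy --role-name ecsTaskExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
  aws iam create-role --role-name ecsAppTaskRole \
    --assume-role-policy-document file:///tmp/ecs-trust.json
}
```

The two roles share one trust policy and differ only in what is attached: the execution role gets the managed policy, and the task role starts with nothing, which matches the least-privilege advice above.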

Part 7: Create ECS Cluster

AWS Console:

ECS → Clusters → Create cluster

Choose:

AWS Fargate

Name:

prod-observability-cluster

Click:

Create

Why

A cluster is the logical grouping in which ECS services and tasks run.

Part 8: Create Simple Application Container

To keep the lab simple, use a demo app that exposes Prometheus metrics on port 8080.

Example image:

ghcr.io/brancz/prometheus-example-app:v0.5.0

It exposes:

/metrics

Port:

8080

Part 9: Create Fargate Task Definition

Go to:

ECS → Task Definitions → Create new task definition → Create new task definition with JSON

Use this template:

{
  "family": "fargate-observability-lab",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
  "containerDefinitions": [
    {
      "name": "demo-app",
      "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "demo-app",
          "awslogs-create-group": "true"
        }
      }
    },
    {
      "name": "alloy-sidecar",
      "image": "grafana/alloy:latest",
      "essential": false,
      "command": [
        "run",
        "--server.http.listen-addr=0.0.0.0:12345",
        "/etc/alloy/fargate.alloy"
      ],
      "environment": [
        {
          "name": "ALLOY_STABILITY_LEVEL",
          "value": "experimental"
        },
        {
          "name": "EC2_PROMETHEUS_URL",
          "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write"
        },
        {
          "name": "EC2_LOKI_URL",
          "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push"
        }
      ],
      "portMappings": [
        {
          "containerPort": 12345,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "alloy",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}

Replace:

<ACCOUNT_ID>
<EC2_PRIVATE_IP>
us-east-2 if your region is different
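
Filling the placeholders by hand is error-prone, so a small substitution step can help. This is a minimal sketch: `render_taskdef` is a hypothetical helper, the file names are placeholders, and the example values are made up.

```shell
#!/bin/sh
# Read a task definition template on stdin and write it to stdout with the
# lab placeholders filled in. Arguments: account ID, EC2 private IP, region.
render_taskdef() {
  sed -e "s/<ACCOUNT_ID>/$1/g" \
      -e "s/<EC2_PRIVATE_IP>/$2/g" \
      -e "s/us-east-2/$3/g"
}

# Example with made-up values:
#   render_taskdef 123456789012 10.0.1.25 us-east-1 < taskdef.template.json > taskdef.json
#   aws ecs register-task-definition --cli-input-json file://taskdef.json
```

Registering from the CLI has the same result as pasting the JSON into the console, and it keeps the rendered file around for review or version control.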

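Important note

The stock grafana/alloy image does not contain /etc/alloy/fargate.alloy, so the sidecar cannot start until you supply the config from Part 10. For a real production setup, deliver the Alloy config through one of:

EFS volume mounted into the sidecar
S3 object pulled at startup
custom Alloy image with the file baked in

For class/demo, a custom Alloy image is easiest. Point the alloy-sidecar "image" field at that image instead of grafana/alloy:latest.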
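
A custom image for this lab needs only two Dockerfile lines. The sketch below writes that Dockerfile. It assumes fargate.alloy sits in the same directory, `write_alloy_dockerfile` is a hypothetical helper, and the ECR repository name alloy-fargate is a placeholder you would create yourself.

```shell
#!/bin/sh
# Write a Dockerfile into the given directory that bakes fargate.alloy into
# the stock Alloy image at the path the task definition's command expects.
write_alloy_dockerfile() {
  cat > "$1/Dockerfile" <<'EOF'
FROM grafana/alloy:latest
COPY fargate.alloy /etc/alloy/fargate.alloy
EOF
}

# Example (log in to ECR and create the repository first):
#   write_alloy_dockerfile .
#   docker build -t <ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/alloy-fargate:lab .
#   docker push <ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/alloy-fargate:lab
```

After the push, the sidecar's image in the task definition would be the ECR URI rather than the stock image. Pulling from ECR is already covered by the execution role.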

Part 10: Alloy Fargate Config

Create file:

fargate.alloy

Content:

prometheus.scrape "app_metrics" {
  targets = [
    {"__address__" = "127.0.0.1:8080", "job" = "demo-app"}
  ]

  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
  collection_interval = "30s"

  output {
    metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
  }
}

otelcol.exporter.prometheus "fargate_to_prom" {
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

prometheus.remote_write "ec2_prometheus" {
  endpoint {
    url = sys.env("EC2_PROMETHEUS_URL")
  }
}

Why

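This collects:

Application /metrics
Fargate task CPU
Fargate task memory
Container-level metrics

otelcol.receiver.awsecscontainermetrics is marked experimental in Alloy. If the sidecar logs say the component is not available at the current stability level, add "--stability.level=experimental" to the alloy-sidecar command in Part 9.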

Part 11: Run ECS Service

Go to:

ECS → Clusters → prod-observability-cluster → Services → Create

Choose:

Launch type: Fargate
Task definition: fargate-observability-lab
Service name: demo-app-service
Desired tasks: 1

Networking:

VPC: same VPC as EC2 monitoring server
Subnets: private subnets preferred
Security group: allow outbound to EC2 private IP ports 9090 and 3100
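Public IP: disabled if the private subnet has a NAT gateway; otherwise enabled, because the task must reach the internet to pull the ghcr.io and Docker Hub images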

Click:

Create

What to check

Go to:

ECS → Cluster → Service → Tasks

Expected:

Task status: Running
Containers: demo-app running, alloy-sidecar running
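
The same check can be scripted against the AWS CLI by comparing the desired and running task counts. This is a sketch under assumptions: `service_converged` is a hypothetical helper that expects the two numbers on one line, which is what the `--query` and `--output text` options in the example produce.

```shell
#!/bin/sh
# Read "<desired> <running>" on stdin and succeed only if the service has at
# least one desired task and every desired task is running.
service_converged() (
  read -r desired running || [ -n "$desired" ] || exit 1
  [ "$desired" -gt 0 ] 2>/dev/null && [ "$desired" -eq "$running" ] 2>/dev/null
)

# Example:
#   aws ecs describe-services \
#     --cluster prod-observability-cluster --services demo-app-service \
#     --query 'services[0].[desiredCount,runningCount]' --output text \
#     | service_converged && echo "service is stable" || echo "tasks still starting or failing"
```

A service that never converges usually points to an image pull failure or a missing log group, both of which show up in the stopped task's reason field.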

Part 12: Verify in Prometheus

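Part 4 restricts port 9090 to the VPC CIDR, so first add a temporary inbound rule for your IP (or use an SSH tunnel and browse http://localhost:9090 instead). Then open: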

http://<EC2_PUBLIC_IP>:9090

Go to:

Status → TSDB Status

Then search in Graph:

up

Check Alloy internal metrics:

alloy_component_controller_running_components

Check EC2 CPU:

rate(node_cpu_seconds_total[5m])

Check EC2 memory:

node_memory_MemAvailable_bytes

Check app request metrics:

http_requests_total

Check Fargate container metrics:

ecs_task_memory_utilized

or:

container_memory_usage_bytes

Metric names may vary depending on Alloy/OpenTelemetry conversion.
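
Because the names can differ, it helps to test several candidates at once through the Prometheus query API instead of typing each into the Graph tab. This is a minimal sketch: `query_has_series` is a hypothetical helper that inspects the response with grep, and the metric list in the example comes from this lab.

```shell
#!/bin/sh
# Read the JSON body of /api/v1/query on stdin and succeed only if the query
# succeeded and returned at least one series.
query_has_series() (
  body="$(cat)"
  printf '%s' "$body" | grep -q '"status":"success"' || exit 1
  if printf '%s' "$body" | grep -q '"result":\[\]'; then exit 1; fi
)

# Example (run on the monitoring server):
#   for q in up http_requests_total node_memory_MemAvailable_bytes ecs_task_memory_utilized; do
#     curl -sG --data-urlencode "query=$q" http://localhost:9090/api/v1/query \
#       | query_has_series && echo "OK      $q" || echo "MISSING $q"
#   done
```

A metric reported as MISSING is not necessarily absent; it may simply be stored under a different name, which the metric explorer in the Prometheus UI can help you find.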

Part 13: Verify in Grafana

Open:

http://<EC2_PUBLIC_IP>:3000

Go to:

Connections → Data sources

Add Prometheus:

URL: http://localhost:9090

Add Loki:

URL: http://localhost:3100

Click:

Save & test

Expected:

Data source is working

Part 14: Grafana Explore Queries

Go to:

Grafana → Explore → Prometheus

Use:

up

rate(node_network_receive_bytes_total[1m])

100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

rate(http_requests_total[5m])

Go to:

Grafana → Explore → Loki

Use:

{job="syslog"}

{job="auth"}

{job="nginx_access"}

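The Fargate Alloy config in Part 10 has no Loki pipeline (EC2_LOKI_URL is set but unused), so ECS container logs do not appear in Loki. Find them in CloudWatch: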

CloudWatch → Log groups → /ecs/fargate-observability-lab

Part 15: What SRE Must Monitor

1. EC2 monitoring server health

100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

Alert if:

Memory > 85%

Why:

If monitoring server dies, you lose visibility.

2. Disk usage

100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})

Alert if:

Disk > 80%

Why:

Prometheus and Loki can fill disk quickly.

3. Fargate task memory

ecs_task_memory_utilized / ecs_task_memory_reserved * 100

Alert if:

> 85% for 3 minutes

Why:

When a container exceeds its memory limit, it is killed and the task stops with an out-of-memory error.

4. Application request rate

sum(rate(http_requests_total[5m]))

Why:

If traffic drops to zero, the app or the routing in front of it may be broken.

5. Error rate

sum(rate(http_requests_total{code=~"5.."}[5m]))

Why:

5xx errors show application or dependency failure.
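
The thresholds in this part can be turned into a Prometheus rule file. The sketch below writes one. The alert names, the 5m durations, and the "any 5xx" threshold are assumptions for illustration; only the 85% and 80% limits and the 3-minute Fargate window come from the text above. The Fargate metric names may differ in your setup.

```shell
#!/bin/sh
# Write a Prometheus rule file with the thresholds from this part.
# Argument: path of the rule file to create.
write_lab_alerts() {
  cat > "$1" <<'EOF'
groups:
  - name: observability-lab
    rules:
      - alert: MonitoringHostMemoryHigh
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 85
        for: 5m
      - alert: MonitoringHostDiskHigh
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"}) > 80
        for: 5m
      - alert: FargateTaskMemoryHigh
        expr: ecs_task_memory_utilized / ecs_task_memory_reserved * 100 > 85
        for: 3m
      - alert: AppServing5xx
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) > 0
        for: 5m
EOF
}

# Example:
#   write_lab_alerts /etc/prometheus/lab-alerts.yml
#   promtool check rules /etc/prometheus/lab-alerts.yml
```

Prometheus only loads the file once it is listed under `rule_files` in prometheus.yml and the configuration is reloaded. Sending notifications would additionally need Alertmanager, which this lab does not set up.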

Part 16: What DevOps Must Check

DevOps engineer checks:

1. IAM roles are correct
2. ECS task is running
3. Security groups allow only needed ports
4. Fargate can reach EC2 private IP
5. Prometheus remote write is enabled
6. Loki is receiving logs
7. Grafana data sources work
8. No public access to Prometheus/Loki/Node Exporter
9. ECS service has desired count = running count
10. CloudWatch logs exist for both containers

Part 17: Troubleshooting

Problem: ECS task running but no metrics

Check Alloy logs:

ECS → Task → alloy-sidecar → Logs

Look for:

connection refused
timeout
remote write failed

Common causes:

EC2 security group blocks port 9090
Wrong EC2 private IP
Prometheus remote write receiver not enabled
Alloy config error

Problem: Grafana shows no Loki logs

Check:

curl http://localhost:3100/ready
sudo journalctl -u alloy -f
sudo journalctl -u loki -f

Common causes:

Loki not running
Wrong Loki URL
Alloy cannot read log files
Alloy service user lacks read permission on /var/log/*
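
The last two causes can be checked directly. This is a small sketch: `unreadable_files` is a hypothetical helper, and it tests readability for whichever user runs it, so it is only meaningful when run as the user the Alloy service uses.

```shell
#!/bin/sh
# Print each path given as an argument that the current user cannot read
# (missing file or insufficient permissions).
# Prints nothing if every path is readable.
unreadable_files() {
  for f in "$@"; do
    [ -r "$f" ] || echo "$f"
  done
}

# Example:
#   unreadable_files /var/log/syslog /var/log/auth.log \
#     /var/log/nginx/access.log /var/log/nginx/error.log
```

To find out which user the service runs as, `systemctl show -p User alloy` prints it. Running the check as root proves nothing, since root can read everything.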

Problem: Node Exporter works but Fargate metrics missing

Cause:

Node Exporter monitors EC2 only.
It cannot monitor Fargate hosts.

Correct approach:

Use Alloy sidecar with ECS container metrics receiver.

Final Teaching Summary

This lab demonstrates a real DevOps/SRE production pattern:

ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.

The most important SRE mindset:

Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.