DEV Community

Aisalkyn Aidarova


Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter

Goal

You will build this architecture:

ECS Fargate Application
   |
   | metrics/logs
   v
Alloy sidecar
   |
   | remote_write metrics
   | push logs
   v
EC2 Monitoring Server
   - Prometheus :9090
   - Grafana    :3000
   - Loki       :3100
   - Alloy
   - Node Exporter

ECS Fargate tasks use two IAM roles: the task execution role lets ECS itself pull images and ship logs, and the task role grants the application its own AWS permissions (AWS Documentation). Alloy reads ECS/Fargate container metrics from the ECS Task Metadata Endpoint v4, so it has to run as a sidecar inside the task (Grafana Labs).


Part 1: What Each Tool Does

Tool | What it does | Why DevOps/SRE uses it
ECS | Runs containers on AWS | Deploy microservices
Fargate | Serverless container runtime | No EC2 patching/management
IAM Role | Gives permission securely | No hardcoded AWS keys
Prometheus | Stores metrics | CPU, memory, request rate, errors
Grafana | Visual dashboard | See health visually
Loki | Stores logs | Troubleshoot errors
Alloy | Collects metrics/logs/traces | Modern agent replacing many old agents
Node Exporter | Exposes EC2 Linux metrics | Monitor EC2 server health

Part 2: EC2 Monitoring Server Check

Your EC2 already has:

Prometheus
Grafana
Node Exporter
Loki
Alloy

Step 1: Check all services

Run on EC2:

sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status loki
sudo systemctl status alloy
sudo systemctl status node_exporter

Expected:

active (running)

Why we check this

Before we connect ECS, the central monitoring server must be healthy.

SRE/DevOps checks

DevOps checks:

sudo ss -tulnp | grep -E '3000|9090|9100|3100'

Expected ports:

3000 Grafana
9090 Prometheus
9100 Node Exporter
3100 Loki

SRE checks:

curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:9100/metrics

Expected:

Prometheus ready
Loki ready
Node metrics visible
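The readiness checks above can be wrapped in one small script so the whole server is probed in a single run. This is a minimal sketch, not part of the original lab: the function name check_endpoint is made up for illustration, and the Grafana URL uses Grafana's /api/health endpoint, which the lab itself does not mention.

```shell
#!/usr/bin/env bash
# Sketch: probe each monitoring endpoint and print one line per service.

# check_endpoint NAME URL -> prints "OK NAME" or "FAIL NAME (URL)"
check_endpoint() {
  local name="$1" url="$2"
  # -f: treat HTTP >= 400 as failure, -sS: quiet, --max-time: never hang
  if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
    echo "OK   $name"
  else
    echo "FAIL $name ($url)"
    return 1
  fi
}

failed=0
check_endpoint prometheus    http://localhost:9090/-/ready  || failed=1
check_endpoint loki          http://localhost:3100/ready    || failed=1
check_endpoint node_exporter http://localhost:9100/metrics  || failed=1
check_endpoint grafana       http://localhost:3000/api/health || failed=1

if [ "$failed" -eq 0 ]; then
  echo "all monitoring endpoints responded"
else
  echo "at least one endpoint failed; check systemctl status for that service"
fi
```

Run it on the EC2 monitoring server itself, since every URL points at localhost.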

Part 3: Fix Prometheus for Remote Write

Fargate tasks are dynamic. Their private IP changes. So instead of Prometheus scraping every task IP, Alloy inside Fargate will push metrics to Prometheus.

Step 2: Enable Prometheus remote write receiver

Open a drop-in override for the Prometheus service (this leaves the original unit file untouched):

sudo systemctl edit prometheus

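Add the following, adjusting the binary path and flags to match your existing unit (systemctl cat prometheus shows them). The empty ExecStart= line clears the original command so the new one replaces it: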

[Service]
ExecStart=
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver

Restart:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl status prometheus

Test:

curl http://localhost:9090/-/ready

Why we do this

Fargate containers cannot easily be scraped by fixed IP because tasks start/stop dynamically. Remote write lets Alloy push metrics to Prometheus.
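
The /-/ready check only shows that Prometheus is up; it does not confirm that the receiver flag took effect. One way to check, sketched below, is to read the running flags from the Prometheus /api/v1/status/flags endpoint. The function name receiver_enabled and the PROM_URL variable are assumptions introduced for this example.

```shell
#!/usr/bin/env bash
# receiver_enabled: reads the JSON from /api/v1/status/flags on stdin and
# succeeds only if web.enable-remote-write-receiver is reported as "true".
receiver_enabled() {
  grep -Eq '"web\.enable-remote-write-receiver"[[:space:]]*:[[:space:]]*"true"'
}

# PROM_URL is an assumption; it matches the localhost address used in this lab.
PROM_URL="${PROM_URL:-http://localhost:9090}"

if curl -fsS --max-time 5 "$PROM_URL/api/v1/status/flags" 2>/dev/null | receiver_enabled; then
  echo "remote write receiver: enabled"
else
  echo "remote write receiver: not enabled (or Prometheus unreachable)"
fi
```

If the script reports "not enabled" after a restart, the override from Step 2 was probably not picked up; systemctl cat prometheus shows the command line systemd actually uses.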


Part 4: EC2 Security Group

In AWS Console:

Go to:

EC2 → Instances → Select monitoring EC2 → Security → Security Group

Add inbound rules:

Port | Source | Purpose
3000 | Your IP only | Grafana UI
9090 | VPC CIDR only | Prometheus remote write
3100 | VPC CIDR only | Loki logs
9100 | Your IP or VPC only | Node Exporter test only

Example VPC CIDR:

10.0.0.0/16

Do not open 9090, 3100, 9100 to 0.0.0.0/0.

Why we do this

Neither Prometheus nor Loki authenticates requests by default, so anyone who can reach these ports can read or write data. Keep them private.
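
The same inbound rules can be expressed with the AWS CLI instead of the console. The sketch below only prints the commands unless DRY_RUN=0 is set. SG_ID, VPC_CIDR and MY_IP are placeholder values, and the helper names run and allow_tcp are made up for this example.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): substitute your own security group, CIDR and IP.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
VPC_CIDR="${VPC_CIDR:-10.0.0.0/16}"
MY_IP="${MY_IP:-203.0.113.10/32}"

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# allow_tcp PORT CIDR -> one inbound TCP rule on the monitoring security group
allow_tcp() {
  run aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$1" --cidr "$2"
}

allow_tcp 3000 "$MY_IP"     # Grafana UI, your IP only
allow_tcp 9090 "$VPC_CIDR"  # Prometheus remote write, VPC only
allow_tcp 3100 "$VPC_CIDR"  # Loki push, VPC only
allow_tcp 9100 "$VPC_CIDR"  # Node Exporter, VPC only
```

Review the printed commands first, then rerun with DRY_RUN=0 once the values are correct.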


Part 5: Configure Alloy on EC2

Open:

sudo nano /etc/alloy/config.alloy

Use this:

prometheus.exporter.unix "local_host" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
}

prometheus.scrape "local_host" {
  targets    = prometheus.exporter.unix.local_host.targets
  forward_to = [prometheus.remote_write.local_prom.receiver]
}

prometheus.remote_write "local_prom" {
  endpoint {
    url = "http://127.0.0.1:9090/api/v1/write"
  }
}

loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog"},
    {__path__ = "/var/log/auth.log", job = "auth"},
    {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
  ]
  forward_to = [loki.write.local_loki.receiver]
}

loki.write "local_loki" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}

Format the config, then restart:

sudo alloy fmt --write /etc/alloy/config.alloy
sudo systemctl restart alloy
sudo systemctl status alloy

Important correction

Use:

127.0.0.1

Not:

123.0.0.1

Part 6: Create ECS IAM Roles

Role 1: ECS Task Execution Role

AWS Console:

IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task

Attach:

AmazonECSTaskExecutionRolePolicy

Name:

ecsTaskExecutionRole

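Why

This allows ECS/Fargate to:

Pull image from ECR
Send logs to CloudWatch

The managed policy does not cover everything a task may need. Reading Secrets Manager values requires an extra policy on this role. Creating a log group requires logs:CreateLogGroup, which the managed policy omits; the task definition in Part 9 sets awslogs-create-group to true, so either add that permission or create the /ecs/fargate-observability-lab log group beforehand.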

Role 2: ECS Task Role

Create another role:

IAM → Roles → Create role → ECS Task

Name:

ecsAppTaskRole

For this lab, start with no extra permissions.

If the app needs S3 later, add only the exact S3 permissions it requires.

Why

The task role is for your application container, not for ECS itself.
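
Both roles can also be created from the AWS CLI. The sketch below writes the trust policy that lets ECS tasks assume a role, then prints the create and attach commands (set DRY_RUN=0 to execute them). The file name ecs-tasks-trust.json and the run helper are assumptions made for this example; the role and policy names come from the steps above.

```shell
#!/usr/bin/env bash
# Trust policy shared by both roles: only ECS tasks may assume them.
TRUST_FILE="${TRUST_FILE:-ecs-tasks-trust.json}"
cat > "$TRUST_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# Role 1: execution role with the AWS managed policy attached
run aws iam create-role --role-name ecsTaskExecutionRole \
  --assume-role-policy-document "file://$TRUST_FILE"
run aws iam attach-role-policy --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

# Role 2: task role, intentionally created with no permissions
run aws iam create-role --role-name ecsAppTaskRole \
  --assume-role-policy-document "file://$TRUST_FILE"
```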


Part 7: Create ECS Cluster

AWS Console:

ECS → Clusters → Create cluster

Choose:

AWS Fargate

Name:

prod-observability-cluster

Click:

Create

Why

A cluster is the logical grouping in which ECS services and tasks run.


Part 8: Create Simple Application Container

To keep the lab simple, use a demo app that exposes Prometheus metrics on port 8080.

Example image:

ghcr.io/brancz/prometheus-example-app:v0.5.0

It exposes:

/metrics

Port:

8080

Part 9: Create Fargate Task Definition

Go to:

ECS → Task Definitions → Create new task definition → Create new task definition with JSON

Use this template:

{
  "family": "fargate-observability-lab",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
  "containerDefinitions": [
    {
      "name": "demo-app",
      "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "demo-app",
          "awslogs-create-group": "true"
        }
      }
    },
    {
      "name": "alloy-sidecar",
      "image": "grafana/alloy:latest",
      "essential": false,
      "command": [
        "run",
        "--server.http.listen-addr=0.0.0.0:12345",
        "--stability.level=experimental",
        "/etc/alloy/fargate.alloy"
      ],
      "environment": [
        {
          "name": "ALLOY_STABILITY_LEVEL",
          "value": "experimental"
        },
        {
          "name": "EC2_PROMETHEUS_URL",
          "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write"
        },
        {
          "name": "EC2_LOKI_URL",
          "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push"
        }
      ],
      "portMappings": [
        {
          "containerPort": 12345,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "alloy",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}

Replace:

<ACCOUNT_ID>
<EC2_PRIVATE_IP>
us-east-2 if your region is different

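Important note

The stock grafana/alloy image does not contain /etc/alloy/fargate.alloy, so the sidecar in the template above cannot start until you supply that file. For a real production setup, deliver the Alloy config through one of:

EFS
S3 pulled at startup
custom Alloy image

For a class or demo, a custom Alloy image is easiest: build an image that includes fargate.alloy and use it in place of grafana/alloy:latest in the task definition.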
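
A custom image only needs the config copied to the path that the sidecar's command points at. The sketch below writes a two-line Dockerfile and prints the build and push commands. IMAGE_URI and BUILD_DIR are placeholders chosen for this example, and the ECR repository name alloy-fargate is hypothetical.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): point IMAGE_URI at your own registry/repository.
IMAGE_URI="${IMAGE_URI:-<ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/alloy-fargate:lab}"
BUILD_DIR="${BUILD_DIR:-alloy-image}"
mkdir -p "$BUILD_DIR"

# Bake the config into the image at the path the task definition's command expects.
cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM grafana/alloy:latest
COPY fargate.alloy /etc/alloy/fargate.alloy
EOF

echo "Copy fargate.alloy (Part 10) into $BUILD_DIR, then run:"
echo "  docker build -t $IMAGE_URI $BUILD_DIR"
echo "  docker push $IMAGE_URI"
```

Pushing to ECR requires a docker login against the registry first. Pinning the base image to a specific Alloy version instead of latest would make rebuilds reproducible.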


Part 10: Alloy Fargate Config

Create file:

fargate.alloy

Content:

prometheus.scrape "app_metrics" {
  targets = [
    {"__address__" = "127.0.0.1:8080", "job" = "demo-app"}
  ]

  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
  collection_interval = "30s"

  output {
    metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
  }
}

otelcol.exporter.prometheus "fargate_to_prom" {
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

prometheus.remote_write "ec2_prometheus" {
  endpoint {
    url = env("EC2_PROMETHEUS_URL")
  }
}

Why

This collects:

Application /metrics
Fargate task CPU
Fargate task memory
Container-level metrics

Part 11: Run ECS Service

Go to:

ECS → Clusters → prod-observability-cluster → Services → Create

Choose:

Launch type: Fargate
Task definition: fargate-observability-lab
Service name: demo-app-service
Desired tasks: 1

Networking:

VPC: same VPC as EC2 monitoring server
Subnets: private subnets preferred
Security group: allow outbound to EC2 private IP ports 9090 and 3100
Public IP: disabled if private subnet has NAT

Click:

Create

What to check

Go to:

ECS → Cluster → Service → Tasks

Expected:

Task status: Running
Containers: demo-app running, alloy-sidecar running
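The same check can be scripted with aws ecs describe-services by comparing the desired and running counts. This is a sketch under the lab's names (prod-observability-cluster, demo-app-service); the helper counts_match and the CLUSTER and SERVICE variables are introduced only for this example.

```shell
#!/usr/bin/env bash
CLUSTER="${CLUSTER:-prod-observability-cluster}"
SERVICE="${SERVICE:-demo-app-service}"

# counts_match: reads "DESIRED RUNNING" (whitespace-separated) on stdin and
# succeeds only when both values are present and equal.
counts_match() {
  local desired running
  read -r desired running || [ -n "$desired" ] || return 1
  [ -n "$desired" ] && [ "$desired" = "$running" ]
}

if aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
     --query 'services[0].[desiredCount,runningCount]' --output text 2>/dev/null \
   | counts_match; then
  echo "service healthy: running count equals desired count"
else
  echo "service not at desired count (or the AWS CLI call failed)"
fi
```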

Part 12: Verify in Prometheus

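Open the Prometheus UI. Part 4 limits port 9090 to the VPC CIDR, so this URL only loads if you also allow your own IP on 9090 or reach the server through a VPN or SSH tunnel: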

http://<EC2_PUBLIC_IP>:9090

Go to:

Status → TSDB Status

Then search in Graph:

up

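Check Alloy internal metrics (these only appear if Alloy's own /metrics endpoint is scraped, for example through a prometheus.exporter.self component; the configs in this lab do not include one):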

alloy_component_controller_running_components

Check EC2 CPU:

rate(node_cpu_seconds_total[5m])

Check EC2 memory:

node_memory_MemAvailable_bytes

Check app request metrics:

http_requests_total

Check Fargate container metrics:

ecs_task_memory_utilized

or:

container_memory_usage_bytes

Metric names may vary depending on Alloy/OpenTelemetry conversion.
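
These lookups can also be run from the EC2 shell through the Prometheus instant-query API, which helps when the UI is not reachable. The sketch below reports whether each metric currently has at least one series. The helper names query and has_series are made up for this example, and the Fargate metric name is the one used above, which may differ in your setup.

```shell
#!/usr/bin/env bash
PROM_URL="${PROM_URL:-http://localhost:9090}"

# query EXPR -> raw JSON from the Prometheus instant-query API
query() {
  curl -fsS --max-time 5 --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$1" 2>/dev/null
}

# has_series: reads a query response on stdin; succeeds only if the query
# succeeded and returned a non-empty result.
has_series() {
  local body
  body="$(cat)"
  case "$body" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$body" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

for metric in up node_cpu_seconds_total http_requests_total ecs_task_memory_utilized; do
  if query "$metric" | has_series; then
    echo "present  $metric"
  else
    echo "missing  $metric"
  fi
done
```

A "missing" line for the Fargate metric does not necessarily mean the sidecar is broken; browsing metric names that start with ecs_ or container_ in the Prometheus UI shows what the conversion actually produced.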


Part 13: Verify in Grafana

Open:

http://<EC2_PUBLIC_IP>:3000

Go to:

Connections → Data sources

Add Prometheus:

URL: http://localhost:9090

Add Loki:

URL: http://localhost:3100

Click:

Save & test

Expected:

Data source is working

Part 14: Grafana Explore Queries

Go to:

Grafana → Explore → Prometheus

Use:

up
rate(node_network_receive_bytes_total[1m])
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
rate(http_requests_total[5m])

Go to:

Grafana → Explore → Loki

Use:

{job="syslog"}
{job="auth"}
{job="nginx_access"}

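For ECS logs, check CloudWatch first. In this lab both containers log through the awslogs driver, and fargate.alloy defines no Loki pipeline, so EC2_LOKI_URL is unused and Fargate logs do not appear in Loki: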

CloudWatch → Log groups → /ecs/fargate-observability-lab

Part 15: What SRE Must Monitor

1. EC2 monitoring server health

100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

Alert if:

Memory > 85%

Why:

If monitoring server dies, you lose visibility.

2. Disk usage

100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})

Alert if:

Disk > 80%

Why:

Prometheus and Loki can fill disk quickly.

3. Fargate task memory

ecs_task_memory_utilized / ecs_task_memory_reserved * 100

Alert if:

> 85% for 3 minutes

Why:

Fargate kills containers when memory limit is reached.

4. Application request rate

sum(rate(http_requests_total[5m]))

Why:

If traffic drops to zero, app or routing may be broken.

5. Error rate

sum(rate(http_requests_total{code=~"5.."}[5m]))

Why:

5xx errors show application or dependency failure.
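The thresholds above can be turned into a Prometheus alerting rule file. The sketch below writes one and validates it with promtool when that tool is installed. The alert names, the file name lab-alerts.yml, the severity labels and the 5m durations are assumptions made for this example; only the expressions, the thresholds and the 3 minute window come from the checks above. Prometheus loads such a file only if it is listed under rule_files in prometheus.yml, and the Fargate metric names may need adjusting.

```shell
#!/usr/bin/env bash
RULES_FILE="${RULES_FILE:-lab-alerts.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: observability-lab
    rules:
      - alert: MonitoringServerMemoryHigh
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 85
        for: 5m
        labels:
          severity: warning
      - alert: MonitoringServerDiskHigh
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"}) > 80
        for: 5m
        labels:
          severity: warning
      - alert: FargateTaskMemoryHigh
        expr: ecs_task_memory_utilized / ecs_task_memory_reserved * 100 > 85
        for: 3m
        labels:
          severity: warning
      - alert: AppTrafficStopped
        expr: sum(rate(http_requests_total[5m])) == 0
        for: 5m
        labels:
          severity: critical
EOF

# Validate the file if promtool (shipped with Prometheus) is on the PATH.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE" || echo "rule validation failed"
else
  echo "promtool not found; wrote $RULES_FILE without validating it"
fi
```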

Part 16: What DevOps Must Check

DevOps engineer checks:

1. IAM roles are correct
2. ECS task is running
3. Security groups allow only needed ports
4. Fargate can reach EC2 private IP
5. Prometheus remote write is enabled
6. Loki is receiving logs
7. Grafana data sources work
8. No public access to Prometheus/Loki/Node Exporter
9. ECS service has desired count = running count
10. CloudWatch logs exist for both containers

Part 17: Troubleshooting

Problem: ECS task running but no metrics

Check Alloy logs:

ECS → Task → alloy-sidecar → Logs

Look for:

connection refused
timeout
remote write failed

Common causes:

EC2 security group blocks port 9090
Wrong EC2 private IP
Prometheus remote write receiver not enabled
Alloy config error

Problem: Grafana shows no Loki logs

Check:

curl http://localhost:3100/ready
sudo journalctl -u alloy -f
sudo journalctl -u loki -f

Common causes:

Loki not running
Wrong Loki URL
Alloy cannot read log files
No permissions on /var/log/*
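The two permission causes can be tested directly by checking, as the Alloy service user, whether each log file from Part 5 is readable. This is a sketch: check_readable is a name made up for the example, and it assumes the service runs as a user called alloy, which may differ on your install.

```shell
#!/usr/bin/env bash
# check_readable FILE -> reports whether the current user can read FILE
check_readable() {
  if [ -r "$1" ]; then
    echo "readable  $1"
  else
    echo "BLOCKED   $1"
    return 1
  fi
}

# Paths are the ones configured in loki.source.file in Part 5.
for f in /var/log/syslog /var/log/auth.log \
         /var/log/nginx/access.log /var/log/nginx/error.log; do
  check_readable "$f" || true
done
```

Save it as a script and run it as the service user, for example sudo -u alloy bash check_logs.sh (the file name is arbitrary). A BLOCKED line means the file is missing or the user lacks read access; on Ubuntu, /var/log/syslog is typically readable by the adm group.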

Problem: Node Exporter works but Fargate metrics missing

Cause:

Node Exporter monitors EC2 only.
It cannot monitor Fargate hosts.

Correct approach:

Use Alloy sidecar with ECS container metrics receiver.

Final Teaching Summary

This lab demonstrates a real DevOps/SRE production pattern:

ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.

The most important SRE mindset:

Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.
