Goal
You will build this architecture:
ECS Fargate Application
|
| metrics/logs
v
Alloy sidecar
|
| remote_write metrics
| push logs
v
EC2 Monitoring Server
- Prometheus :9090
- Grafana :3000
- Loki :3100
- Alloy
- Node Exporter
Per the AWS documentation, ECS Fargate tasks use two IAM roles: the task execution role lets ECS itself pull images and send logs, while the task role grants AWS permissions to the application. (AWS Documentation) Alloy can collect ECS/Fargate container metrics from the ECS Task Metadata Endpoint v4, and to do so it should run as a sidecar inside the task. (Grafana Labs)
Part 1: What Each Tool Does
| Tool | What it does | Why DevOps/SRE uses it |
|---|---|---|
| ECS | Runs containers on AWS | Deploy microservices |
| Fargate | Serverless container runtime | No EC2 patching/management |
| IAM Role | Gives permission securely | No hardcoded AWS keys |
| Prometheus | Stores metrics | CPU, memory, request rate, errors |
| Grafana | Visual dashboard | See health visually |
| Loki | Stores logs | Troubleshoot errors |
| Alloy | Collects metrics/logs/traces | One agent that replaces separate collectors such as Grafana Agent and Promtail |
| Node Exporter | Exposes EC2 Linux metrics | Monitor EC2 server health |
Part 2: EC2 Monitoring Server Check
Your EC2 already has:
Prometheus
Grafana
Node Exporter
Loki
Alloy
Step 1: Check all services
Run on EC2:
sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status loki
sudo systemctl status alloy
sudo systemctl status node_exporter
Expected:
active (running)
Why we check this
Before we connect ECS, the central monitoring server must be healthy.
SRE/DevOps checks
DevOps checks:
sudo ss -tulnp | grep -E '3000|9090|9100|3100'
Expected ports:
3000 Grafana
9090 Prometheus
9100 Node Exporter
3100 Loki
SRE checks:
curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:9100/metrics
Expected:
Prometheus: a line such as "Prometheus Server is Ready."
Loki: ready
Node Exporter: a long list of node_* metrics
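The readiness checks above can be wrapped in a small loop so they are quick to rerun after every change. This is a minimal sketch, assuming `curl` is installed and the services listen on the default ports listed above. The `check` helper name is made up for illustration, and the Grafana `/api/health` URL is an addition that the steps above do not use.

```shell
# check NAME URL: print "NAME OK" if the URL answers without an HTTP error,
# otherwise "NAME FAILED" (connection refused, timeout, or a 4xx/5xx status).
check() {
  if curl -fsS --max-time 3 "$2" >/dev/null 2>&1; then
    echo "$1 OK"
  else
    echo "$1 FAILED"
  fi
}

check prometheus    http://localhost:9090/-/ready
check loki          http://localhost:3100/ready
check node_exporter http://localhost:9100/metrics
check grafana       http://localhost:3000/api/health
```

Loki can report not-ready for a short while after a restart, so a single FAILED right after `systemctl restart loki` is not necessarily a problem; rerun the check after a few seconds.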
Part 3: Fix Prometheus for Remote Write
Fargate tasks are dynamic. Their private IP changes. So instead of Prometheus scraping every task IP, Alloy inside Fargate will push metrics to Prometheus.
Step 2: Enable Prometheus remote write receiver
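Create a systemd override for the Prometheus service:
sudo systemctl edit prometheus
Add the following, adjusting the binary, config, and data paths to match your installation. The empty ExecStart= line clears the packaged command before the new one is set:
[Service]
ExecStart=
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver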
Restart:
sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl status prometheus
Test:
curl http://localhost:9090/-/ready
Why we do this
Fargate containers cannot easily be scraped by fixed IP because tasks start/stop dynamically. Remote write lets Alloy push metrics to Prometheus.
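One quick way to confirm the flag took effect is to send an empty POST to the write endpoint and look at the status code. The sketch below assumes that Prometheus answers 404 while the receiver is disabled and some other code (typically 400 for an empty body) once it is enabled; verify the exact codes on your version. `classify_write_status` is a hypothetical helper name.

```shell
# classify_write_status CODE: interpret the HTTP status returned by /api/v1/write.
classify_write_status() {
  case "$1" in
    000) echo "unreachable: Prometheus is down or the port is blocked" ;;
    404) echo "receiver disabled: --web.enable-remote-write-receiver is not active" ;;
    *)   echo "receiver answered with HTTP $1" ;;
  esac
}

# On the EC2 server:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST http://localhost:9090/api/v1/write)
#   classify_write_status "$code"
classify_write_status 404
```

Running the same curl command from another host in the VPC, with the EC2 private IP instead of localhost, also exercises the security group rule for port 9090.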
Part 4: EC2 Security Group
In AWS Console:
Go to:
EC2 → Instances → Select monitoring EC2 → Security → Security Group
Add inbound rules:
| Port | Source | Purpose |
|---|---|---|
| 3000 | Your IP only | Grafana UI |
| 9090 | VPC CIDR only | Prometheus remote write |
| 3100 | VPC CIDR only | Loki logs |
| 9100 | Your IP or VPC only | Node Exporter test only |
Example VPC CIDR:
10.0.0.0/16
Do not open 9090, 3100, 9100 to 0.0.0.0/0.
Why we do this
Prometheus and Loki ship with no authentication by default, so anyone who can reach these ports can read or write your data. Keep them private.
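If you prefer the AWS CLI to the console, the rules in the table can be generated as commands. This is a sketch rather than a finished script: it only prints the commands so you can review them before running, and the security group ID and "your IP" values are placeholders. `ingress_cmd` is a hypothetical helper.

```shell
# Placeholders: override these with your own values before use.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"       # hypothetical security group ID
VPC_CIDR="${VPC_CIDR:-10.0.0.0/16}"          # example VPC CIDR from above
MY_IP_CIDR="${MY_IP_CIDR:-203.0.113.10/32}"  # documentation-range address, not a real IP

# ingress_cmd PORT CIDR: print the AWS CLI command for one inbound TCP rule.
# Refuses to generate a rule that opens the port to the whole internet.
ingress_cmd() {
  if [ "$2" = "0.0.0.0/0" ]; then
    echo "refusing to open port $1 to 0.0.0.0/0" >&2
    return 1
  fi
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $1 --cidr $2"
}

ingress_cmd 3000 "$MY_IP_CIDR"   # Grafana UI
ingress_cmd 9090 "$VPC_CIDR"     # Prometheus remote write
ingress_cmd 3100 "$VPC_CIDR"     # Loki logs
ingress_cmd 9100 "$VPC_CIDR"     # Node Exporter test
```

Printing instead of executing is a deliberate choice here: a security group change is easy to get wrong, and a dry run lets you check each port and source first.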
Part 5: Configure Alloy on EC2
Open:
sudo nano /etc/alloy/config.alloy
Use this:
prometheus.exporter.unix "local_host" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
}
prometheus.scrape "local_host" {
  targets    = prometheus.exporter.unix.local_host.targets
  forward_to = [prometheus.remote_write.local_prom.receiver]
}
prometheus.remote_write "local_prom" {
  endpoint {
    url = "http://127.0.0.1:9090/api/v1/write"
  }
}
loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog"},
    {__path__ = "/var/log/auth.log", job = "auth"},
    {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
  ]
  forward_to = [loki.write.local_loki.receiver]
}
loki.write "local_loki" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}
Format the config, then restart:
sudo alloy fmt --write /etc/alloy/config.alloy
sudo systemctl restart alloy
sudo systemctl status alloy
Check the loopback address
Both endpoint URLs must use:
127.0.0.1
Not:
123.0.0.1
Part 6: Create ECS IAM Roles
Role 1: ECS Task Execution Role
AWS Console:
IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task
Attach:
AmazonECSTaskExecutionRolePolicy
Name:
ecsTaskExecutionRole
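Why
This allows ECS/Fargate to:
Pull images from ECR
Send logs to CloudWatch
The managed policy does not cover everything. Reading Secrets Manager values and creating log groups (logs:CreateLogGroup, which the awslogs-create-group option in Part 9 relies on) need extra permissions on this role.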
Role 2: ECS Task Role
Create another role:
IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task
Name:
ecsAppTaskRole
For this lab, start with no extra permissions.
If the app needs S3 later, grant only the specific S3 actions and buckets it uses.
Why
The task role is for your application container, not for ECS itself.
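Both roles can also be created from the command line. The sketch below writes the two policy documents to a scratch directory and lists the AWS CLI calls as comments, so nothing touches your account until you run them yourself. The inline policy name `CreateLogGroup` and the `WORKDIR` variable are illustrative choices, and `"Resource": "*"` is kept broad for the lab; narrow it to the log group ARN for anything longer-lived.

```shell
WORKDIR="${WORKDIR:-$(mktemp -d)}"   # scratch directory for the policy files

# Trust policy: lets ECS tasks assume the role. Both roles use the same one.
cat > "$WORKDIR/ecs-tasks-trust.json" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Extra permission for the execution role: the managed policy does not include
# logs:CreateLogGroup, which the "awslogs-create-group" option relies on.
cat > "$WORKDIR/create-log-group.json" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "logs:CreateLogGroup", "Resource": "*" }
  ]
}
EOF

echo "Policy files written to $WORKDIR"

# Then, with credentials that are allowed to manage IAM:
#   aws iam create-role --role-name ecsTaskExecutionRole \
#     --assume-role-policy-document "file://$WORKDIR/ecs-tasks-trust.json"
#   aws iam attach-role-policy --role-name ecsTaskExecutionRole \
#     --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
#   aws iam put-role-policy --role-name ecsTaskExecutionRole \
#     --policy-name CreateLogGroup \
#     --policy-document "file://$WORKDIR/create-log-group.json"
#   aws iam create-role --role-name ecsAppTaskRole \
#     --assume-role-policy-document "file://$WORKDIR/ecs-tasks-trust.json"
```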
Part 7: Create ECS Cluster
AWS Console:
ECS → Clusters → Create cluster
Choose:
AWS Fargate
Name:
prod-observability-cluster
Click:
Create
Why
A cluster is the logical grouping in which ECS services and tasks run.
Part 8: Create Simple Application Container
For the simplest lab, use a demo app that exposes Prometheus metrics on port 8080.
Example image:
ghcr.io/brancz/prometheus-example-app:v0.5.0
It exposes:
/metrics
Port:
8080
Part 9: Create Fargate Task Definition
Go to:
ECS → Task Definitions → Create new task definition → Create new task definition with JSON
Use this template:
{
  "family": "fargate-observability-lab",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
  "containerDefinitions": [
    {
      "name": "demo-app",
      "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "demo-app",
          "awslogs-create-group": "true"
        }
      }
    },
    {
      "name": "alloy-sidecar",
      "image": "grafana/alloy:latest",
      "essential": false,
      "command": [
        "run",
        "--server.http.listen-addr=0.0.0.0:12345",
        "/etc/alloy/fargate.alloy"
      ],
      "environment": [
        {
          "name": "ALLOY_STABILITY_LEVEL",
          "value": "experimental"
        },
        {
          "name": "EC2_PROMETHEUS_URL",
          "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write"
        },
        {
          "name": "EC2_LOKI_URL",
          "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push"
        }
      ],
      "portMappings": [
        {
          "containerPort": 12345,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "alloy",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
Replace:
<ACCOUNT_ID>
<EC2_PRIVATE_IP>
us-east-2 if your region is different
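Replacing the placeholders by hand is error-prone, so a small `sed` filter can do it. This is a sketch under the assumption that the template above is saved to a file. `render_taskdef` is a hypothetical helper, and the account ID and IP in the example are made-up values.

```shell
# render_taskdef ACCOUNT_ID EC2_PRIVATE_IP REGION
# Reads the template on stdin and writes the filled-in JSON to stdout.
render_taskdef() {
  sed -e "s/<ACCOUNT_ID>/$1/g" \
      -e "s/<EC2_PRIVATE_IP>/$2/g" \
      -e "s/us-east-2/$3/g"
}

# Example on a single line of the template (123456789012 and 10.0.1.25 are placeholders):
printf '%s\n' '"value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write"' \
  | render_taskdef 123456789012 10.0.1.25 eu-west-1
# prints: "value": "http://10.0.1.25:9090/api/v1/write"

# Typical use, assuming the template is saved as taskdef.template.json:
#   render_taskdef 123456789012 10.0.1.25 us-east-2 < taskdef.template.json > taskdef.json
#   aws ecs register-task-definition --cli-input-json file://taskdef.json
```

Registering the rendered file with `aws ecs register-task-definition` is an alternative to pasting the JSON into the console.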
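Important note
The stock grafana/alloy image does not contain /etc/alloy/fargate.alloy, so the file has to be supplied to the sidecar. Options:
EFS volume mounted into the sidecar
S3 object pulled at startup
Custom Alloy image with the file baked in
For a class or demo, a custom Alloy image is the easiest.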
Part 10: Alloy Fargate Config
Create file:
fargate.alloy
Content:
prometheus.scrape "app_metrics" {
  targets = [
    {"__address__" = "127.0.0.1:8080", "job" = "demo-app"},
  ]
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}
otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
  collection_interval = "30s"
  output {
    metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
  }
}
otelcol.exporter.prometheus "fargate_to_prom" {
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}
prometheus.remote_write "ec2_prometheus" {
  endpoint {
    url = env("EC2_PROMETHEUS_URL")
  }
}
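Why
This collects:
Application /metrics
Fargate task CPU
Fargate task memory
Container-level metrics
It does not ship logs. EC2_LOKI_URL is set in the task definition, but no loki component uses it in this config, so container logs go to CloudWatch through the awslogs driver.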
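The task definition starts Alloy with /etc/alloy/fargate.alloy, so that file has to exist inside the sidecar image. One way to get it there is a two-line Dockerfile on top of the official image. The sketch below prepares such a build context in a scratch directory. The `alloy-fargate` repository name, the `v1` tag, and the `BUILD_DIR` variable are illustrative, and the stub config is there only so the sketch runs on its own.

```shell
BUILD_DIR="${BUILD_DIR:-$(mktemp -d)}"   # scratch build context

# In practice, copy your real fargate.alloy here; this stub only keeps the sketch self-contained.
if [ ! -f "$BUILD_DIR/fargate.alloy" ]; then
  echo '// paste the fargate.alloy content here' > "$BUILD_DIR/fargate.alloy"
fi

# Dockerfile: start from the official image and add the config at the path
# that the task definition passes to "alloy run".
cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM grafana/alloy:latest
COPY fargate.alloy /etc/alloy/fargate.alloy
EOF

echo "Build context ready in $BUILD_DIR"

# Build and push (the repository URI is a placeholder for your own ECR repo):
#   docker build -t <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/alloy-fargate:v1 "$BUILD_DIR"
#   docker push  <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/alloy-fargate:v1
# Then use that URI as the "image" of the alloy-sidecar container.
```

Pinning a specific Alloy version instead of `latest` in the `FROM` line makes the sidecar reproducible across deployments.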
Part 11: Run ECS Service
Go to:
ECS → Clusters → prod-observability-cluster → Services → Create
Choose:
Launch type: Fargate
Task definition: fargate-observability-lab
Service name: demo-app-service
Desired tasks: 1
Networking:
VPC: same VPC as EC2 monitoring server
Subnets: private subnets preferred
Security group: allow outbound to EC2 private IP ports 9090 and 3100
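Public IP: disabled if the private subnet has a NAT gateway (the task still needs outbound internet access to pull the images); enable it if the task runs in a public subnet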
Click:
Create
What to check
Go to:
ECS → Cluster → Service → Tasks
Expected:
Task status: Running
Containers: demo-app running, alloy-sidecar running
Part 12: Verify in Prometheus
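Port 9090 is restricted to the VPC CIDR in Part 4, so this URL works only from inside the VPC or after temporarily allowing your IP:
http://<EC2_PUBLIC_IP>:9090
Alternatively, forward the port over SSH and open http://localhost:9090 instead (<USER> is your SSH login):
ssh -L 9090:localhost:9090 <USER>@<EC2_PUBLIC_IP>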
Go to:
Status → TSDB Status
Then search in Graph:
up
Check Alloy internal metrics (these appear only if Alloy's own /metrics endpoint is being scraped):
alloy_component_controller_running_components
Check EC2 CPU:
rate(node_cpu_seconds_total[5m])
Check EC2 memory:
node_memory_MemAvailable_bytes
Check app request metrics:
http_requests_total
Check Fargate container metrics:
ecs_task_memory_utilized
or:
container_memory_usage_bytes
Metric names may vary depending on Alloy/OpenTelemetry conversion.
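The same queries can be run from a shell through the Prometheus HTTP API, which is handy when the UI is not reachable. This is a sketch assuming `curl` is available on the EC2 server. `prom_query`, `is_success`, and `has_series` are hypothetical helper names, and the `grep` patterns rely on the compact JSON that the API normally returns.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query and print the raw JSON reply.
prom_query() {
  curl -sG --max-time 5 "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# is_success: read a reply on stdin; succeed if Prometheus accepted the query.
is_success() {
  grep -q '"status":"success"'
}

# has_series: read a reply on stdin; fail if the result list is empty.
has_series() {
  if grep -q '"result":\[\]'; then
    return 1
  fi
  return 0
}

# On the EC2 server:
#   prom_query 'up' | is_success && echo "query accepted"
#   prom_query 'http_requests_total' | has_series && echo "app metrics are arriving"
```

A reply with status "success" but an empty result means the query is valid and no matching series exist yet, which usually points at the sidecar or the network path rather than at Prometheus itself.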
Part 13: Verify in Grafana
Open:
http://<EC2_PUBLIC_IP>:3000
Go to:
Connections → Data sources
Add Prometheus:
URL: http://localhost:9090
Add Loki:
URL: http://localhost:3100
Click:
Save & test
Expected:
Data source is working
Part 14: Grafana Explore Queries
Go to:
Grafana → Explore → Prometheus
Use:
up
rate(node_network_receive_bytes_total[1m])
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
rate(http_requests_total[5m])
Go to:
Grafana → Explore → Loki
Use:
{job="syslog"}
{job="auth"}
{job="nginx_access"}
The Fargate config in this lab does not push logs to Loki, so look for ECS container logs in CloudWatch:
CloudWatch → Log groups → /ecs/fargate-observability-lab
Part 15: What SRE Must Monitor
1. EC2 monitoring server health
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
Alert if:
Memory > 85%
Why:
If monitoring server dies, you lose visibility.
2. Disk usage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
Alert if:
Disk > 80%
Why:
Prometheus and Loki can fill disk quickly.
3. Fargate task memory
ecs_task_memory_utilized / ecs_task_memory_reserved * 100
Alert if:
> 85% for 3 minutes
Why:
A container that exceeds its memory limit is killed with an out-of-memory error, and if that container is essential the whole task stops.
4. Application request rate
sum(rate(http_requests_total[5m]))
Why:
If traffic drops to zero, app or routing may be broken.
5. Error rate
sum(rate(http_requests_total{code=~"5.."}[5m]))
Why:
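5xx errors show application or dependency failure. Dividing this expression by sum(rate(http_requests_total[5m])) gives an error ratio, which is easier to alert on than a raw rate.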
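To make the Fargate memory threshold in item 3 concrete, here is the same arithmetic outside PromQL. The numbers are illustrative: 1024 matches the task memory in the task definition, and 900 is an invented usage value. `mem_pct` and `over_threshold` are hypothetical helpers, and both inputs must be in the same unit.

```shell
# mem_pct UTILIZED RESERVED: memory use as a percentage, one decimal place.
mem_pct() {
  awk -v u="$1" -v r="$2" 'BEGIN { printf "%.1f\n", u / r * 100 }'
}

# over_threshold PCT LIMIT: succeed when PCT is strictly greater than LIMIT.
over_threshold() {
  awk -v p="$1" -v l="$2" 'BEGIN { exit !(p > l) }'
}

# A task that reserves 1024 and currently uses 900:
pct=$(mem_pct 900 1024)
echo "memory: ${pct}%"          # prints: memory: 87.9%
if over_threshold "$pct" 85; then
  echo "above the 85% alert threshold"
fi
```

In an alert rule the "for 3 minutes" condition matters as much as the threshold: a single sample at 87.9% should not page anyone, while a sustained one should.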
Part 16: What DevOps Must Check
DevOps engineer checks:
1. IAM roles are correct
2. ECS task is running
3. Security groups allow only needed ports
4. Fargate can reach EC2 private IP
5. Prometheus remote write is enabled
6. Loki is receiving logs
7. Grafana data sources work
8. No public access to Prometheus/Loki/Node Exporter
9. ECS service has desired count = running count
10. CloudWatch logs exist for both containers
Part 17: Troubleshooting
Problem: ECS task running but no metrics
Check Alloy logs:
ECS → Task → alloy-sidecar → Logs
Look for:
connection refused
timeout
remote write failed
Common causes:
EC2 security group blocks port 9090
Wrong EC2 private IP
Prometheus remote write receiver not enabled
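Alloy config error (if a component is rejected because of its stability level, try adding --stability.level=experimental to the run command)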
Problem: Grafana shows no Loki logs
Check:
curl http://localhost:3100/ready
sudo journalctl -u alloy -f
sudo journalctl -u loki -f
Common causes:
Loki not running
Wrong Loki URL
Alloy cannot read log files
The alloy user has no read permission on /var/log/*
Problem: Node Exporter works but Fargate metrics missing
Cause:
Node Exporter monitors only the EC2 server it runs on.
Fargate gives you no access to the underlying host, so Node Exporter cannot run there.
Correct approach:
Use Alloy sidecar with ECS container metrics receiver.
Final Teaching Summary
This lab demonstrates a real DevOps/SRE production pattern:
ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.
The most important SRE mindset:
Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.