Grafana Agent Installation and Configuration Documentation
Introduction to Grafana Agent
Grafana Agent is a single, lightweight binary that consolidates multiple observability tools into one solution. It replaces the need to run the following tools separately:
- Prometheus for metrics collection
- Promtail for log collection
- Node Exporter for system metrics
- cAdvisor for container metrics
- Other exporters from the Prometheus ecosystem
Key Advantages
- Single binary: Reduces operational complexity
- Smaller footprint: Lower resource usage compared to multiple agents
- Native Remote Write: Direct integration with Prometheus, Mimir, Cortex
- Unified configuration: Single YAML for all functionalities
- Flexibility: Enable/disable components as needed
Architecture and Components
┌─────────────────────────────────────────────────────────────┐
│ Grafana Agent │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Integrations │ Metrics │ Logs │
│ │ │ │
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────────────┐ │
│ │Node Exporter│ │ │ Prometheus │ │ │ Promtail │ │
│ │ │ │ │ Scraper │ │ │ │ │
│ │ ┌─────────┐ │ │ │ │ │ │ ┌─────────────────┐ │ │
│ │ │cAdvisor │ │ │ │ ┌─────────┐ │ │ │ │ Docker Logs │ │ │
│ │ └─────────┘ │ │ │ │ WAL │ │ │ │ │ │ │ │
│ │ │ │ │ └─────────┘ │ │ │ │ Syslog │ │ │
│ │ ┌─────────┐ │ │ │ │ │ │ │ │ │ │
│ │ │Custom │ │ │ │Remote Write │ │ │ │ File Logs │ │ │
│ │ │Exporters│ │ │ │ │ │ │ └─────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────────┘ │ │ │ │
│ └─────────────┘ │ │ └─────────────────────┘ │
└─────────────────┴─────────────────┴─────────────────────────┘
                           │                     │
                           ▼                     ▼
                 ┌──────────────────┐  ┌──────────────────┐
                 │ Mimir/Prometheus │  │       Loki       │
                 │     Backend      │  │     Backend      │
                 └──────────────────┘  └──────────────────┘
Grafana Agent Installation
1. Download and Installation
# Define version
AGENT_VERSION="v0.40.3"
ARCH="amd64" # or arm64 for ARM
# Download binary
wget https://github.com/grafana/agent/releases/download/${AGENT_VERSION}/grafana-agent-linux-${ARCH}.zip
# Extract and install
unzip grafana-agent-linux-${ARCH}.zip
sudo mv grafana-agent-linux-${ARCH} /usr/local/bin/grafana-agent
sudo chmod +x /usr/local/bin/grafana-agent
# Verify installation
grafana-agent --version
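Optionally, verify the archive before installing. The snippet below assumes the release publishes a SHA256SUMS file next to the binaries; adjust the file name if your release differs.
# Optional: verify the download (assumes a SHA256SUMS file is published with the release)
wget https://github.com/grafana/agent/releases/download/${AGENT_VERSION}/SHA256SUMS
grep "grafana-agent-linux-${ARCH}.zip" SHA256SUMS | sha256sum --check -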
2. User and Directory Creation
# Create grafana-agent user
sudo useradd --system --no-create-home --shell /bin/false grafana-agent
# Create necessary directories
sudo mkdir -p /etc/grafana-agent
sudo mkdir -p /var/lib/grafana-agent
sudo mkdir -p /var/log/grafana-agent
# Set permissions
sudo chown -R grafana-agent:grafana-agent /var/lib/grafana-agent
sudo chown -R grafana-agent:grafana-agent /var/log/grafana-agent
sudo chown grafana-agent:grafana-agent /etc/grafana-agent
3. Systemd Configuration
Create the service file:
sudo tee /etc/systemd/system/grafana-agent.service > /dev/null <<EOF
[Unit]
Description=Grafana Agent
Documentation=https://grafana.com/docs/agent/
Wants=network-online.target
After=network-online.target
Requires=network.target
[Service]
Type=simple
User=grafana-agent
Group=grafana-agent
ExecStart=/usr/local/bin/grafana-agent --config.file=/etc/grafana-agent/config.yaml --storage.path=/var/lib/grafana-agent
# Grafana Agent reloads its configuration on SIGHUP (needed for systemctl reload below)
ExecReload=/bin/kill -HUP \$MAINPID
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
# Resource limits
LimitNOFILE=65536
LimitNPROC=32768
# Security
NoNewPrivileges=true
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadWritePaths=/var/lib/grafana-agent /var/log/grafana-agent /tmp
[Install]
WantedBy=multi-user.target
EOF
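If you prefer not to maintain the separate template/envsubst step described later, Grafana Agent's static mode can also expand ${VAR} placeholders itself when started with --config.expand-env. A minimal sketch using a systemd drop-in, assuming the /etc/grafana-agent/environment file shown further below:
sudo mkdir -p /etc/systemd/system/grafana-agent.service.d
sudo tee /etc/systemd/system/grafana-agent.service.d/env.conf > /dev/null <<EOF
[Service]
# Load instance variables; the agent expands the placeholders itself via --config.expand-env
EnvironmentFile=/etc/grafana-agent/environment
ExecStart=
ExecStart=/usr/local/bin/grafana-agent --config.file=/etc/grafana-agent/config.yaml --storage.path=/var/lib/grafana-agent --config.expand-env
EOF
sudo systemctl daemon-reload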
4. Enable and Start Service
# Reload systemd
sudo systemctl daemon-reload
# Enable for boot startup
sudo systemctl enable grafana-agent
# Don't start yet (we'll configure first)
# sudo systemctl start grafana-agent
Complete Configuration
Main Configuration File
Create the file /etc/grafana-agent/config.yaml:
# /etc/grafana-agent/config.yaml

# Server configuration
server:
  log_level: info
  log_format: logfmt
  http_listen_port: 9090
  grpc_listen_port: 9091

# Integrations (built-in exporters)
integrations:
  # Integrated Node Exporter
  node_exporter:
    enabled: true
    # Filesystem paths
    rootfs_path: /
    sysfs_path: /sys
    procfs_path: /proc
    # Enabled collectors
    set_collectors:
      - uname
      - cpu
      - loadavg
      - meminfo
      - filesystem
      - netdev
      - diskstats
      - cpufreq
      - os
      - time
      - xfs
      - boottime
      - systemd
      - processes
      - nvme
      - nfs
      - netstat
      - logind
      - stat
      - vmstat
    # Relabeling configuration
    relabel_configs:
      - action: replace
        replacement: '${INSTANCE_NAME}'
        target_label: instance
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant

  # Integrated cAdvisor (optional)
  cadvisor:
    enabled: true
    docker_only: true
    instance: '${INSTANCE_NAME}'
    relabel_configs:
      - action: replace
        replacement: 'cadvisor'
        target_label: job
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant
    # Remove unnecessary labels
    metric_relabel_configs:
      - action: labeldrop
        regex: 'container_label_com_docker_compose_.*'
      - action: labeldrop
        regex: 'container_label_org_.*'

  # Integrated Process Exporter (optional)
  process_exporter:
    enabled: false
    config:
      process_names:
        - name: "{{.Comm}}"
          cmdline:
            - '.+'
# Metrics configuration
metrics:
  # Write-Ahead Log directory
  wal_directory: /var/lib/grafana-agent/wal

  # Global configuration
  global:
    scrape_interval: 30s
    scrape_timeout: 10s
    external_labels:
      cluster: '${CLUSTER_NAME}'
      region: '${AWS_REGION}'

    # Remote write to Mimir/Prometheus
    remote_write:
      - url: https://mimir.${DOMAIN}/api/v1/push
        headers:
          X-Scope-OrgID: '${TENANT_ID}'
        # Queue configuration
        queue_config:
          capacity: 10000
          max_samples_per_send: 2000
          batch_send_deadline: 5s
          min_shards: 1
          max_shards: 200
        # Drop noisy Go runtime metrics before sending
        write_relabel_configs:
          - source_labels: [__name__]
            regex: 'go_.*'
            action: drop

  # Scrape configurations
  configs:
    - name: default
      scrape_configs:
        # Self-scrape
        - job_name: 'grafana-agent'
          static_configs:
            - targets: ['127.0.0.1:9090']
          scrape_interval: 30s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance

        # Local applications scrape
        - job_name: 'local-apps'
          static_configs:
            - targets: ['127.0.0.1:8080', '127.0.0.1:3000']
          scrape_interval: 15s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # Service discovery via file
        - job_name: 'file-sd'
          file_sd_configs:
            - files:
                - '/etc/grafana-agent/targets/*.json'
              refresh_interval: 30s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
# Logs configuration
logs:
  configs:
    - name: default
      # Loki client
      clients:
        - url: https://loki.${DOMAIN}/loki/api/v1/push
          headers:
            X-Scope-OrgID: '${TENANT_ID}'
          # Batching configuration
          batchwait: 1s
          batchsize: 1048576
          # Retry configuration
          backoff_config:
            min_period: 500ms
            max_period: 5m
            max_retries: 10

      # Positions file
      positions:
        filename: /var/lib/grafana-agent/positions.yaml

      # Log scrape configurations
      scrape_configs:
        # Docker logs
        - job_name: docker
          docker_sd_configs:
            - host: "unix:///var/run/docker.sock"
              refresh_interval: 30s
          relabel_configs:
            - source_labels: [__meta_docker_container_name]
              target_label: container
            - source_labels: [__meta_docker_container_name]
              target_label: service_name
            - source_labels: [__meta_docker_container_log_stream]
              target_label: stream
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # System logs via journald
        - job_name: systemd
          journal:
            json: false
            max_age: 12h
            path: /var/log/journal
          relabel_configs:
            - source_labels: [__journal__systemd_unit]
              target_label: unit
            - source_labels: [__journal__hostname]
              target_label: hostname
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # Specific file logs
        - job_name: syslog
          static_configs:
            - targets: [localhost]
              labels:
                job: syslog
                tenant: '${TENANT_ID}'
                __path__: /var/log/syslog
          relabel_configs:
            - action: replace
              target_label: instance
              replacement: '${INSTANCE_NAME}'

        # Custom application logs
        - job_name: app-logs
          static_configs:
            - targets: [localhost]
              labels:
                job: app-logs
                tenant: '${TENANT_ID}'
                __path__: /var/log/myapp/*.log
          # Processing pipeline
          pipeline_stages:
            - json:
                expressions:
                  timestamp: timestamp
                  level: level
                  message: message
                  module: module
            - timestamp:
                source: timestamp
                format: RFC3339Nano
            - labels:
                level:
                module:
# Traces configuration (optional)
traces:
  configs:
    - name: default
      receivers:
        jaeger:
          protocols:
            thrift_http:
              endpoint: 0.0.0.0:14268
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      remote_write:
        - endpoint: https://tempo.${DOMAIN}:443
          headers:
            X-Scope-OrgID: '${TENANT_ID}'
Environment Variables File
Create /etc/grafana-agent/environment:
# /etc/grafana-agent/environment
# Instance identification
INSTANCE_NAME="server-001"
TENANT_ID="company"
# Cluster and region
CLUSTER_NAME="production"
AWS_REGION="us-east-1"
# Base domain
DOMAIN="monitoring.company.com"
# Credentials (if needed)
# AWS_ACCESS_KEY_ID="your-access-key"
# AWS_SECRET_ACCESS_KEY="your-secret-key"
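Since this file can end up holding credentials, it is worth restricting who can read it:
# Restrict access to the environment file
sudo chown root:grafana-agent /etc/grafana-agent/environment
sudo chmod 640 /etc/grafana-agent/environment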
Configuration Script with Variable Substitution
Create /usr/local/bin/setup-grafana-agent.sh:
#!/bin/bash
# Script to configure Grafana Agent with variable substitution
set -e
# Load environment variables
source /etc/grafana-agent/environment
# Function to substitute variables in configuration file
# Function to substitute variables in the configuration file
substitute_variables() {
    local config_file="/etc/grafana-agent/config.yaml"
    local temp_file="/tmp/config.yaml.tmp"

    # Substitute variables
    envsubst < "${config_file}.template" > "${temp_file}"

    # Validate configuration
    if grafana-agent --config.file="${temp_file}" --config.validate; then
        mv "${temp_file}" "${config_file}"
        chown grafana-agent:grafana-agent "${config_file}"
        echo "Configuration updated successfully"
    else
        echo "Configuration validation error"
        rm -f "${temp_file}"
        exit 1
    fi
}

# Check that the template exists
if [[ ! -f "/etc/grafana-agent/config.yaml.template" ]]; then
    echo "Configuration template not found"
    exit 1
fi

# Substitute variables
substitute_variables

# Reload the service if it is running
if systemctl is-active --quiet grafana-agent; then
    systemctl reload grafana-agent
    echo "Service reloaded"
fi
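The script has to be executable before its first run:
sudo chmod +x /usr/local/bin/setup-grafana-agent.sh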
Dynamic Service Discovery
File SD Configuration
Create directory for targets:
sudo mkdir -p /etc/grafana-agent/targets
sudo chown grafana-agent:grafana-agent /etc/grafana-agent/targets
Example targets file (/etc/grafana-agent/targets/web-servers.json):
[
  {
    "targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
    "labels": {
      "job": "node-exporter",
      "env": "production",
      "team": "infrastructure"
    }
  },
  {
    "targets": ["192.168.1.20:8080", "192.168.1.21:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "backend"
    }
  }
]
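A quick sanity check before the agent picks the file up (assuming python3 is available; the file-sd job re-reads the targets directory every refresh_interval, so no restart is needed):
# Validate the JSON syntax
python3 -m json.tool /etc/grafana-agent/targets/web-servers.json > /dev/null && echo "JSON OK"
# Make sure the agent user can read it
sudo chown grafana-agent:grafana-agent /etc/grafana-agent/targets/web-servers.json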
Management Scripts
Status Script
#!/bin/bash
# /usr/local/bin/grafana-agent-status.sh
echo "=== Grafana Agent Status ==="
systemctl status grafana-agent --no-pager
echo ""
echo "=== Recent logs ==="
journalctl -u grafana-agent --no-pager -n 20
echo ""
echo "=== Resource usage ==="
ps aux | grep grafana-agent | grep -v grep
echo ""
echo "=== Connectivity check ==="
curl -sf http://localhost:9090/-/ready > /dev/null && echo "Agent ready" || echo "Agent not ready"
echo ""
echo "=== Status metrics ==="
curl -s http://localhost:9090/metrics | grep -E 'prometheus_agent_|agent_build_info'
Configuration Backup Script
#!/bin/bash
# /usr/local/bin/backup-agent-config.sh
BACKUP_DIR="/var/backups/grafana-agent"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Configuration backup
tar -czf "$BACKUP_DIR/grafana-agent-config-$DATE.tar.gz" \
/etc/grafana-agent/ \
/etc/systemd/system/grafana-agent.service
# Keep only last 10 backups
ls -t "$BACKUP_DIR"/grafana-agent-config-*.tar.gz | tail -n +11 | xargs -r rm
echo "Backup created: $BACKUP_DIR/grafana-agent-config-$DATE.tar.gz"
Initialization and Verification
First Startup
# 1. Create configuration from template
sudo cp /etc/grafana-agent/config.yaml /etc/grafana-agent/config.yaml.template
# 2. Run configuration script
sudo /usr/local/bin/setup-grafana-agent.sh
# 3. Validate configuration
sudo grafana-agent --config.file=/etc/grafana-agent/config.yaml --config.validate
# 4. Start service
sudo systemctl start grafana-agent
# 5. Check status
sudo systemctl status grafana-agent
# 6. Check logs
sudo journalctl -u grafana-agent -f
Verification Commands
# Service status
systemctl status grafana-agent
# Real-time logs
journalctl -u grafana-agent -f
# Check if listening on ports
ss -tulpn | grep grafana-agent
# Check agent metrics
curl http://localhost:9090/metrics
# Check readiness
curl http://localhost:9090/-/ready
# Check current configuration
curl http://localhost:9090/-/config
# Check discovered targets
curl http://localhost:9090/api/v1/targets
Monitoring and Troubleshooting
Key Agent Metrics
# Agent CPU usage
rate(process_cpu_seconds_total{job="grafana-agent"}[5m])
# Memory used
process_resident_memory_bytes{job="grafana-agent"}
# Samples sent via remote write
rate(prometheus_remote_storage_samples_total[5m])
# Remote write failures
rate(prometheus_remote_storage_samples_failed_total[5m])
# WAL size
prometheus_tsdb_wal_size_bytes
# Logs sent
rate(promtail_sent_entries_total[5m])
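The same counters can also be checked locally, straight from the agent, before they ever reach the backend (exact metric names can vary slightly between agent versions):
# Remote write counters from the agent's own metrics endpoint
curl -s http://localhost:9090/metrics | grep -E 'prometheus_remote_storage_samples_(total|failed_total)'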
Common Issues and Solutions
1. Remote Write Failures
# Check connectivity
curl -I https://mimir.domain.com/api/v1/push
# Check certificates
openssl s_client -connect mimir.domain.com:443
# Check specific logs
journalctl -u grafana-agent | grep "remote_write"
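To separate network/TLS problems from authentication or tenant problems, the push endpoint can be exercised by hand. The exact status codes depend on your Mimir/gateway setup, but a 401/403 usually points at credentials or the X-Scope-OrgID header rather than connectivity:
# Manually exercise the push endpoint with the tenant header
source /etc/grafana-agent/environment
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H "X-Scope-OrgID: ${TENANT_ID}" \
  https://mimir.${DOMAIN}/api/v1/push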
2. High Memory Usage
# Adjust configurations in config.yaml
metrics:
  global:
    remote_write:
      - url: https://mimir.${DOMAIN}/api/v1/push
        queue_config:
          capacity: 5000              # Reduce from 10000
          max_samples_per_send: 1000  # Reduce from 2000
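A growing WAL on disk usually means remote write is falling behind, so it is worth checking alongside the memory figures:
# WAL size on disk
du -sh /var/lib/grafana-agent/wal
# Agent memory as reported by its own metrics endpoint
curl -s http://localhost:9090/metrics | grep process_resident_memory_bytes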
3. Logs Not Being Collected
# Check permissions
ls -la /var/log/
ls -la /var/run/docker.sock
# Add user to docker group (for Docker logs)
sudo usermod -a -G docker grafana-agent
sudo systemctl restart grafana-agent
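It also helps to confirm what the grafana-agent user can actually read and whether positions have been recorded. For journald, membership in the systemd-journal group is usually required as well:
# Check read access as the service user
sudo -u grafana-agent test -r /var/run/docker.sock && echo "docker.sock readable" || echo "no access to docker.sock"
sudo -u grafana-agent test -r /var/log/syslog && echo "syslog readable" || echo "no access to syslog"
# Journald access usually needs the systemd-journal group
sudo usermod -a -G systemd-journal grafana-agent
# What the agent has already read (positions file)
cat /var/lib/grafana-agent/positions.yaml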
Advanced Configurations
Rate Limiting
metrics:
  configs:
    - name: default
      scrape_configs:
        - job_name: 'rate-limited-app'
          static_configs:
            - targets: ['app:8080']
          scrape_interval: 1m  # Less frequent for sensitive apps
          metrics_path: /metrics
          honor_labels: true
Metrics Filtering
metrics:
  global:
    remote_write:
      - url: https://mimir.domain.com/api/v1/push
        write_relabel_configs:
          # Drop specific metrics
          - source_labels: [__name__]
            regex: 'go_gc_.*|go_memstats_.*'
            action: drop
          # Keep only important metrics
          - source_labels: [__name__]
            regex: 'up|cpu_usage_.*|memory_usage_.*'
            action: keep
Multi-tenancy
metrics:
  global:
    remote_write:
      # Tenant A
      - url: https://mimir.domain.com/api/v1/push
        headers:
          X-Scope-OrgID: 'tenant-a'
        write_relabel_configs:
          - source_labels: [tenant]
            regex: 'tenant-a'
            action: keep
      # Tenant B
      - url: https://mimir.domain.com/api/v1/push
        headers:
          X-Scope-OrgID: 'tenant-b'
        write_relabel_configs:
          - source_labels: [tenant]
            regex: 'tenant-b'
            action: keep
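A rough way to confirm that each tenant is actually receiving samples, assuming Mimir's usual /prometheus query path behind your gateway:
# Query the 'up' series as each tenant
curl -s -H 'X-Scope-OrgID: tenant-a' "https://mimir.domain.com/prometheus/api/v1/query?query=up" | head -c 300; echo
curl -s -H 'X-Scope-OrgID: tenant-b' "https://mimir.domain.com/prometheus/api/v1/query?query=up" | head -c 300; echo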
Conclusion
Grafana Agent provides a unified and efficient solution for observability, consolidating multiple tools into a single binary. With native remote write and embedded integrations, it significantly simplifies monitoring architecture, reducing operational complexity and resource overhead.
Implementation Benefits:
- Simplicity: Single agent for metrics, logs, and traces
- Efficiency: Lower resource usage than multiple agents
- Flexibility: Modular configuration by need
- Scalability: Native remote write for distributed backends
- Maintenance: Centralized management via systemd