Thiago da Silva

Grafana Agent Installation and Configuration

Grafana Agent Installation and Configuration Documentation

Introduction to Grafana Agent

Grafana Agent is a single, lightweight binary that consolidates several observability tools. It removes the need to run each of the following separately:

  • Prometheus for metrics collection
  • Promtail for log collection
  • Node Exporter for system metrics
  • cAdvisor for container metrics
  • Other exporters from the Prometheus ecosystem

Key Advantages

  • Single binary: Reduces operational complexity
  • Smaller footprint: Lower resource usage compared to multiple agents
  • Native Remote Write: Direct integration with Prometheus, Mimir, Cortex
  • Unified configuration: Single YAML for all functionalities
  • Flexibility: Enable/disable components as needed

Architecture and Components

┌─────────────────────────────────────────────────────────────┐
│                    Grafana Agent                            │
├─────────────────┬─────────────────┬─────────────────────────┤
│   Integrations  │     Metrics     │         Logs            │
│                 │                 │                         │
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────────────┐ │
│ │Node Exporter│ │ │ Prometheus  │ │ │     Promtail        │ │
│ │             │ │ │   Scraper   │ │ │                     │ │
│ │ ┌─────────┐ │ │ │             │ │ │ ┌─────────────────┐ │ │
│ │ │cAdvisor │ │ │ │ ┌─────────┐ │ │ │ │ Docker Logs     │ │ │
│ │ └─────────┘ │ │ │ │ WAL     │ │ │ │ │                 │ │ │
│ │             │ │ │ └─────────┘ │ │ │ │ Syslog          │ │ │
│ │ ┌─────────┐ │ │ │             │ │ │ │                 │ │ │
│ │ │Custom   │ │ │ │Remote Write │ │ │ │ File Logs       │ │ │
│ │ │Exporters│ │ │ │             │ │ │ └─────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────────┘ │ │                     │ │
│ └─────────────┘ │                 │ └─────────────────────┘ │
└─────────────────┴─────────────────┴─────────────────────────┘
                        │                        │
                        ▼                        ▼
              ┌─────────────────┐    ┌─────────────────┐
              │ Mimir/Prometheus │    │      Loki       │
              │     Backend      │    │    Backend      │
              └─────────────────┘    └─────────────────┘

Grafana Agent Installation

1. Download and Installation

# Define version
AGENT_VERSION="v0.40.3"
ARCH="amd64"  # or arm64 for ARM

# Download binary
wget https://github.com/grafana/agent/releases/download/${AGENT_VERSION}/grafana-agent-linux-${ARCH}.zip

# Extract and install
unzip grafana-agent-linux-${ARCH}.zip
sudo mv grafana-agent-linux-${ARCH} /usr/local/bin/grafana-agent
sudo chmod +x /usr/local/bin/grafana-agent

# Verify installation
grafana-agent --version
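
The ARCH value above is set by hand. As a minimal sketch, it could instead be derived from the output of `uname -m`. The helper name `agent_arch` is hypothetical, and only the two architectures mentioned above are mapped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a machine name (as printed by `uname -m`)
# to the architecture suffix used in the release file name.
agent_arch() {
    local machine="${1:-$(uname -m)}"
    case "$machine" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *) echo "unsupported architecture: $machine" >&2; return 1 ;;
    esac
}

# Usage: ARCH="$(agent_arch)"
```

Passing the machine name as an optional argument keeps the function easy to check on a host of a different architecture.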

2. User and Directory Creation

# Create grafana-agent user
sudo useradd --system --no-create-home --shell /bin/false grafana-agent

# Create necessary directories
sudo mkdir -p /etc/grafana-agent
sudo mkdir -p /var/lib/grafana-agent
sudo mkdir -p /var/log/grafana-agent

# Set permissions
sudo chown -R grafana-agent:grafana-agent /var/lib/grafana-agent
sudo chown -R grafana-agent:grafana-agent /var/log/grafana-agent
sudo chown grafana-agent:grafana-agent /etc/grafana-agent

3. Systemd Configuration

Create the service file:

sudo tee /etc/systemd/system/grafana-agent.service > /dev/null <<EOF
[Unit]
Description=Grafana Agent
Documentation=https://grafana.com/docs/agent/
Wants=network-online.target
After=network-online.target
Requires=network.target

[Service]
Type=simple
User=grafana-agent
Group=grafana-agent
# The WAL and positions paths are set in config.yaml, so no storage flag is passed here
ExecStart=/usr/local/bin/grafana-agent --config.file=/etc/grafana-agent/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

# Resource limits
LimitNOFILE=65536
LimitNPROC=32768

# Security
NoNewPrivileges=true
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadWritePaths=/var/lib/grafana-agent /var/log/grafana-agent /tmp

[Install]
WantedBy=multi-user.target
EOF

4. Enable and Start Service

# Reload systemd
sudo systemctl daemon-reload

# Enable for boot startup
sudo systemctl enable grafana-agent

# Don't start yet (we'll configure first)
# sudo systemctl start grafana-agent

Complete Configuration

Main Configuration File

Create the file /etc/grafana-agent/config.yaml:

# /etc/grafana-agent/config.yaml

# Server configurations
server:
  log_level: info
  log_format: logfmt
  http_listen_port: 9090
  grpc_listen_port: 9091

# Integrations (Built-in Exporters)
integrations:
  # Integrated Node Exporter
  node_exporter:
    enabled: true
    # Filesystem paths
    rootfs_path: /
    sysfs_path: /sys
    procfs_path: /proc

    # Enabled collectors
    set_collectors:
      - uname
      - cpu
      - loadavg
      - meminfo
      - filesystem
      - netdev
      - diskstats
      - cpufreq
      - os
      - time
      - xfs
      - cpu_guest_seconds_metric
      - boottime
      - systemd
      - processes
      - nvme
      - nfs
      - netstat
      - logind
      - stat
      - vmstat

    # Relabeling configurations
    relabel_configs:
      - action: replace
        replacement: '${INSTANCE_NAME}'
        target_label: instance
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant

  # Integrated cAdvisor (optional)
  cadvisor:
    enabled: true
    docker_only: true
    instance: '${INSTANCE_NAME}'

    relabel_configs:
      - action: replace
        replacement: 'cadvisor'
        target_label: job
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant

    # Remove unnecessary labels
    metric_relabel_configs:
      - action: labeldrop
        regex: 'container_label_com_docker_compose_.*'
      - action: labeldrop
        regex: 'container_label_org_.*'

  # Integrated Process Exporter (optional)
  process_exporter:
    enabled: false
    config:
      process_names:
        - name: "{{.Comm}}"
          cmdline:
          - '.+'

# Metrics Configuration
metrics:
  # Write-Ahead Log directory
  wal_directory: /var/lib/grafana-agent/wal

  # Global configurations
  global:
    scrape_interval: 30s
    scrape_timeout: 10s
    external_labels:
      cluster: '${CLUSTER_NAME}'
      region: '${AWS_REGION}'

    # Remote Write to Mimir/Prometheus
    remote_write:
      - url: https://mimir.${DOMAIN}/api/v1/push
        headers:
          X-Scope-OrgID: '${TENANT_ID}'

        # Queue configurations
        queue_config:
          capacity: 10000
          max_samples_per_send: 2000
          batch_send_deadline: 5s
          min_shards: 1
          max_shards: 200

        # Drop Go runtime metrics before sending
        write_relabel_configs:
          - source_labels: [__name__]
            regex: 'go_.*'
            action: drop

  # Scrape configurations
  configs:
    - name: default
      scrape_configs:
        # Self-scrape
        - job_name: 'grafana-agent'
          static_configs:
            - targets: ['127.0.0.1:9090']
          scrape_interval: 30s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance

        # Local applications scrape
        - job_name: 'local-apps'
          static_configs:
            - targets: ['127.0.0.1:8080', '127.0.0.1:3000']
          scrape_interval: 15s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # Service Discovery via file
        - job_name: 'file-sd'
          file_sd_configs:
            - files:
              - '/etc/grafana-agent/targets/*.json'
              refresh_interval: 30s
          relabel_configs:
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance

# Logs Configuration
logs:
  configs:
    - name: default
      # Client for Loki
      clients:
        - url: https://loki.${DOMAIN}/loki/api/v1/push
          headers:
            X-Scope-OrgID: '${TENANT_ID}'

          # Batching configurations
          batchwait: 1s
          batchsize: 1048576

          # Retry configurations
          backoff_config:
            min_period: 500ms
            max_period: 5m
            max_retries: 10

      # Positions file
      positions:
        filename: /var/lib/grafana-agent/positions.yaml

      # Log scrape configurations
      scrape_configs:
        # Docker logs
        - job_name: docker
          docker_sd_configs:
            - host: "unix:///var/run/docker.sock"
              refresh_interval: 30s

          relabel_configs:
            - source_labels: [__meta_docker_container_name]
              target_label: container
            - source_labels: [__meta_docker_container_name]
              target_label: service_name
            - source_labels: [__meta_docker_container_log_stream]
              target_label: stream
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # System logs via journald
        - job_name: systemd
          journal:
            json: false
            max_age: 12h
            path: /var/log/journal

          relabel_configs:
            - source_labels: [__journal__systemd_unit]
              target_label: unit
            - source_labels: [__journal__hostname]
              target_label: hostname
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant

        # Specific file logs
        - job_name: syslog
          static_configs:
            - targets: [localhost]
              labels:
                job: syslog
                tenant: '${TENANT_ID}'
                __path__: /var/log/syslog

          relabel_configs:
            - action: replace
              target_label: instance
              replacement: '${INSTANCE_NAME}'

        # Custom application logs
        - job_name: app-logs
          static_configs:
            - targets: [localhost]
              labels:
                job: app-logs
                tenant: '${TENANT_ID}'
                __path__: /var/log/myapp/*.log

          # Processing pipeline
          pipeline_stages:
            - json:
                expressions:
                  timestamp: timestamp
                  level: level
                  message: message
                  module: module

            - timestamp:
                source: timestamp
                format: RFC3339Nano

            - labels:
                level:
                module:

# Traces Configuration (optional)
traces:
  configs:
    - name: default
      receivers:
        jaeger:
          protocols:
            thrift_http:
              endpoint: 0.0.0.0:14268
            grpc:
              endpoint: 0.0.0.0:14250

        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318

      remote_write:
        - endpoint: https://tempo.${DOMAIN}:443
          headers:
            X-Scope-OrgID: '${TENANT_ID}'

Environment Variables File

Create /etc/grafana-agent/environment:

# /etc/grafana-agent/environment

# Instance identification
INSTANCE_NAME="server-001"
TENANT_ID="company"

# Cluster and region
CLUSTER_NAME="production"
AWS_REGION="us-east-1"

# Base domain
DOMAIN="monitoring.company.com"

# Credentials (if needed)
# AWS_ACCESS_KEY_ID="your-access-key"
# AWS_SECRET_ACCESS_KEY="your-secret-key"
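
The configuration and the environment file have to agree on variable names. As a quick cross-check, the placeholders referenced by the configuration could be listed with a small helper. This is only a sketch: `template_vars` is a hypothetical name, and the pattern recognizes only the `${NAME}` form used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct ${VAR} placeholder names found
# in a file, one per line, sorted.
template_vars() {
    local file="$1"
    # -o prints only the matched text; sed strips the surrounding "${" and "}".
    grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' "$file" \
        | sed -e 's/^\${//' -e 's/}$//' \
        | sort -u
}

# Usage: template_vars /etc/grafana-agent/config.yaml
```

Comparing this list with the names defined in the environment file shows which placeholders would be left without a value.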

Configuration Script with Variable Substitution

Create /usr/local/bin/setup-grafana-agent.sh:

#!/bin/bash

# Script to configure Grafana Agent with variable substitution

set -e

# Load environment variables and export them; envsubst only sees exported variables
set -a
source /etc/grafana-agent/environment
set +a

# Function to substitute variables in configuration file
substitute_variables() {
    local config_file="/etc/grafana-agent/config.yaml"
    local temp_file="/tmp/config.yaml.tmp"

    # Substitute variables
    envsubst < "${config_file}.template" > "${temp_file}"

    # Validate configuration
    if grafana-agent --config.file="${temp_file}" --config.validate; then
        mv "${temp_file}" "${config_file}"
        chown grafana-agent:grafana-agent "${config_file}"
        echo "Configuration updated successfully"
    else
        echo "Configuration validation error"
        rm -f "${temp_file}"
        exit 1
    fi
}

# Check if template exists
if [[ ! -f "/etc/grafana-agent/config.yaml.template" ]]; then
    echo "Configuration template not found"
    exit 1
fi

# Substitute variables
substitute_variables

# Restart service if running (the unit file defines no ExecReload,
# so "systemctl reload" would fail)
if systemctl is-active --quiet grafana-agent; then
    systemctl restart grafana-agent
    echo "Service restarted"
fi
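
envsubst replaces any variable it cannot find with an empty string, so a missing value would silently produce a broken configuration. One possible guard, sketched here under the assumption that the script runs in bash, is to fail fast before substitution. The helper name `require_vars` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_vars() {
    local name missing=0
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name.
        if [[ -z "${!name:-}" ]]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, before calling envsubst:
# require_vars INSTANCE_NAME TENANT_ID CLUSTER_NAME AWS_REGION DOMAIN || exit 1
```

The function reports every missing name in one pass rather than stopping at the first, which shortens the fix-and-retry loop.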

Dynamic Service Discovery

File SD Configuration

Create directory for targets:

sudo mkdir -p /etc/grafana-agent/targets
sudo chown grafana-agent:grafana-agent /etc/grafana-agent/targets

Example targets file (/etc/grafana-agent/targets/web-servers.json):

[
  {
    "targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
    "labels": {
      "job": "node-exporter",
      "env": "production",
      "team": "infrastructure"
    }
  },
  {
    "targets": ["192.168.1.20:8080", "192.168.1.21:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "backend"
    }
  }
]
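
Target files like the one above are often generated by a script rather than written by hand. The following is a minimal sketch of such a generator. `render_targets` is a hypothetical name, it emits a single target group with only `job` and `env` labels, and it assumes the inputs contain no quotes or backslashes because nothing is JSON-escaped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a one-group file_sd JSON document.
# Usage: render_targets JOB ENV TARGET [TARGET...]
render_targets() {
    local job="$1" env="$2"
    shift 2
    local targets="" t
    for t in "$@"; do
        # Join targets as a comma-separated list of JSON strings.
        targets="${targets:+$targets, }\"$t\""
    done
    printf '[\n  {\n    "targets": [%s],\n    "labels": {"job": "%s", "env": "%s"}\n  }\n]\n' \
        "$targets" "$job" "$env"
}

# Write to a hidden temporary file, then rename it into place:
# render_targets node-exporter production 192.168.1.10:9100 > /etc/grafana-agent/targets/.nodes.tmp \
#   && mv /etc/grafana-agent/targets/.nodes.tmp /etc/grafana-agent/targets/nodes.json
```

Writing to a temporary name that does not match `*.json` and renaming afterwards means a refresh should never pick up a half-written file.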

Management Scripts

Status Script

#!/bin/bash
# /usr/local/bin/grafana-agent-status.sh

echo "=== Grafana Agent Status ==="
systemctl status grafana-agent --no-pager

echo ""
echo "=== Recent logs ==="
journalctl -u grafana-agent --no-pager -n 20

echo ""
echo "=== Resource usage ==="
ps aux | grep grafana-agent | grep -v grep

echo ""
echo "=== Connectivity check ==="
curl -s http://localhost:9090/-/ready && echo "Agent ready" || echo "Agent not ready"

echo ""
echo "=== Status metrics ==="
curl -s http://localhost:9090/metrics | grep -E 'prometheus_agent_|agent_build_info'

Configuration Backup Script

#!/bin/bash
# /usr/local/bin/backup-agent-config.sh

BACKUP_DIR="/var/backups/grafana-agent"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"

# Configuration backup
tar -czf "$BACKUP_DIR/grafana-agent-config-$DATE.tar.gz" \
    /etc/grafana-agent/ \
    /etc/systemd/system/grafana-agent.service

# Keep only last 10 backups
ls -t "$BACKUP_DIR"/grafana-agent-config-*.tar.gz | tail -n +11 | xargs -r rm

echo "Backup created: $BACKUP_DIR/grafana-agent-config-$DATE.tar.gz"

Initialization and Verification

First Startup

# 1. Save the configuration as the template that the setup script reads
sudo cp /etc/grafana-agent/config.yaml /etc/grafana-agent/config.yaml.template

# 2. Run configuration script
sudo /usr/local/bin/setup-grafana-agent.sh

# 3. Validate configuration
sudo grafana-agent --config.file=/etc/grafana-agent/config.yaml --config.validate

# 4. Start service
sudo systemctl start grafana-agent

# 5. Check status
sudo systemctl status grafana-agent

# 6. Check logs
sudo journalctl -u grafana-agent -f

Verification Commands

# Service status
systemctl status grafana-agent

# Real-time logs
journalctl -u grafana-agent -f

# Check if listening on ports
ss -tulpn | grep grafana-agent

# Check agent metrics
curl http://localhost:9090/metrics

# Check readiness
curl http://localhost:9090/-/ready

# Check current configuration (requires the -config.enable-read-api flag)
curl http://localhost:9090/-/config

# Check discovered targets
curl http://localhost:9090/agent/api/v1/metrics/targets
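
Right after a start or restart the readiness check can fail for a few seconds. A small retry loop avoids false alarms in scripts. This is a sketch only: `wait_until` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example: poll the readiness endpoint used above for up to 30 seconds.
# wait_until 30 1 curl -sf http://localhost:9090/-/ready && echo "Agent ready"
```

The `-f` option makes curl exit non-zero on HTTP error responses, so the loop keeps retrying until the endpoint answers successfully.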

Monitoring and Troubleshooting

Key Agent Metrics

# Agent CPU usage
rate(process_cpu_seconds_total{job="grafana-agent"}[5m])

# Memory used
process_resident_memory_bytes{job="grafana-agent"}

# Samples sent via remote write
rate(prometheus_remote_storage_samples_total[5m])

# Remote write failures
rate(prometheus_remote_storage_samples_failed_total[5m])

# WAL size
prometheus_tsdb_wal_size_bytes

# Logs sent
rate(promtail_sent_entries_total[5m])
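
The queries above need a backend to run against. For a quick local look, the same counters can be read straight from the agent's own /metrics output. The sketch below sums all series of one metric. `sum_metric` is a hypothetical helper, and it assumes the sample value is the last field on each line (no trailing timestamps).

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum every sample of one metric from Prometheus
# text-format input on stdin (all label sets are added together).
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                          # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)           # strip the {label="..."} part
            if (metric == name) total += $NF   # the sample value is the last field
        }
        END { printf "%.0f\n", total }
    '
}

# Usage:
# curl -s http://localhost:9090/metrics | sum_metric prometheus_remote_storage_samples_failed_total
```

Running it twice a few seconds apart and comparing the two numbers gives a rough manual equivalent of `rate()`.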

Common Issues and Solutions

1. Remote Write Failures

# Check connectivity
curl -I https://mimir.domain.com/api/v1/push

# Check certificates
openssl s_client -connect mimir.domain.com:443

# Check specific logs
journalctl -u grafana-agent | grep "remote_write"

2. High Memory Usage

# Adjust configurations in config.yaml
metrics:
  global:
    remote_write:
      - queue_config:
          capacity: 5000        # Reduce from 10000
          max_samples_per_send: 1000  # Reduce from 2000

3. Logs Not Being Collected

# Check permissions
ls -la /var/log/
ls -la /var/run/docker.sock

# Add user to docker group (for Docker logs)
sudo usermod -a -G docker grafana-agent
sudo systemctl restart grafana-agent

Advanced Configurations

Rate Limiting

metrics:
  configs:
    - name: default
      scrape_configs:
        - job_name: 'rate-limited-app'
          static_configs:
            - targets: ['app:8080']
          scrape_interval: 1m  # Less frequent for sensitive apps
          metrics_path: /metrics
          honor_labels: true

Metrics Filtering

metrics:
  global:
    remote_write:
      - url: https://mimir.domain.com/api/v1/push
        write_relabel_configs:
          # Drop specific metrics
          - source_labels: [__name__]
            regex: 'go_gc_.*|go_memstats_.*'
            action: drop

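          # Keep only important metrics. A keep rule discards every series
          # whose name does not match, which makes the drop rule above
          # redundant when both are present.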
          - source_labels: [__name__]
            regex: 'up|cpu_usage_.*|memory_usage_.*'
            action: keep
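
Relabeling regexes are matched against the whole value, not a substring, which is easy to get wrong when writing keep and drop rules. The sketch below approximates both actions on a list of metric names so a pattern can be tried out before it is deployed. The helper names are hypothetical, and `grep -E` only approximates the RE2 syntax that Prometheus uses.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: approximate the keep and drop actions on metric
# names read from stdin. The regex is wrapped in ^(...)$ to mimic the
# full-string anchoring that relabeling applies.
relabel_keep() { grep -E "^($1)\$" || true; }
relabel_drop() { grep -vE "^($1)\$" || true; }

# Usage:
# printf 'up\nstartup_total\ncpu_usage_idle\n' | relabel_keep 'up|cpu_usage_.*'
```

In the usage line, `startup_total` contains `up` but is not kept, because the pattern has to match the full name.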

Multi-tenancy

metrics:
  global:
    remote_write:
      # Tenant A
      - url: https://mimir.domain.com/api/v1/push
        headers:
          X-Scope-OrgID: 'tenant-a'
        write_relabel_configs:
          - source_labels: [tenant]
            regex: 'tenant-a'
            action: keep

      # Tenant B  
      - url: https://mimir.domain.com/api/v1/push
        headers:
          X-Scope-OrgID: 'tenant-b'
        write_relabel_configs:
          - source_labels: [tenant]
            regex: 'tenant-b'
            action: keep

Conclusion

Grafana Agent provides a unified and efficient solution for observability, consolidating multiple tools into a single binary. With native remote write and embedded integrations, it significantly simplifies monitoring architecture, reducing operational complexity and resource overhead.

Implementation Benefits:

  1. Simplicity: Single agent for metrics, logs, and traces
  2. Efficiency: Lower resource usage than multiple agents
  3. Flexibility: Modular configuration by need
  4. Scalability: Native remote write for distributed backends
  5. Maintenance: Centralized management via systemd
