Thiago da Silva

Grafana K8s Stack Implementation on Kubernetes Cluster

Introduction

This article describes the complete implementation of an observability stack based on the Grafana ecosystem in a Kubernetes cluster. The stack includes:

  • Grafana Alloy: Metrics and logs collection agent
  • Loki: Log aggregation system
  • Mimir: Long-term metrics storage system
  • Grafana: Visualization interface and dashboards

Solution Architecture

The solution uses a distributed architecture where:

  1. Alloy collects metrics and logs from the cluster
  2. Loki stores and indexes logs with configurable retention
  3. Mimir stores long-term metrics
  4. Grafana provides the visualization interface
  5. AWS Load Balancer Controller manages ingress

Prerequisites

  • Running Kubernetes cluster
  • Helm 3.x installed
  • AWS Load Balancer Controller configured
  • Valid TLS certificates
  • S3 buckets configured for storage
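A quick way to sanity-check these prerequisites before installing anything. The controller name and namespace assume a default AWS Load Balancer Controller install, the bucket variables match the deployment script in section 7, and the last two commands assume the AWS CLI is available:

kubectl version
helm version --short
kubectl get deployment -n kube-system aws-load-balancer-controller
aws s3 ls "s3://${LOKI_S3_BUCKET}"
aws s3 ls "s3://${MIMIR_BLOCKS_BUCKET}"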

1. Grafana Alloy Configuration

Alloy is responsible for collecting metrics and logs from the Kubernetes cluster.

# values-alloy.yaml
alloy-logs:
  config:
    client:
      timeout: 30s
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  enabled: true

alloy-metrics:
  config:
    global:
      external_labels:
        cluster: ${CLUSTER_NAME}
        source: alloy-agent
      scrape_interval: 3m
      scrape_timeout: 2m30s
    remote_write:
    - headers:
        X-Scope-OrgID: ${TENANT_ID}
      queue_config:
        batch_send_deadline: 30s
        max_backoff: 10s
        max_retries: 10
        max_samples_per_send: 1000
        min_backoff: 100ms
        retry_on_http_429: true
      send_exemplars: true
      url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
  enabled: true
  resources:
    limits:
      memory: 768Mi
    requests:
      cpu: 250m
      memory: 384Mi

cluster:
  name: ${CLUSTER_NAME}

clusterEvents:
  enabled: true
  scrapeInterval: 3m

clusterMetrics:
  cadvisor:
    enabled: true
    metricRelabelings:
    - action: keep
      regex: container_cpu_.*|container_memory_.*|container_network_.*|container_fs_.*|machine_cpu_.*|container_spec_.*
      sourceLabels:
      - __name__
    relabelings:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
  enabled: true
  kubelet:
    enabled: true
  nodeExporter:
    enabled: true

externalServices:
  loki:
    externalLabels:
      cluster: ${CLUSTER_NAME}
      tenant: ${TENANT_ID}
    host: http://loki-loki-distributed-gateway.loki.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /loki/api/v1/push
  prometheus:
    extraHeaders:
      X-Scope-OrgID: ${TENANT_ID}
    host: http://mimir-gateway.mimir.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /api/v1/push

extraScrapeConfigs: |-
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
    scrape_interval: 3m
    scrape_timeout: 2m30s

  - job_name: 'kubernetes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_cpu_.*|container_memory_.*|machine_cpu_.*|container_fs_.*|container_network_.*|container_spec_.*'
      action: keep
    scrape_interval: 3m
    scrape_timeout: 2m30s

kube-state-metrics:
  enabled: true
  metricLabelsAllowlist:
  - pods=[app,name,component]
  - deployments=[app,name,component]
  - statefulsets=[app,name,component]
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi

logs:
  enabled: true
  extraClientConfig: |
    client:
      timeout: 10s
      batchwait: 1s 
      batchsize: 1048576
      max_retries: 5
      retry_on_status_codes: [429, 500, 502, 503, 504]
      backoff_config:
        min_period: 500ms
        max_period: 5m
        max_retries: 10
  pod_logs:
    discovery: all
    enabled: true
    namespaceAllowlist:
    - .*
    scrapeInterval: 3m
    stages:
    - json:
        expressions:
          level: level
          message: msg
          timestamp: time
    - timestamp:
        format: RFC3339Nano
        source: timestamp
    - labels:
        level: null

metrics:
  enabled: true
  scrapeInterval: 3m

opencost:
  enabled: false

prometheus-node-exporter:
  enabled: true

traces:
  enabled: false

Alloy Installation

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create namespace
kubectl create namespace monitoring

# Install Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml \
  --set cluster.name=${CLUSTER_NAME} \
  --set externalServices.loki.tenantId=${TENANT_ID} \
  --set externalServices.prometheus.tenantId=${TENANT_ID}
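Once the release is applied, it is worth confirming that the collectors started and are not logging push errors. The label selector below assumes the chart's standard Helm instance label (set to the release name):

kubectl get pods -n monitoring

# Look for remote_write/push errors in the collector logs
kubectl logs -n monitoring -l app.kubernetes.io/instance=alloy --tail=100 | grep -iE "error|failed"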

2. Loki Configuration

Loki is responsible for log aggregation and storage.

# values-loki.yaml
nameOverride: "loki-distributed"

compactor:
  enabled: true
  persistence:
    enabled: false
  replicas: 1

distributor:
  maxUnavailable: 1
  replicas: 2

extraEnvVars:
- name: AWS_ACCESS_KEY_ID
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: access_key_id
- name: AWS_SECRET_ACCESS_KEY
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: secret_access_key

gateway:
  enabled: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
    - host: loki.${DOMAIN}
      paths:
      - path: /
        pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP

ingester:
  maxUnavailable: 1
  persistence:
    enabled: false
  replicas: 2

loki:
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 1
  structuredConfig:
    auth_enabled: true
    common:
      storage:
        filesystem:
          chunks_directory: /var/loki/chunks
          rules_directory: /var/loki/rules
    limits_config:
      ingestion_burst_size_mb: 6
      ingestion_rate_mb: 4
      max_global_streams_per_user: 15000
      max_query_series: 500
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 24h
      volume_enabled: true

    compactor:
      working_directory: /var/loki/compactor
      shared_store: s3
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 1h
      retention_delete_worker_count: 150

    memberlist:
      join_members:
        - loki-loki-distributed-memberlist.loki.svc.cluster.local:7946

    schema_config:
      configs:
        - from: "2020-10-24"
          index:
            period: 24h
            prefix: index_
          object_store: aws
          schema: v11
          store: boltdb-shipper

    server:
      grpc_listen_port: 9095
      http_listen_port: 3100

    storage_config:
      aws:
        access_key_id: ${AWS_ACCESS_KEY_ID}
        bucketnames: ${LOKI_S3_BUCKET}
        region: ${AWS_REGION}
        s3: s3://s3.amazonaws.com
        secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/boltdb-cache
        shared_store: s3

minio:
  enabled: false

nginx:
  enabled: false

querier:
  replicas: 1

query_frontend:
  maxUnavailable: 0
  replicas: 1

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi

ruler:
  enabled: false

singleBinary:
  enabled: false

Loki Installation

# Create the namespace and the secret for AWS credentials
kubectl create namespace loki

kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki

# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --create-namespace \
  --values values-loki.yaml
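A minimal smoke test, assuming the gateway service name created by this release (loki-loki-distributed-gateway) and that TENANT_ID is exported in your shell: check readiness, push a single log line with the tenant header (required because auth_enabled is true), then query it back.

kubectl -n loki port-forward svc/loki-loki-distributed-gateway 3100:80 &
sleep 2
curl -s http://localhost:3100/ready

# Push one test entry
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: ${TENANT_ID}" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$(date +%s%N)\",\"hello loki\"]]}]}"

# Query it back
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  -H "X-Scope-OrgID: ${TENANT_ID}" \
  --data-urlencode 'query={job="smoke-test"}'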

3. Mimir Configuration

Mimir is responsible for long-term metrics storage.

# values-mimir.yaml
alertmanager:
  enabled: false

compactor:
  persistentVolume:
    enabled: false
  replicas: 1

distributor:
  replicas: 2
  extraEnvVars:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: access_key_id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: secret_access_key

gateway:
  enabled: true
  enabledNonEnterprise: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
      - host: mimir.${DOMAIN}
        paths:
          - path: /
            pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP

ingester:
  persistentVolume:
    enabled: false
  replicas: 2

mimir:
  structuredConfig:
    alertmanager_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_ALERTS_BUCKET}
    blocks_storage:
      backend: s3
      bucket_store:
        sync_dir: /data/tsdb-sync
      s3:
        bucket_name: ${MIMIR_BLOCKS_BUCKET}
      tsdb:
        dir: /data/tsdb
    common:
      storage:
        backend: s3
        s3:
          access_key_id: ${AWS_ACCESS_KEY_ID}
          bucket_name: ${MIMIR_BLOCKS_BUCKET}
          endpoint: s3.amazonaws.com
          region: ${AWS_REGION}
          secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    frontend:
      log_queries_longer_than: 10s
    limits:
      ingestion_burst_size: 150000
      ingestion_rate: 50000
      max_global_series_per_metric: 150000
      max_global_series_per_user: 2000000
    memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - mimir-gossip-ring.mimir.svc.cluster.local:7946
    multitenancy_enabled: true
    ruler_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_RULES_BUCKET}
    server:
      grpc_listen_port: 9095
      http_listen_port: 8080

minio:
  enabled: false

mode: microservices

nginx:
  enabled: false

querier:
  replicas: 1

query_frontend:
  replicas: 1

query_scheduler:
  enabled: true
  replicas: 2
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi

ruler:
  enabled: false

store_gateway:
  persistentVolume:
    enabled: false
  replicas: 1

Mimir Installation

# Create the namespace and the secret for AWS credentials (if they don't exist yet)
kubectl create namespace mimir

kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir

# Install Mimir
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --create-namespace \
  --values values-mimir.yaml
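A similar check for Mimir, assuming the gateway service name from this release (mimir-gateway) and that Alloy is already remote-writing: hit the readiness endpoint and run an instant query against the Prometheus-compatible API, which Mimir serves under the /prometheus prefix.

kubectl -n mimir port-forward svc/mimir-gateway 8080:80 &
sleep 2
curl -s http://localhost:8080/ready

curl -s -G http://localhost:8080/prometheus/api/v1/query \
  -H "X-Scope-OrgID: ${TENANT_ID}" \
  --data-urlencode 'query=up'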

4. Grafana Configuration

# values-grafana.yaml
fullnameOverride: "grafana"

# Image configuration
image:
  repository: grafana/grafana
  tag: "12.0.0"

# Admin credentials using secret
admin:
  existingSecret: "grafana-admin-credentials"
  userKey: "admin-user"
  passwordKey: "admin-password"

# Service configuration
service:
  type: ClusterIP
  port: 80

# Persistence
persistence:
  enabled: true
  size: 5Gi

# Database configuration using secrets
env:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: ${DATABASE_HOST}
  GF_DATABASE_PORT: "5432"
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: ${DATABASE_USER}
  GF_DATABASE_PASSWORD: ${DATABASE_PASSWORD}
  GF_DATABASE_SSL_MODE: disable
  GF_LOG_LEVEL: warn

# Grafana configuration
grafana.ini:
  server:
    domain: grafana.${DOMAIN}
    root_url: https://grafana.${DOMAIN}

# Ingress configuration
ingress:
  enabled: true
  ingressClassName: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /api/health
    alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
  hosts:
    - grafana.${DOMAIN}
  path: /
  pathType: Prefix

Grafana Installation

# Create the namespace and the admin credentials secret
kubectl create namespace grafana

kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana

# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --create-namespace \
  --values values-grafana.yaml
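To confirm Grafana is healthy before the ALB finishes provisioning, hit the built-in health endpoint through a port-forward (the service name comes from fullnameOverride: "grafana"):

kubectl -n grafana port-forward svc/grafana 3000:80 &
sleep 2
curl -s http://localhost:3000/api/health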

5. IngressClass Configuration

# ingress-class.yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: alb
spec:
  controller: ingress.k8s.aws/alb
kubectl apply -f ingress-class.yaml
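You can confirm the class is registered before the charts' Ingress resources reference it:

kubectl get ingressclass alb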

6. Grafana Agent Configuration (Standalone)

For monitoring hosts that run outside the cluster, a standalone Grafana Agent can push metrics and logs to the same Mimir and Loki endpoints through the ingresses configured above:

# agent-config.yaml
server:
  log_level: debug

integrations:
  node_exporter:
    enabled: true
    rootfs_path: /
    sysfs_path: /sys
    procfs_path: /proc
    set_collectors:
    - uname
    - cpu
    - loadavg
    - meminfo
    - filesystem
    - netdev
    - diskstats
    - cpufreq
    - os
    - time
    - xfs
    - cpu_guest_seconds_metric
    - boottime
    - systemd
    - processes
    - nvme
    - nfs
    - netstat
    - logind
    - stat
    - vmstat
    relabel_configs:
    - action: replace
      replacement: '${INSTANCE_NAME}'
      target_label: instance
    - action: replace
      replacement: '${TENANT_ID}'
      target_label: tenant

metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 30s
    remote_write:
    - url: https://mimir.${DOMAIN}/api/v1/push
      headers:
        X-Scope-OrgID: ${TENANT_ID}
  configs:
  - name: default
    scrape_configs:
    - job_name: agent
      static_configs:
      - targets: [ '127.0.0.1:9090' ]

logs:
  configs:
  - name: default
    clients:
    - url: https://loki.${DOMAIN}/loki/api/v1/push
      headers:
        X-Scope-OrgID: ${TENANT_ID}
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
    - job_name: docker
      docker_sd_configs:
      - host: "unix:///var/run/docker.sock"
        refresh_interval: 30s
      relabel_configs:
      - source_labels: [ '__meta_docker_container_name' ]
        target_label: 'container'
      - source_labels: [ '__meta_docker_container_name' ]
        target_label: 'service_name'
      - source_labels: [ '__meta_docker_container_log_stream' ]
        target_label: 'stream'
      - action: replace
        replacement: '${INSTANCE_NAME}'
        target_label: instance
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant
    - job_name: syslog
      static_configs:
      - targets: [ 'localhost' ]
        labels:
          job: syslog
          tenant: ${TENANT_ID}
      relabel_configs:
      - action: replace
        target_label: instance
        replacement: '${INSTANCE_NAME}'
      - action: replace
        target_label: tenant
        replacement: '${TENANT_ID}'
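A minimal way to run this configuration on a host, assuming the Grafana Agent (static mode) binary is installed; the -config.expand-env flag substitutes the ${INSTANCE_NAME} and ${TENANT_ID} placeholders from the environment at load time. The values below are illustrative.

export INSTANCE_NAME="app-server-01"
export TENANT_ID="company"

grafana-agent -config.file=agent-config.yaml -config.expand-env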

7. Complete Deployment Script

#!/bin/bash

# Environment variables
export CLUSTER_NAME="production-cluster"
export TENANT_ID="company"
export DOMAIN="monitoring.company.com"
export ALB_GROUP_NAME="monitoring-alb"
export TLS_CERTIFICATE_ARN="arn:aws:acm:region:account:certificate/cert-id"
export AWS_REGION="us-east-1"
export LOKI_S3_BUCKET="company-loki-data"
export MIMIR_BLOCKS_BUCKET="company-mimir-blocks"
export MIMIR_ALERTS_BUCKET="company-mimir-alerts"
export MIMIR_RULES_BUCKET="company-mimir-rules"
export DATABASE_HOST="postgres.company.internal"
export DATABASE_USER="grafana"
export DATABASE_PASSWORD="secure-password"
export GRAFANA_ADMIN_USER="admin"
export GRAFANA_ADMIN_PASSWORD="admin-password"
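
# NOTE: Helm does not expand ${VAR} placeholders inside values files. One way to
# render them is envsubst (GNU gettext); listing the variables explicitly keeps
# relabel placeholders such as ${1} in values-alloy.yaml untouched.
VARS='${CLUSTER_NAME} ${TENANT_ID} ${DOMAIN} ${ALB_GROUP_NAME} ${TLS_CERTIFICATE_ARN} ${AWS_REGION} ${LOKI_S3_BUCKET} ${MIMIR_BLOCKS_BUCKET} ${MIMIR_ALERTS_BUCKET} ${MIMIR_RULES_BUCKET} ${DATABASE_HOST} ${DATABASE_USER} ${DATABASE_PASSWORD} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}'
for f in values-alloy.yaml values-loki.yaml values-mimir.yaml values-grafana.yaml; do
  envsubst "$VARS" < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done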

# Add Helm repositories
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create namespaces
kubectl create namespace monitoring
kubectl create namespace loki
kubectl create namespace mimir
kubectl create namespace grafana

# Create secrets
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki

kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir

kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana

# Deploy Loki
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --values values-loki.yaml

# Deploy Mimir
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --values values-mimir.yaml

# Deploy Grafana
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --values values-grafana.yaml

# Deploy Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml

echo "Deployment completed! Wait a few minutes for all pods to be ready."

8. Verification and Monitoring

Check pod status:

kubectl get pods -n monitoring
kubectl get pods -n loki  
kubectl get pods -n mimir
kubectl get pods -n grafana

Check ingress:

kubectl get ingress -A

Access component logs (workload names follow the release and chart naming used above):

kubectl logs -f -n monitoring -l app.kubernetes.io/instance=alloy
kubectl logs -f -n loki deployment/loki-loki-distributed-distributor
kubectl logs -f -n mimir deployment/mimir-distributor

9. Grafana Data Sources Configuration

After deployment, configure the following data sources in Grafana:

Loki Data Source:

  • URL: http://loki-loki-distributed-gateway.loki.svc.cluster.local
  • Headers: X-Scope-OrgID: ${TENANT_ID}

Mimir Data Source (added as a Prometheus-type data source):

  • URL: http://mimir-gateway.mimir.svc.cluster.local/prometheus (Mimir serves its Prometheus-compatible API under the /prometheus prefix)
  • Headers: X-Scope-OrgID: ${TENANT_ID}
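Before saving the data sources, the endpoints and tenant header can be exercised from inside the cluster with a throwaway curl pod (${TENANT_ID} is expanded by your local shell):

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -H "X-Scope-OrgID: ${TENANT_ID}" \
  http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/labels

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -H "X-Scope-OrgID: ${TENANT_ID}" \
  http://mimir-gateway.mimir.svc.cluster.local/prometheus/api/v1/labels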

10. Production Considerations

Security:

  • Use Kubernetes secrets for sensitive credentials
  • Configure appropriate RBAC
  • Enable TLS for all communications

Performance:

  • Adjust resource limits based on data volume
  • Configure appropriate retention
  • Use suitable storage classes

Backup:

  • Configure S3 data backup
  • Document restore procedures
  • Regularly test backups

Monitoring:

  • Configure alerts for component failures
  • Monitor resource usage
  • Set up health check dashboards

Conclusion

This implementation provides a complete observability solution for Kubernetes environments, with high availability, scalability, and native AWS integration. The stack is suitable for production environments and can be customized according to specific needs.

Next Steps:

  1. Configure custom dashboards
  2. Implement critical alerts
  3. Configure retention policies
  4. Optimize performance based on real metrics
