
Thiago da Silva

Implementing the Grafana K8s Stack on a Kubernetes Cluster

Introduction

This article describes the end-to-end deployment of an observability stack based on the Grafana ecosystem on a Kubernetes cluster. The stack includes:

  • Grafana Alloy: metrics and log collection agent
  • Loki: log aggregation system
  • Mimir: long-term metrics storage
  • Grafana: visualization and dashboarding interface

Solution Architecture

The solution uses a distributed architecture in which:

  1. Alloy collects metrics and logs from the cluster
  2. Loki stores and indexes logs with configurable retention
  3. Mimir stores long-term metrics
  4. Grafana provides the visualization interface
  5. The AWS Load Balancer Controller manages ingress

Prerequisites

  • A running Kubernetes cluster
  • Helm 3.x installed
  • AWS Load Balancer Controller configured
  • Valid TLS certificates
  • S3 buckets configured for storage
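
Before moving on, a quick pre-flight check can save debugging later. This is a minimal sketch; the aws-load-balancer-controller deployment name and the bucket variables are assumptions and may differ in your environment.

# Pre-flight check (adjust names to your environment)
kubectl version
helm version --short
kubectl get deployment -n kube-system aws-load-balancer-controller
aws s3 ls "s3://${LOKI_S3_BUCKET}" >/dev/null && echo "Loki bucket reachable"
aws s3 ls "s3://${MIMIR_BLOCKS_BUCKET}" >/dev/null && echo "Mimir blocks bucket reachable"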

1. Grafana Alloy Configuration

Alloy is responsible for collecting metrics and logs from the Kubernetes cluster.

# values-alloy.yaml
alloy-logs:
  config:
    client:
      timeout: 30s
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  enabled: true

alloy-metrics:
  config:
    global:
      external_labels:
        cluster: ${CLUSTER_NAME}
        source: alloy-agent
      scrape_interval: 3m
      scrape_timeout: 2m30s
    remote_write:
    - headers:
        X-Scope-OrgID: ${TENANT_ID}
      queue_config:
        batch_send_deadline: 30s
        max_backoff: 10s
        max_retries: 10
        max_samples_per_send: 1000
        min_backoff: 100ms
        retry_on_http_429: true
      send_exemplars: true
      url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
  enabled: true
  resources:
    limits:
      memory: 768Mi
    requests:
      cpu: 250m
      memory: 384Mi

cluster:
  name: ${CLUSTER_NAME}

clusterEvents:
  enabled: true
  scrapeInterval: 3m

clusterMetrics:
  cadvisor:
    enabled: true
    metricRelabelings:
    - action: keep
      regex: container_cpu_.*|container_memory_.*|container_network_.*|container_fs_.*|machine_cpu_.*|container_spec_.*
      sourceLabels:
      - __name__
    relabelings:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
  enabled: true
  kubelet:
    enabled: true
  nodeExporter:
    enabled: true

externalServices:
  loki:
    externalLabels:
      cluster: ${CLUSTER_NAME}
      tenant: ${TENANT_ID}
    host: http://loki-loki-distributed-gateway.loki.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /loki/api/v1/push
  prometheus:
    extraHeaders:
      X-Scope-OrgID: ${TENANT_ID}
    host: http://mimir-gateway.mimir.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /api/v1/push

extraScrapeConfigs: |-
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
    scrape_interval: 3m
    scrape_timeout: 2m30s

  - job_name: 'kubernetes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_cpu_.*|container_memory_.*|machine_cpu_.*|container_fs_.*|container_network_.*|container_spec_.*'
      action: keep
    scrape_interval: 3m
    scrape_timeout: 2m30s

kube-state-metrics:
  enabled: true
  metricLabelsAllowlist:
  - pods=[app,name,component]
  - deployments=[app,name,component]
  - statefulsets=[app,name,component]
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi

logs:
  enabled: true
  extraClientConfig: |
    client:
      timeout: 10s
      batchwait: 1s 
      batchsize: 1048576
      max_retries: 5
      retry_on_status_codes: [429, 500, 502, 503, 504]
      backoff_config:
        min_period: 500ms
        max_period: 5m
        max_retries: 10
  pod_logs:
    discovery: all
    enabled: true
    namespaceAllowlist:
    - .*
    scrapeInterval: 3m
    stages:
    - json:
        expressions:
          level: level
          message: msg
          timestamp: time
    - timestamp:
        format: RFC3339Nano
        source: timestamp
    - labels:
        level: null

metrics:
  enabled: true
  scrapeInterval: 3m

opencost:
  enabled: false

prometheus-node-exporter:
  enabled: true

traces:
  enabled: false
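
One detail worth calling out: the values files in this article contain ${VAR} placeholders, and Helm does not expand environment variables inside values files on its own. A simple option, assuming envsubst (GNU gettext) is available, is to render the file before installing. Restricting the variable list matters here because the cAdvisor relabel rule above uses ${1}, which must be left untouched.

# Render the values file, substituting only the intended variables
envsubst '${CLUSTER_NAME} ${TENANT_ID}' \
  < values-alloy.yaml > values-alloy.rendered.yaml
# Then pass values-alloy.rendered.yaml to helm instead of values-alloy.yaml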

Installing Alloy

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace monitoring

# Install Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml \
  --set cluster.name=${CLUSTER_NAME} \
  --set externalServices.loki.tenantId=${TENANT_ID} \
  --set externalServices.prometheus.tenantId=${TENANT_ID}
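
Once the chart is installed, it is worth confirming that the collectors came up and that remote_write is not failing. The label selector below is an assumption about the chart's default labels; adjust it if your workload names differ.

# List the Alloy workloads created by the chart
kubectl get pods -n monitoring

# Look for remote_write / push errors in the collector logs
kubectl logs -n monitoring -l app.kubernetes.io/instance=alloy --tail=100 \
  | grep -iE "error|failed" || echo "no obvious errors"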

2. Loki Configuration

Loki is responsible for log aggregation and storage.

# values-loki.yaml
nameOverride: "loki-distributed"

compactor:
  enabled: true
  persistence:
    enabled: false
  replicas: 1

distributor:
  maxUnavailable: 1
  replicas: 2

extraEnvVars:
- name: AWS_ACCESS_KEY_ID
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: access_key_id
- name: AWS_SECRET_ACCESS_KEY
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: secret_access_key

gateway:
  enabled: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
    - host: loki.${DOMAIN}
      paths:
      - path: /
        pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP

ingester:
  maxUnavailable: 1
  persistence:
    enabled: false
  replicas: 2

loki:
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 1
  structuredConfig:
    auth_enabled: true
    common:
      storage:
        filesystem:
          chunks_directory: /var/loki/chunks
          rules_directory: /var/loki/rules
    limits_config:
      ingestion_burst_size_mb: 6
      ingestion_rate_mb: 4
      max_global_streams_per_user: 15000
      max_query_series: 500
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 24h
      volume_enabled: true

    compactor:
      working_directory: /var/loki/compactor
      shared_store: s3
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 1h
      retention_delete_worker_count: 150

    memberlist:
      join_members:
        - loki-loki-distributed-memberlist.loki.svc.cluster.local:7946

    schema_config:
      configs:
        - from: "2020-10-24"
          index:
            period: 24h
            prefix: index_
          object_store: aws
          schema: v11
          store: boltdb-shipper

    server:
      grpc_listen_port: 9095
      http_listen_port: 3100

    storage_config:
      aws:
        access_key_id: ${AWS_ACCESS_KEY_ID}
        bucketnames: ${LOKI_S3_BUCKET}
        region: ${AWS_REGION}
        s3: s3://s3.amazonaws.com
        secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/boltdb-cache
        shared_store: s3

minio:
  enabled: false

nginx:
  enabled: false

querier:
  replicas: 1

query_frontend:
  maxUnavailable: 0
  replicas: 1

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi

ruler:
  enabled: false

singleBinary:
  enabled: false

Installing Loki

# Create the AWS credentials secret
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki

# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --create-namespace \
  --values values-loki.yaml
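
To confirm Loki is accepting writes for your tenant, you can port-forward the gateway and push a test line. This is a sketch that assumes the gateway service name generated by the values above (loki-loki-distributed-gateway); since auth_enabled is true, the X-Scope-OrgID header is required on every request.

# Port-forward the Loki gateway locally
kubectl port-forward -n loki svc/loki-loki-distributed-gateway 3100:80 &

# Readiness check
curl -s http://localhost:3100/ready

# Push a test log line for the tenant, then read the labels back
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: ${TENANT_ID}" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$(date +%s%N)\",\"hello from the smoke test\"]]}]}"

curl -s -H "X-Scope-OrgID: ${TENANT_ID}" http://localhost:3100/loki/api/v1/labels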

3. Mimir Configuration

Mimir is responsible for long-term metrics storage.

# values-mimir.yaml
alertmanager:
  enabled: false

compactor:
  persistentVolume:
    enabled: false
  replicas: 1

distributor:
  replicas: 2
  extraEnvVars:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: access_key_id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: secret_access_key

gateway:
  enabled: true
  enabledNonEnterprise: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
      - host: mimir.${DOMAIN}
        paths:
          - path: /
            pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP

ingester:
  persistentVolume:
    enabled: false
  replicas: 2

mimir:
  structuredConfig:
    alertmanager_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_ALERTS_BUCKET}
    blocks_storage:
      backend: s3
      bucket_store:
        sync_dir: /data/tsdb-sync
      s3:
        bucket_name: ${MIMIR_BLOCKS_BUCKET}
      tsdb:
        dir: /data/tsdb
    common:
      storage:
        backend: s3
        s3:
          access_key_id: ${AWS_ACCESS_KEY_ID}
          bucket_name: ${MIMIR_BLOCKS_BUCKET}
          endpoint: s3.amazonaws.com
          region: ${AWS_REGION}
          secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    frontend:
      log_queries_longer_than: 10s
    limits:
      ingestion_burst_size: 150000
      ingestion_rate: 50000
      max_global_series_per_metric: 150000
      max_global_series_per_user: 2000000
    memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - mimir-gossip-ring.mimir.svc.cluster.local:7946
    multitenancy_enabled: true
    ruler_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_RULES_BUCKET}
    server:
      grpc_listen_port: 9095
      http_listen_port: 8080

minio:
  enabled: false

mode: microservices

nginx:
  enabled: false

querier:
  replicas: 1

query_frontend:
  replicas: 1

query_scheduler:
  enabled: true
  replicas: 2
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi

ruler:
  enabled: false

store_gateway:
  persistentVolume:
    enabled: false
  replicas: 1

Installing Mimir

# Create the AWS credentials secret (if it does not already exist)
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir

# Install Mimir
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --create-namespace \
  --values values-mimir.yaml
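
A similar smoke test works for Mimir through its gateway. The query path below is Mimir's Prometheus-compatible API under the default /prometheus prefix, and the service name matches the remote_write URL used in the Alloy values.

# Port-forward the Mimir gateway locally
kubectl port-forward -n mimir svc/mimir-gateway 8080:80 &

# Readiness check
curl -s http://localhost:8080/ready

# After Alloy has been scraping for a few minutes, the tenant should have series
curl -s -H "X-Scope-OrgID: ${TENANT_ID}" \
  "http://localhost:8080/prometheus/api/v1/query?query=up"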

4. Grafana Configuration

# values-grafana.yaml
fullnameOverride: "grafana"

# Image configuration
image:
  repository: grafana/grafana
  tag: "12.0.0"

# Admin credentials using secret
admin:
  existingSecret: "grafana-admin-credentials"
  userKey: "admin-user"
  passwordKey: "admin-password"

# Service configuration
service:
  type: ClusterIP
  port: 80

# Persistence
persistence:
  enabled: true
  size: 5Gi

# Database configuration using secrets
env:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: ${DATABASE_HOST}
  GF_DATABASE_PORT: "5432"
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: ${DATABASE_USER}
  GF_DATABASE_PASSWORD: ${DATABASE_PASSWORD}
  GF_DATABASE_SSL_MODE: disable
  GF_LOG_LEVEL: warn

# Grafana configuration
grafana.ini:
  server:
    domain: grafana.${DOMAIN}
    root_url: https://grafana.${DOMAIN}

# Ingress configuration
ingress:
  enabled: true
  ingressClassName: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /api/health
    alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
  hosts:
    - grafana.${DOMAIN}
  path: /
  pathType: Prefix

Installing Grafana

# Create the required secrets
kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana

# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --create-namespace \
  --values values-grafana.yaml
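
After the install, you can pull the admin password back out of the secret created above and check Grafana's health endpoint (the same path used by the ALB health check).

# Retrieve the admin password
kubectl get secret grafana-admin-credentials -n grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo

# Port-forward and check health
kubectl port-forward -n grafana svc/grafana 3000:80 &
curl -s http://localhost:3000/api/health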

5. IngressClass Configuration

# ingress-class.yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: alb
spec:
  controller: ingress.k8s.aws/alb
kubectl apply -f ingress-class.yaml

6. Grafana Agent Configuration (Standalone)

For monitoring resources outside the cluster:

# agent-config.yaml
server:
  log_level: debug

integrations:
  node_exporter:
    enabled: true
    rootfs_path: /
    sysfs_path: /sys
    procfs_path: /proc
    set_collectors:
    - uname
    - cpu
    - loadavg
    - meminfo
    - filesystem
    - netdev
    - diskstats
    - cpufreq
    - os
    - time
    - xfs
    - cpu_guest_seconds_metric
    - boottime
    - systemd
    - processes
    - nvme
    - nfs
    - netstat
    - logind
    - stat
    - vmstat
    relabel_configs:
    - action: replace
      replacement: '${INSTANCE_NAME}'
      target_label: instance
    - action: replace
      replacement: '${TENANT_ID}'
      target_label: tenant

metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 30s
    remote_write:
    - url: https://mimir.${DOMAIN}/api/v1/push
      headers:
        X-Scope-OrgID: ${TENANT_ID}
  configs:
  - name: default
    scrape_configs:
    - job_name: agent
      static_configs:
      - targets: [ '127.0.0.1:9090' ]

logs:
  configs:
  - name: default
    clients:
    - url: https://loki.${DOMAIN}/loki/api/v1/push
      headers:
        X-Scope-OrgID: ${TENANT_ID}
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
    - job_name: docker
      docker_sd_configs:
      - host: "unix:///var/run/docker.sock"
        refresh_interval: 30s
      relabel_configs:
      - source_labels: [ '__meta_docker_container_name' ]
        target_label: 'container'
      - source_labels: [ '__meta_docker_container_name' ]
        target_label: 'service_name'
      - source_labels: [ '__meta_docker_container_log_stream' ]
        target_label: 'stream'
      - action: replace
        replacement: '${INSTANCE_NAME}'
        target_label: instance
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant
    - job_name: syslog
      static_configs:
      - targets: [ 'localhost' ]
        labels:
          job: syslog
          tenant: ${TENANT_ID}
      relabel_configs:
      - action: replace
        target_label: instance
        replacement: '${INSTANCE_NAME}'
      - action: replace
        target_label: tenant
        replacement: '${TENANT_ID}'
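
One way to run the standalone agent with this configuration is the grafana/agent image in static mode. The sketch below makes a few assumptions: the file is saved on the host as /etc/agent/agent-config.yaml, -config.expand-env resolves the ${INSTANCE_NAME}/${TENANT_ID} placeholders from the environment, and the Docker socket mount is needed for the docker_sd_configs log job. Pin a specific image tag in production and adjust mounts so the node_exporter integration can see the host paths referenced in the config.

docker run -d --name grafana-agent --pid=host \
  -e INSTANCE_NAME="my-instance" -e TENANT_ID="company" \
  -v /etc/agent/agent-config.yaml:/etc/agent/agent-config.yaml:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  grafana/agent:latest \
  -config.file=/etc/agent/agent-config.yaml \
  -config.expand-env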

7. Complete Deployment Script

#!/bin/bash

# Environment variables
export CLUSTER_NAME="production-cluster"
export TENANT_ID="company"
export DOMAIN="monitoring.company.com"
export ALB_GROUP_NAME="monitoring-alb"
export TLS_CERTIFICATE_ARN="arn:aws:acm:region:account:certificate/cert-id"
export AWS_REGION="us-east-1"
export LOKI_S3_BUCKET="company-loki-data"
export MIMIR_BLOCKS_BUCKET="company-mimir-blocks"
export MIMIR_ALERTS_BUCKET="company-mimir-alerts"
export MIMIR_RULES_BUCKET="company-mimir-rules"
export DATABASE_HOST="postgres.company.internal"
export DATABASE_USER="grafana"
export DATABASE_PASSWORD="secure-password"
export GRAFANA_ADMIN_USER="admin"
export GRAFANA_ADMIN_PASSWORD="admin-password"

# Add Helm repositories
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create namespaces
kubectl create namespace monitoring
kubectl create namespace loki
kubectl create namespace mimir
kubectl create namespace grafana

# Create secrets (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must already be exported)
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki

kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir

kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana

# Deploy Loki
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --values values-loki.yaml

# Deploy Mimir
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --values values-mimir.yaml

# Deploy Grafana
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --values values-grafana.yaml

# Deploy Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml

echo "Deploy concluído! Aguarde alguns minutos para todos os pods ficarem prontos."
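
Rather than waiting blindly, the script can block until the main workloads are ready. This is a sketch that could be appended to the script; the rollout targets below follow the naming seen earlier in this article and may need adjusting to your releases.

# Optionally wait for the core workloads to become ready
kubectl rollout status deployment/grafana -n grafana --timeout=5m
kubectl rollout status deployment/loki-loki-distributed-gateway -n loki --timeout=5m
kubectl rollout status deployment/mimir-gateway -n mimir --timeout=5m
kubectl get pods -n monitoring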

8. Verification and Monitoring

Check pod status:

kubectl get pods -n monitoring
kubectl get pods -n loki  
kubectl get pods -n mimir
kubectl get pods -n grafana

Check the ingresses:

kubectl get ingress -A

View logs:

kubectl logs -f deployment/alloy -n monitoring
kubectl logs -f deployment/loki-loki-distributed-distributor -n loki
kubectl logs -f deployment/mimir-distributor -n mimir

9. Configuring Data Sources in Grafana

After the deployment, configure the following data sources in Grafana:

Loki Data Source:

  • URL: http://loki-loki-distributed-gateway.loki.svc.cluster.local
  • Headers: X-Scope-OrgID: ${TENANT_ID}

Mimir Data Source:

  • URL: http://mimir-gateway.mimir.svc.cluster.local
  • Headers: X-Scope-OrgID: ${TENANT_ID}
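
Before adding them, you can confirm that both in-cluster URLs answer for your tenant from inside the cluster. A sketch using a throwaway curl pod (the tenant value "company" is a placeholder):

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- sh -c '
  curl -s -H "X-Scope-OrgID: company" http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/labels
  curl -s -H "X-Scope-OrgID: company" http://mimir-gateway.mimir.svc.cluster.local/prometheus/api/v1/labels
'

In the Grafana UI, the header goes under Custom HTTP Headers in each data source; Loki uses the Loki data source type, and Mimir is added as a Prometheus data source (depending on your versions, the Mimir URL may need the /prometheus suffix).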

10. Production Considerations

Security:

  • Use Kubernetes secrets for sensitive credentials
  • Configure appropriate RBAC
  • Enable TLS for all communication

Performance:

  • Tune resource limits based on data volume
  • Configure appropriate retention
  • Use suitable storage classes

Backup:

  • Configure backups of the S3 data (see the sketch below)
  • Document restore procedures
  • Test backups regularly

Monitoring:

  • Configure alerts for component failures
  • Monitor resource usage
  • Set up health-check dashboards
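
For the backup item above, a low-effort starting point is enabling S3 versioning plus a lifecycle rule on the buckets, so accidental deletions are recoverable and old noncurrent versions eventually expire. A sketch with the AWS CLI, using the bucket variables from the deployment script (adjust the retention window to your needs):

# Enable versioning on the Loki data bucket (repeat for the Mimir buckets)
aws s3api put-bucket-versioning --bucket "${LOKI_S3_BUCKET}" \
  --versioning-configuration Status=Enabled

# Expire noncurrent object versions after 30 days
aws s3api put-bucket-lifecycle-configuration --bucket "${LOKI_S3_BUCKET}" \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-noncurrent","Status":"Enabled","Filter":{},"NoncurrentVersionExpiration":{"NoncurrentDays":30}}]}'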

Conclusion

This implementation provides a complete observability solution for Kubernetes environments, with high availability, scalability, and native AWS integration. The stack is suitable for production environments and can be customized to meet specific needs.
