Implementing the Grafana K8s Stack on a Kubernetes Cluster
Introduction
This article walks through the full implementation of an observability stack based on the Grafana ecosystem on a Kubernetes cluster. The stack includes:
- Grafana Alloy: metrics and log collection agent
- Loki: log aggregation system
- Mimir: long-term metrics storage
- Grafana: visualization and dashboard interface
Solution Architecture
The solution uses a distributed architecture in which:
- Alloy collects metrics and logs from the cluster and pushes them to the gateways (see the smoke-test sketch after this list)
- Loki stores and indexes logs with configurable retention
- Mimir stores metrics long term
- Grafana provides the visualization layer
- The AWS Load Balancer Controller manages ingress
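Both Loki and Mimir run with multi-tenancy enabled, so every write and read in this architecture carries an X-Scope-OrgID header. A minimal sketch of the write path, assuming the in-cluster service names used throughout this article and an illustrative tenant called company (Loki accepts JSON pushes, which makes the flow easy to exercise from a throwaway pod):
# Push one synthetic log line to the Loki gateway, then list label values to confirm it landed
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- sh -c '
  curl -s -X POST -H "Content-Type: application/json" -H "X-Scope-OrgID: company" \
    --data "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$(date +%s)000000000\",\"hello from the smoke test\"]]}]}" \
    http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/push
  curl -s -H "X-Scope-OrgID: company" \
    http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/label/job/values'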
Prerequisites
- A running Kubernetes cluster
- Helm 3.x installed
- AWS Load Balancer Controller configured
- Valid TLS certificates
- S3 buckets provisioned for storage (a quick pre-flight check is sketched after this list)
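A minimal pre-flight sketch for these requirements, assuming the AWS Load Balancer Controller runs in its usual kube-system namespace and the AWS CLI is configured (the bucket and region variables are defined in the deploy script of section 7):
kubectl version                  # cluster reachable?
helm version --short             # Helm 3.x?
kubectl get deployment aws-load-balancer-controller -n kube-system
aws s3 ls "s3://${LOKI_S3_BUCKET}" --region "${AWS_REGION}"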
1. Grafana Alloy Configuration
Alloy is responsible for collecting metrics and logs from the Kubernetes cluster.
# values-alloy.yaml
alloy-logs:
  config:
    client:
      timeout: 30s
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
  enabled: true
alloy-metrics:
  config:
    global:
      external_labels:
        cluster: ${CLUSTER_NAME}
        source: alloy-agent
      scrape_interval: 3m
      scrape_timeout: 2m30s
    remote_write:
      - headers:
          X-Scope-OrgID: ${TENANT_ID}
        queue_config:
          batch_send_deadline: 30s
          max_backoff: 10s
          max_retries: 10
          max_samples_per_send: 1000
          min_backoff: 100ms
          retry_on_http_429: true
        send_exemplars: true
        url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
  enabled: true
  resources:
    limits:
      memory: 768Mi
    requests:
      cpu: 250m
      memory: 384Mi
cluster:
  name: ${CLUSTER_NAME}
clusterEvents:
  enabled: true
  scrapeInterval: 3m
clusterMetrics:
  cadvisor:
    enabled: true
    metricRelabelings:
      - action: keep
        regex: container_cpu_.*|container_memory_.*|container_network_.*|container_fs_.*|machine_cpu_.*|container_spec_.*
        sourceLabels:
          - __name__
    relabelings:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  enabled: true
  kubelet:
    enabled: true
  nodeExporter:
    enabled: true
externalServices:
  loki:
    externalLabels:
      cluster: ${CLUSTER_NAME}
      tenant: ${TENANT_ID}
    host: http://loki-loki-distributed-gateway.loki.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /loki/api/v1/push
  prometheus:
    extraHeaders:
      X-Scope-OrgID: ${TENANT_ID}
    host: http://mimir-gateway.mimir.svc.cluster.local
    tenantId: ${TENANT_ID}
    writeEndpoint: /api/v1/push
extraScrapeConfigs: |-
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
    scrape_interval: 3m
    scrape_timeout: 2m30s
  - job_name: 'kubernetes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_cpu_.*|container_memory_.*|machine_cpu_.*|container_fs_.*|container_network_.*|container_spec_.*'
        action: keep
    scrape_interval: 3m
    scrape_timeout: 2m30s
kube-state-metrics:
  enabled: true
  metricLabelsAllowlist:
    - pods=[app,name,component]
    - deployments=[app,name,component]
    - statefulsets=[app,name,component]
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
logs:
  enabled: true
  extraClientConfig: |
    client:
      timeout: 10s
      batchwait: 1s
      batchsize: 1048576
      max_retries: 5
      retry_on_status_codes: [429, 500, 502, 503, 504]
      backoff_config:
        min_period: 500ms
        max_period: 5m
        max_retries: 10
  pod_logs:
    discovery: all
    enabled: true
    namespaceAllowlist:
      - .*
    scrapeInterval: 3m
    stages:
      - json:
          expressions:
            level: level
            message: msg
            timestamp: time
      - timestamp:
          format: RFC3339Nano
          source: timestamp
      - labels:
          level: null
metrics:
  enabled: true
  scrapeInterval: 3m
opencost:
  enabled: false
prometheus-node-exporter:
  enabled: true
traces:
  enabled: false
Installing Alloy
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create the namespace
kubectl create namespace monitoring
# Install Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml \
  --set cluster.name=${CLUSTER_NAME} \
  --set externalServices.loki.tenantId=${TENANT_ID} \
  --set externalServices.prometheus.tenantId=${TENANT_ID}
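To confirm the collectors came up, list the pods and probe one Alloy instance; Alloy serves its UI and a readiness endpoint on port 12345 by default (the pod name below is a placeholder — use one from the kubectl get pods output):
kubectl get pods -n monitoring
# In a separate terminal, replace <alloy-pod> with an actual pod name:
kubectl port-forward -n monitoring <alloy-pod> 12345:12345
curl -s http://localhost:12345/-/ready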
2. Loki Configuration
Loki handles log aggregation and storage.
# values-loki.yaml
nameOverride: "loki-distributed"
compactor:
  enabled: true
  persistence:
    enabled: false
  replicas: 1
distributor:
  maxUnavailable: 1
  replicas: 2
extraEnvVars:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: access_key_id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: secret_access_key
gateway:
  enabled: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
      - host: loki.${DOMAIN}
        paths:
          - path: /
            pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP
ingester:
  maxUnavailable: 1
  persistence:
    enabled: false
  replicas: 2
loki:
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 1
  structuredConfig:
    auth_enabled: true
    common:
      storage:
        filesystem:
          chunks_directory: /var/loki/chunks
          rules_directory: /var/loki/rules
    limits_config:
      ingestion_burst_size_mb: 6
      ingestion_rate_mb: 4
      max_global_streams_per_user: 15000
      max_query_series: 500
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 24h
      volume_enabled: true
    compactor:
      working_directory: /var/loki/compactor
      shared_store: s3
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 1h
      retention_delete_worker_count: 150
    memberlist:
      join_members:
        - loki-loki-distributed-memberlist.loki.svc.cluster.local:7946
    schema_config:
      configs:
        - from: "2020-10-24"
          index:
            period: 24h
            prefix: index_
          object_store: aws
          schema: v11
          store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
    storage_config:
      aws:
        access_key_id: ${AWS_ACCESS_KEY_ID}
        bucketnames: ${LOKI_S3_BUCKET}
        region: ${AWS_REGION}
        s3: s3://s3.amazonaws.com
        secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/boltdb-cache
        shared_store: s3
minio:
  enabled: false
nginx:
  enabled: false
querier:
  replicas: 1
query_frontend:
  maxUnavailable: 0
  replicas: 1
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi
ruler:
  enabled: false
singleBinary:
  enabled: false
Installing Loki
# Create the namespace and the secret for the AWS credentials
kubectl create namespace loki
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki
# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --values values-loki.yaml
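Before pointing Alloy at it, it is worth confirming the gateway answers on the read path; since auth_enabled is true, every request needs the tenant header (the local port is arbitrary):
kubectl get pods -n loki
# In a separate terminal:
kubectl port-forward -n loki svc/loki-loki-distributed-gateway 8080:80
curl -s -H "X-Scope-OrgID: ${TENANT_ID}" http://localhost:8080/loki/api/v1/labels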
3. Mimir Configuration
Mimir is responsible for long-term metrics storage.
# values-mimir.yaml
alertmanager:
  enabled: false
compactor:
  persistentVolume:
    enabled: false
  replicas: 1
distributor:
  replicas: 2
extraEnvVars:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: access_key_id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: secret_access_key
gateway:
  enabled: true
  enabledNonEnterprise: true
  ingress:
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
      alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
      alb.ingress.kubernetes.io/healthcheck-path: /ready
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/target-type: ip
      kubernetes.io/ingress.class: alb
    enabled: true
    hosts:
      - host: mimir.${DOMAIN}
        paths:
          - path: /
            pathType: Prefix
    ingressClassName: alb
  replicas: 1
  service:
    port: 80
    type: ClusterIP
ingester:
  persistentVolume:
    enabled: false
  replicas: 2
mimir:
  structuredConfig:
    alertmanager_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_ALERTS_BUCKET}
    blocks_storage:
      backend: s3
      bucket_store:
        sync_dir: /data/tsdb-sync
      s3:
        bucket_name: ${MIMIR_BLOCKS_BUCKET}
      tsdb:
        dir: /data/tsdb
    common:
      storage:
        backend: s3
        s3:
          access_key_id: ${AWS_ACCESS_KEY_ID}
          bucket_name: ${MIMIR_BLOCKS_BUCKET}
          endpoint: s3.amazonaws.com
          region: ${AWS_REGION}
          secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    frontend:
      log_queries_longer_than: 10s
    limits:
      ingestion_burst_size: 150000
      ingestion_rate: 50000
      max_global_series_per_metric: 150000
      max_global_series_per_user: 2000000
    memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - mimir-gossip-ring.mimir.svc.cluster.local:7946
    multitenancy_enabled: true
    ruler_storage:
      backend: s3
      s3:
        bucket_name: ${MIMIR_RULES_BUCKET}
    server:
      grpc_listen_port: 9095
      http_listen_port: 8080
minio:
  enabled: false
mode: microservices
nginx:
  enabled: false
querier:
  replicas: 1
query_frontend:
  replicas: 1
query_scheduler:
  enabled: true
  replicas: 2
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi
ruler:
  enabled: false
store_gateway:
  persistentVolume:
    enabled: false
  replicas: 1
Installing Mimir
# Create the namespace and the AWS credentials secret (if it does not exist yet)
kubectl create namespace mimir
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir
# Install Mimir
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --values values-mimir.yaml
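A similar sanity check for Mimir: port-forward the gateway and query the Prometheus-compatible read path, which Mimir serves under the /prometheus prefix (the local port is arbitrary):
kubectl get pods -n mimir
# In a separate terminal:
kubectl port-forward -n mimir svc/mimir-gateway 8080:80
curl -s -H "X-Scope-OrgID: ${TENANT_ID}" \
  'http://localhost:8080/prometheus/api/v1/query?query=up'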
4. Grafana Configuration
# values-grafana.yaml
fullnameOverride: "grafana"
# Image configuration
image:
  repository: grafana/grafana
  tag: "12.0.0"
# Admin credentials using a secret
admin:
  existingSecret: "grafana-admin-credentials"
  userKey: "admin-user"
  passwordKey: "admin-password"
# Service configuration
service:
  type: ClusterIP
  port: 80
# Persistence
persistence:
  enabled: true
  size: 5Gi
# Database configuration using secrets
env:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: ${DATABASE_HOST}
  GF_DATABASE_PORT: "5432"
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: ${DATABASE_USER}
  GF_DATABASE_PASSWORD: ${DATABASE_PASSWORD}
  GF_DATABASE_SSL_MODE: disable
  GF_LOG_LEVEL: warn
# Grafana configuration
grafana.ini:
  server:
    domain: grafana.${DOMAIN}
    root_url: https://grafana.${DOMAIN}
# Ingress configuration
ingress:
  enabled: true
  ingressClassName: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /api/health
    alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
  hosts:
    - grafana.${DOMAIN}
  path: /
  pathType: Prefix
Installing Grafana
# Create the namespace and the required secrets
kubectl create namespace grafana
kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana
# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --values values-grafana.yaml
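Grafana exposes a lightweight health endpoint at /api/health; a quick check before the ALB goes live (run the port-forward in a separate terminal):
kubectl port-forward -n grafana svc/grafana 3000:80
curl -s http://localhost:3000/api/health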
5. IngressClass Configuration
# ingress-class.yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: alb
spec:
  controller: ingress.k8s.aws/alb
kubectl apply -f ingress-class.yaml
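To confirm the class is registered and ALB provisioning kicked in for the ingresses defined earlier:
kubectl get ingressclass alb
kubectl get ingress -A   # the ADDRESS column shows the ALB DNS name once provisioned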
6. Grafana Agent Configuration (Standalone)
For monitoring resources outside the cluster:
# agent-config.yaml
server:
  log_level: debug
integrations:
  node_exporter:
    enabled: true
    rootfs_path: /
    sysfs_path: /sys
    procfs_path: /proc
    set_collectors:
      - uname
      - cpu
      - loadavg
      - meminfo
      - filesystem
      - netdev
      - diskstats
      - cpufreq
      - os
      - time
      - xfs
      - boottime
      - systemd
      - processes
      - nvme
      - nfs
      - netstat
      - logind
      - stat
      - vmstat
    relabel_configs:
      - action: replace
        replacement: '${INSTANCE_NAME}'
        target_label: instance
      - action: replace
        replacement: '${TENANT_ID}'
        target_label: tenant
metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 30s
    remote_write:
      - url: https://mimir.${DOMAIN}/api/v1/push
        headers:
          X-Scope-OrgID: ${TENANT_ID}
  configs:
    - name: default
      scrape_configs:
        - job_name: agent
          static_configs:
            - targets: [ '127.0.0.1:9090' ]
logs:
  configs:
    - name: default
      clients:
        - url: https://loki.${DOMAIN}/loki/api/v1/push
          headers:
            X-Scope-OrgID: ${TENANT_ID}
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: docker
          docker_sd_configs:
            - host: "unix:///var/run/docker.sock"
              refresh_interval: 30s
          relabel_configs:
            - source_labels: [ '__meta_docker_container_name' ]
              target_label: 'container'
            - source_labels: [ '__meta_docker_container_name' ]
              target_label: 'service_name'
            - source_labels: [ '__meta_docker_container_log_stream' ]
              target_label: 'stream'
            - action: replace
              replacement: '${INSTANCE_NAME}'
              target_label: instance
            - action: replace
              replacement: '${TENANT_ID}'
              target_label: tenant
        - job_name: syslog
          static_configs:
            - targets: [ 'localhost' ]
              labels:
                job: syslog
                tenant: ${TENANT_ID}
                # A __path__ label is required for file tailing; adjust to the host's syslog location
                __path__: /var/log/syslog
          relabel_configs:
            - action: replace
              target_label: instance
              replacement: '${INSTANCE_NAME}'
            - action: replace
              target_label: tenant
              replacement: '${TENANT_ID}'
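To run this on a VM, one option is the static-mode Grafana Agent binary with environment expansion enabled, so the ${...} placeholders above are filled in at load time; the variable values below are illustrative:
# Export the values referenced by ${...} placeholders in agent-config.yaml
export INSTANCE_NAME="vm-01"
export TENANT_ID="company"
# -config.expand-env substitutes environment variables into the config at load time
grafana-agent -config.file=agent-config.yaml -config.expand-env=true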
7. Full Deployment Script
#!/bin/bash
# Environment variables
export CLUSTER_NAME="production-cluster"
export TENANT_ID="company"
export DOMAIN="monitoring.company.com"
export ALB_GROUP_NAME="monitoring-alb"
export TLS_CERTIFICATE_ARN="arn:aws:acm:region:account:certificate/cert-id"
export AWS_REGION="us-east-1"
export LOKI_S3_BUCKET="company-loki-data"
export MIMIR_BLOCKS_BUCKET="company-mimir-blocks"
export MIMIR_ALERTS_BUCKET="company-mimir-alerts"
export MIMIR_RULES_BUCKET="company-mimir-rules"
export DATABASE_HOST="postgres.company.internal"
export DATABASE_USER="grafana"
export DATABASE_PASSWORD="secure-password"
export GRAFANA_ADMIN_USER="admin"
export GRAFANA_ADMIN_PASSWORD="admin-password"
# Add Helm repositories
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create namespaces
kubectl create namespace monitoring
kubectl create namespace loki
kubectl create namespace mimir
kubectl create namespace grafana
# Create secrets
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace loki
kubectl create secret generic aws-credentials \
  --from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
  --from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
  --namespace mimir
kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=${GRAFANA_ADMIN_USER} \
  --from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
  --namespace grafana
# Deploy Loki
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki \
  --values values-loki.yaml
# Deploy Mimir
helm upgrade --install mimir grafana/mimir-distributed \
  --namespace mimir \
  --values values-mimir.yaml
# Deploy Grafana
helm upgrade --install grafana grafana/grafana \
  --namespace grafana \
  --values values-grafana.yaml
# Deploy Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
  --namespace monitoring \
  --values values-alloy.yaml
echo "Deployment complete! Wait a few minutes for all pods to become ready."
8. Verification and Monitoring
Check pod status:
kubectl get pods -n monitoring
kubectl get pods -n loki
kubectl get pods -n mimir
kubectl get pods -n grafana
Check the ingress resources:
kubectl get ingress -A
View component logs (workload names can vary by chart version; confirm them with kubectl get deploy -A):
kubectl logs -f deployment/alloy -n monitoring
kubectl logs -f deployment/loki-loki-distributed-distributor -n loki
kubectl logs -f deployment/mimir-distributor -n mimir
9. Configuring Data Sources in Grafana
After the deployment, configure the following data sources in Grafana:
Loki data source:
- URL: http://loki-loki-distributed-gateway.loki.svc.cluster.local
- HTTP header: X-Scope-OrgID: ${TENANT_ID}
Mimir data source (Prometheus type):
- URL: http://mimir-gateway.mimir.svc.cluster.local/prometheus (Mimir serves its Prometheus-compatible API under the /prometheus prefix)
- HTTP header: X-Scope-OrgID: ${TENANT_ID}
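The same data sources can also be provisioned declaratively instead of through the UI. A minimal sketch in Grafana's standard provisioning format (the tenant value is an assumption; with the Helm chart, this block can live under the datasources: key of values-grafana.yaml):
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-loki-distributed-gateway.loki.svc.cluster.local
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: company   # replace with your TENANT_ID
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir-gateway.mimir.svc.cluster.local/prometheus
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: company   # replace with your TENANT_ID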
10. Production Considerations
Security:
- Use Kubernetes secrets for sensitive credentials
- Configure appropriate RBAC
- Enable TLS on all communication paths
Performance:
- Tune resource limits based on your data volume
- Configure retention appropriate to your needs
- Use suitable storage classes
Backup:
- Configure backups of the data in S3
- Document restore procedures
- Test the backups regularly
Monitoring:
- Configure alerts for component failures (see the rule sketch after this list)
- Monitor resource usage
- Build health-check dashboards
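As a starting point for component-failure alerting, a standard Prometheus-format rule group works once the Mimir ruler (disabled in the values above) is enabled; this sketch assumes rules are uploaded per tenant, e.g. with mimirtool:
# rules.yaml — load with: mimirtool rules load rules.yaml --address=https://mimir.${DOMAIN} --id=${TENANT_ID}
groups:
  - name: observability-stack-health
    rules:
      - alert: ScrapeTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} has been down for 5 minutes"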
Conclusion
This implementation provides an end-to-end observability solution for Kubernetes environments, with multi-tenancy, horizontal scalability, and native AWS integration. Note that several components above run with a single replica, a replication factor of 1, and no persistent volumes; increase the replica counts and enable persistence before relying on the stack as highly available. With those adjustments it is suitable for production and can be customized to specific needs.