Grafana K8s Stack Implementation on Kubernetes Cluster
Introduction
This article describes the complete implementation of an observability stack based on the Grafana ecosystem in a Kubernetes cluster. The stack includes:
- Grafana Alloy: Metrics and logs collection agent
- Loki: Log aggregation system
- Mimir: Long-term metrics storage system
- Grafana: Visualization interface and dashboards
Solution Architecture
The solution uses a distributed architecture where:
- Alloy collects metrics and logs from the cluster
- Loki stores and indexes logs with configurable retention
- Mimir stores long-term metrics
- Grafana provides the visualization interface
- AWS Load Balancer Controller manages ingress
Prerequisites
- Running Kubernetes cluster
- Helm 3.x installed
- AWS Load Balancer Controller configured
- Valid TLS certificates
- S3 buckets configured for storage
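Before installing anything, it is worth confirming each prerequisite above. A quick sanity check (the bucket variables are the same ones exported in the deployment script of section 7):
# Verify cluster access and Helm 3.x
kubectl cluster-info
helm version --short
# Confirm the AWS Load Balancer Controller is running
kubectl get deployment -n kube-system aws-load-balancer-controller
# Confirm the S3 buckets exist
aws s3 ls "s3://${LOKI_S3_BUCKET}"
aws s3 ls "s3://${MIMIR_BLOCKS_BUCKET}"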
1. Grafana Alloy Configuration
Alloy is responsible for collecting metrics and logs from the Kubernetes cluster.
# values-alloy.yaml
alloy-logs:
config:
client:
timeout: 30s
resources:
limits:
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
enabled: true
alloy-metrics:
config:
global:
external_labels:
cluster: ${CLUSTER_NAME}
source: alloy-agent
scrape_interval: 3m
scrape_timeout: 2m30s
remote_write:
- headers:
X-Scope-OrgID: ${TENANT_ID}
queue_config:
batch_send_deadline: 30s
max_backoff: 10s
max_retries: 10
max_samples_per_send: 1000
min_backoff: 100ms
retry_on_http_429: true
send_exemplars: true
url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
enabled: true
resources:
limits:
memory: 768Mi
requests:
cpu: 250m
memory: 384Mi
cluster:
name: ${CLUSTER_NAME}
clusterEvents:
enabled: true
scrapeInterval: 3m
clusterMetrics:
cadvisor:
enabled: true
metricRelabelings:
- action: keep
regex: container_cpu_.*|container_memory_.*|container_network_.*|container_fs_.*|machine_cpu_.*|container_spec_.*
sourceLabels:
- __name__
relabelings:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
enabled: true
kubelet:
enabled: true
nodeExporter:
enabled: true
externalServices:
loki:
externalLabels:
cluster: ${CLUSTER_NAME}
tenant: ${TENANT_ID}
host: http://loki-loki-distributed-gateway.loki.svc.cluster.local
tenantId: ${TENANT_ID}
writeEndpoint: /loki/api/v1/push
prometheus:
extraHeaders:
X-Scope-OrgID: ${TENANT_ID}
host: http://mimir-gateway.mimir.svc.cluster.local
tenantId: ${TENANT_ID}
writeEndpoint: /api/v1/push
extraScrapeConfigs: |-
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
- source_labels: [__meta_kubernetes_pod_container_name]
action: replace
target_label: container
scrape_interval: 3m
scrape_timeout: 2m30s
- job_name: 'kubernetes-cadvisor'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
metric_relabel_configs:
- source_labels: [__name__]
regex: 'container_cpu_.*|container_memory_.*|machine_cpu_.*|container_fs_.*|container_network_.*|container_spec_.*'
action: keep
scrape_interval: 3m
scrape_timeout: 2m30s
kube-state-metrics:
enabled: true
metricLabelsAllowlist:
- pods=[app,name,component]
- deployments=[app,name,component]
- statefulsets=[app,name,component]
resources:
limits:
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
logs:
enabled: true
extraClientConfig: |
client:
timeout: 10s
batchwait: 1s
batchsize: 1048576
max_retries: 5
retry_on_status_codes: [429, 500, 502, 503, 504]
backoff_config:
min_period: 500ms
max_period: 5m
max_retries: 10
pod_logs:
discovery: all
enabled: true
namespaceAllowlist:
- .*
scrapeInterval: 3m
stages:
- json:
expressions:
level: level
message: msg
timestamp: time
- timestamp:
format: RFC3339Nano
source: timestamp
- labels:
level: null
metrics:
enabled: true
scrapeInterval: 3m
opencost:
enabled: false
prometheus-node-exporter:
enabled: true
traces:
enabled: false
Alloy Installation
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create namespace
kubectl create namespace monitoring
# Install Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
--namespace monitoring \
--values values-alloy.yaml \
--set cluster.name=${CLUSTER_NAME} \
--set externalServices.loki.tenantId=${TENANT_ID} \
--set externalServices.prometheus.tenantId=${TENANT_ID}
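After installation, Grafana Alloy exposes a built-in UI on port 12345 that lists every component and its health, which is the quickest way to confirm that discovery and remote_write are working. A sketch (the label selector is an assumption and may differ between chart versions):
# Check that the Alloy pods are running
kubectl get pods -n monitoring -l app.kubernetes.io/instance=alloy
# Port-forward to one pod and open http://localhost:12345
kubectl port-forward -n monitoring \
  $(kubectl get pod -n monitoring -l app.kubernetes.io/instance=alloy -o name | head -n1) \
  12345:12345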
2. Loki Configuration
Loki is responsible for log aggregation and storage.
# values-loki.yaml
nameOverride: "loki-distributed"
compactor:
enabled: true
persistence:
enabled: false
replicas: 1
distributor:
maxUnavailable: 1
replicas: 2
extraEnvVars:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access_key_id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret_access_key
gateway:
enabled: true
ingress:
annotations:
alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
alb.ingress.kubernetes.io/healthcheck-path: /ready
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/ssl-redirect: "443"
alb.ingress.kubernetes.io/target-type: ip
kubernetes.io/ingress.class: alb
enabled: true
hosts:
- host: loki.${DOMAIN}
paths:
- path: /
pathType: Prefix
ingressClassName: alb
replicas: 1
service:
port: 80
type: ClusterIP
ingester:
maxUnavailable: 1
persistence:
enabled: false
replicas: 2
loki:
commonConfig:
path_prefix: /var/loki
replication_factor: 1
structuredConfig:
auth_enabled: true
common:
storage:
filesystem:
chunks_directory: /var/loki/chunks
rules_directory: /var/loki/rules
limits_config:
ingestion_burst_size_mb: 6
ingestion_rate_mb: 4
max_global_streams_per_user: 15000
max_query_series: 500
reject_old_samples: false
reject_old_samples_max_age: 168h
retention_period: 24h
volume_enabled: true
compactor:
working_directory: /var/loki/compactor
shared_store: s3
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 1h
retention_delete_worker_count: 150
memberlist:
join_members:
- loki-loki-distributed-memberlist.loki.svc.cluster.local:7946
schema_config:
configs:
- from: "2020-10-24"
index:
period: 24h
prefix: index_
object_store: aws
schema: v11
store: boltdb-shipper
server:
grpc_listen_port: 9095
http_listen_port: 3100
storage_config:
aws:
access_key_id: ${AWS_ACCESS_KEY_ID}
bucketnames: ${LOKI_S3_BUCKET}
region: ${AWS_REGION}
s3: s3://s3.amazonaws.com
secret_access_key: ${AWS_SECRET_ACCESS_KEY}
boltdb_shipper:
active_index_directory: /var/loki/index
cache_location: /var/loki/boltdb-cache
shared_store: s3
minio:
enabled: false
nginx:
enabled: false
querier:
replicas: 1
query_frontend:
maxUnavailable: 0
replicas: 1
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 256Mi
ruler:
enabled: false
singleBinary:
enabled: false
Loki Installation
# Create secret for AWS credentials
kubectl create secret generic aws-credentials \
--from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
--from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
--namespace loki
# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
--namespace loki \
--create-namespace \
--values values-loki.yaml
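Because auth_enabled: true turns on multi-tenancy, every request to Loki must carry the X-Scope-OrgID header. A smoke test from inside the cluster, pushing one line through the gateway and querying it back (the service name follows the chart naming used above):
# Push a test log line
kubectl run loki-push --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -X POST "http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/push" \
    -H "Content-Type: application/json" \
    -H "X-Scope-OrgID: ${TENANT_ID}" \
    -d '{"streams":[{"stream":{"job":"smoke-test"},"values":[["'"$(date +%s%N)"'","hello loki"]]}]}'
# Query it back
kubectl run loki-query --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -G "http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/query" \
    -H "X-Scope-OrgID: ${TENANT_ID}" \
    --data-urlencode 'query={job="smoke-test"}'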
3. Mimir Configuration
Mimir is responsible for long-term metrics storage.
# values-mimir.yaml
alertmanager:
enabled: false
compactor:
persistentVolume:
enabled: false
replicas: 1
distributor:
replicas: 2
extraEnvVars:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access_key_id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret_access_key
gateway:
enabled: true
enabledNonEnterprise: true
ingress:
annotations:
alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
alb.ingress.kubernetes.io/healthcheck-path: /ready
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/ssl-redirect: "443"
alb.ingress.kubernetes.io/target-type: ip
kubernetes.io/ingress.class: alb
enabled: true
hosts:
- host: mimir.${DOMAIN}
paths:
- path: /
pathType: Prefix
ingressClassName: alb
replicas: 1
service:
port: 80
type: ClusterIP
ingester:
persistentVolume:
enabled: false
replicas: 2
mimir:
structuredConfig:
alertmanager_storage:
backend: s3
s3:
bucket_name: ${MIMIR_ALERTS_BUCKET}
blocks_storage:
backend: s3
bucket_store:
sync_dir: /data/tsdb-sync
s3:
bucket_name: ${MIMIR_BLOCKS_BUCKET}
tsdb:
dir: /data/tsdb
common:
storage:
backend: s3
s3:
access_key_id: ${AWS_ACCESS_KEY_ID}
bucket_name: ${MIMIR_BLOCKS_BUCKET}
endpoint: s3.amazonaws.com
region: ${AWS_REGION}
secret_access_key: ${AWS_SECRET_ACCESS_KEY}
frontend:
log_queries_longer_than: 10s
limits:
ingestion_burst_size: 150000
ingestion_rate: 50000
max_global_series_per_metric: 150000
max_global_series_per_user: 2000000
memberlist:
abort_if_cluster_join_fails: false
join_members:
- mimir-gossip-ring.mimir.svc.cluster.local:7946
multitenancy_enabled: true
ruler_storage:
backend: s3
s3:
bucket_name: ${MIMIR_RULES_BUCKET}
server:
grpc_listen_port: 9095
http_listen_port: 8080
minio:
enabled: false
mode: microservices
nginx:
enabled: false
querier:
replicas: 1
query_frontend:
replicas: 1
query_scheduler:
enabled: true
replicas: 2
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 256Mi
ruler:
enabled: false
store_gateway:
persistentVolume:
enabled: false
replicas: 1
Mimir Installation
# Create secret for AWS credentials (if it does not already exist)
kubectl create secret generic aws-credentials \
--from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
--from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
--namespace mimir
# Install Mimir
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install mimir grafana/mimir-distributed \
--namespace mimir \
--create-namespace \
--values values-mimir.yaml
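Since multitenancy_enabled: true, Mimir likewise requires the X-Scope-OrgID header on every request. The gateway serves the Prometheus-compatible API under the /prometheus prefix, so a quick in-cluster smoke test looks like this:
kubectl run mimir-query --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -G "http://mimir-gateway.mimir.svc.cluster.local/prometheus/api/v1/query" \
    -H "X-Scope-OrgID: ${TENANT_ID}" \
    --data-urlencode 'query=up'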
4. Grafana Configuration
# values-grafana.yaml
fullnameOverride: "grafana"
# Image configuration
image:
repository: grafana/grafana
tag: "12.0.0"
# Admin credentials using secret
admin:
existingSecret: "grafana-admin-credentials"
userKey: "admin-user"
passwordKey: "admin-password"
# Service configuration
service:
type: ClusterIP
port: 80
# Persistence
persistence:
enabled: true
size: 5Gi
# Database configuration using secrets
env:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: ${DATABASE_HOST}
GF_DATABASE_PORT: "5432"
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: ${DATABASE_USER}
GF_DATABASE_PASSWORD: ${DATABASE_PASSWORD}
GF_DATABASE_SSL_MODE: disable
GF_LOG_LEVEL: warn
# Grafana configuration
grafana.ini:
server:
domain: grafana.${DOMAIN}
root_url: https://grafana.${DOMAIN}
# Ingress configuration
ingress:
enabled: true
ingressClassName: alb
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/group.name: ${ALB_GROUP_NAME}
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/healthcheck-path: /api/health
alb.ingress.kubernetes.io/certificate-arn: ${TLS_CERTIFICATE_ARN}
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
alb.ingress.kubernetes.io/ssl-redirect: '443'
hosts:
- grafana.${DOMAIN}
path: /
pathType: Prefix
Grafana Installation
# Create necessary secrets
kubectl create secret generic grafana-admin-credentials \
--from-literal=admin-user=${GRAFANA_ADMIN_USER} \
--from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
--namespace grafana
# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
--namespace grafana \
--create-namespace \
--values values-grafana.yaml
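Before relying on the ALB, a port-forward against /api/health (the same path configured as the ALB health check) confirms that Grafana started and can reach its PostgreSQL database:
kubectl port-forward -n grafana svc/grafana 3000:80 &
sleep 3
curl -s http://localhost:3000/api/health
# A healthy instance reports "database": "ok"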
5. IngressClass Configuration
# ingress-class.yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: alb
spec:
controller: ingress.k8s.aws/alb
kubectl apply -f ingress-class.yaml
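Verify that the class is registered and owned by the AWS Load Balancer Controller:
kubectl get ingressclass alb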
6. Grafana Agent Configuration (Standalone)
For monitoring hosts outside the cluster (virtual machines, bare-metal servers), a standalone Grafana Agent can ship metrics and logs to the same Mimir and Loki endpoints through the public ingress:
# agent-config.yaml
server:
log_level: debug
integrations:
node_exporter:
enabled: true
rootfs_path: /
sysfs_path: /sys
procfs_path: /proc
set_collectors:
- uname
- cpu
- loadavg
- meminfo
- filesystem
- netdev
- diskstats
- cpufreq
- os
- time
- xfs
- boottime
- systemd
- processes
- nvme
- nfs
- netstat
- logind
- stat
- vmstat
relabel_configs:
- action: replace
replacement: '${INSTANCE_NAME}'
target_label: instance
- action: replace
replacement: '${TENANT_ID}'
target_label: tenant
metrics:
wal_directory: /tmp/grafana-agent-wal
global:
scrape_interval: 30s
remote_write:
- url: https://mimir.${DOMAIN}/api/v1/push
headers:
X-Scope-OrgID: ${TENANT_ID}
configs:
- name: default
scrape_configs:
- job_name: agent
static_configs:
- targets: [ '127.0.0.1:9090' ]
logs:
configs:
- name: default
clients:
- url: https://loki.${DOMAIN}/loki/api/v1/push
headers:
X-Scope-OrgID: ${TENANT_ID}
positions:
filename: /tmp/positions.yaml
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: "unix:///var/run/docker.sock"
refresh_interval: 30s
relabel_configs:
- source_labels: [ '__meta_docker_container_name' ]
target_label: 'container'
- source_labels: [ '__meta_docker_container_name' ]
target_label: 'service_name'
- source_labels: [ '__meta_docker_container_log_stream' ]
target_label: 'stream'
- action: replace
replacement: '${INSTANCE_NAME}'
target_label: instance
- action: replace
replacement: '${TENANT_ID}'
target_label: tenant
- job_name: syslog
static_configs:
- targets: [ 'localhost' ]
labels:
job: syslog
tenant: ${TENANT_ID}
# __path__ is required: without it, static_configs tail no files
__path__: /var/log/syslog
relabel_configs:
- action: replace
target_label: instance
replacement: '${INSTANCE_NAME}'
- action: replace
target_label: tenant
replacement: '${TENANT_ID}'
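A minimal sketch for running this configuration on a host with the grafana-agent binary under systemd; the binary path, config path, and environment file are assumptions to adapt. The -config.expand-env flag substitutes ${INSTANCE_NAME} and ${TENANT_ID} from the environment:
# /etc/systemd/system/grafana-agent.service (sketch)
[Unit]
Description=Grafana Agent (static mode)
After=network-online.target

[Service]
# EnvironmentFile provides INSTANCE_NAME and TENANT_ID for -config.expand-env
EnvironmentFile=/etc/grafana-agent/env
ExecStart=/usr/local/bin/grafana-agent \
  -config.file=/etc/grafana-agent/agent-config.yaml \
  -config.expand-env
Restart=always

[Install]
WantedBy=multi-user.target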
7. Complete Deployment Script
#!/bin/bash
# Environment variables
export CLUSTER_NAME="production-cluster"
export TENANT_ID="company"
export DOMAIN="monitoring.company.com"
export ALB_GROUP_NAME="monitoring-alb"
export TLS_CERTIFICATE_ARN="arn:aws:acm:region:account:certificate/cert-id"
export AWS_REGION="us-east-1"
export LOKI_S3_BUCKET="company-loki-data"
export MIMIR_BLOCKS_BUCKET="company-mimir-blocks"
export MIMIR_ALERTS_BUCKET="company-mimir-alerts"
export MIMIR_RULES_BUCKET="company-mimir-rules"
export DATABASE_HOST="postgres.company.internal"
export DATABASE_USER="grafana"
export DATABASE_PASSWORD="secure-password"
export GRAFANA_ADMIN_USER="admin"
export GRAFANA_ADMIN_PASSWORD="admin-password"
# Add Helm repositories
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create namespaces
kubectl create namespace monitoring
kubectl create namespace loki
kubectl create namespace mimir
kubectl create namespace grafana
# Create secrets (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must already be set in the environment)
kubectl create secret generic aws-credentials \
--from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
--from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
--namespace loki
kubectl create secret generic aws-credentials \
--from-literal=access_key_id=${AWS_ACCESS_KEY_ID} \
--from-literal=secret_access_key=${AWS_SECRET_ACCESS_KEY} \
--namespace mimir
kubectl create secret generic grafana-admin-credentials \
--from-literal=admin-user=${GRAFANA_ADMIN_USER} \
--from-literal=admin-password=${GRAFANA_ADMIN_PASSWORD} \
--namespace grafana
# Deploy Loki
helm upgrade --install loki grafana/loki-distributed \
--namespace loki \
--values values-loki.yaml
# Deploy Mimir
helm upgrade --install mimir grafana/mimir-distributed \
--namespace mimir \
--values values-mimir.yaml
# Deploy Grafana
helm upgrade --install grafana grafana/grafana \
--namespace grafana \
--values values-grafana.yaml
# Deploy Alloy
helm upgrade --install alloy grafana/k8s-monitoring \
--namespace monitoring \
--values values-alloy.yaml
echo "Deployment completed! Wait a few minutes for all pods to be ready."
8. Verification and Monitoring
Check pod status:
kubectl get pods -n monitoring
kubectl get pods -n loki
kubectl get pods -n mimir
kubectl get pods -n grafana
Check ingress:
kubectl get ingress -A
Access component logs (Alloy runs as a DaemonSet/StatefulSet, so select it by label; the Loki and Mimir deployment names follow the chart naming):
kubectl logs -f -n monitoring -l app.kubernetes.io/instance=alloy
kubectl logs -f -n loki deployment/loki-loki-distributed-distributor
kubectl logs -f -n mimir deployment/mimir-distributor
9. Grafana Data Sources Configuration
After deployment, configure the following data sources in Grafana:
Loki Data Source:
- Type: Loki
- URL: http://loki-loki-distributed-gateway.loki.svc.cluster.local
- Custom HTTP header: X-Scope-OrgID: ${TENANT_ID}
Mimir Data Source:
- Type: Prometheus
- URL: http://mimir-gateway.mimir.svc.cluster.local/prometheus (Mimir serves the Prometheus API under the /prometheus prefix)
- Custom HTTP header: X-Scope-OrgID: ${TENANT_ID}
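Both data sources can also be provisioned declaratively instead of through the UI, using the Grafana chart's datasources value. A sketch following Grafana's standard data source provisioning format, to append to values-grafana.yaml:
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki-loki-distributed-gateway.loki.svc.cluster.local
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: ${TENANT_ID}
      - name: Mimir
        type: prometheus
        url: http://mimir-gateway.mimir.svc.cluster.local/prometheus
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: ${TENANT_ID}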
10. Production Considerations
Security:
- Use Kubernetes secrets for sensitive credentials
- Configure appropriate RBAC
- Enable TLS for all communications
Performance:
- Adjust resource limits based on data volume
- Configure appropriate retention
- Use suitable storage classes
Backup:
- Configure S3 data backup
- Document restore procedures
- Regularly test backups
Monitoring:
- Configure alerts for component failures
- Monitor resource usage
- Set up health check dashboards (see the starter queries below)
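A few PromQL starting points for those health dashboards and alerts; the metric names below are the standard ones exposed by Prometheus-compatible scrapes, Loki, and Mimir:
# Scrape targets that are currently down
up == 0
# Log lines received by the Loki distributors, per tenant
sum by (tenant) (rate(loki_distributor_lines_received_total[5m]))
# Samples ingested by the Mimir ingesters
sum(rate(cortex_ingester_ingested_samples_total[5m]))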
Conclusion
This implementation provides a complete observability solution for Kubernetes environments, with multi-tenancy and native AWS integration. Note that the example values favor simplicity over resilience (single replicas, replication_factor: 1, persistence disabled in several components), so before depending on the stack in production, increase replica counts, enable persistent volumes, and raise the replication factor to achieve real high availability. Beyond that, it can be customized according to specific needs.
Next Steps:
- Configure custom dashboards
- Implement critical alerts
- Configure retention policies
- Optimize performance based on real metrics