Posted on • Originally published at yepchaos.com

Monitoring & Observability

Metrics tell us something is wrong. Logs tell us why. We need both. This post covers how I set up the full observability stack for ASTRING — Prometheus and Grafana for metrics, Fluent Bit and Loki for logs, and Alertmanager for alerting.

Metrics: Prometheus and Grafana

Why Prometheus

Prometheus is the standard for Kubernetes monitoring. It scrapes /metrics endpoints from our services and stores everything as time series data. PromQL lets us query and aggregate across that data. It also handles alerting rules, which I'll get to later.
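To give a feel for PromQL, here are two typical queries. The first uses the standard cAdvisor metric that kube-prometheus-stack scrapes out of the box; the second assumes a service exposes a hypothetical `http_request_duration_seconds` histogram — substitute whatever your own services export.

```promql
# Per-namespace CPU usage, averaged over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# 95th-percentile request latency — assumes your service exposes a
# histogram metric named http_request_duration_seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```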

The easiest way to get the full stack running on Kubernetes is kube-prometheus-stack — it bundles Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alerting rules for Kubernetes components.

Installing kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --values values.yaml

A minimal values.yaml to get started:

grafana:
  adminUser: admin
  adminPassword: your_password
  service:
    type: NodePort
    nodePort: 30000

Once it's running, Grafana comes with pre-built dashboards for cluster health, node resource usage, pod performance, and Kubernetes component metrics. I actively use these — mostly for checking memory and CPU trends across the cluster.

Logs: Why Not ELK

The standard alternative to what I'm using is the ELK stack — Elasticsearch, Logstash, Kibana. It's powerful but heavy. Elasticsearch automatically creates indexes for everything it ingests, which means significant memory and CPU overhead even at low log volumes. On a 3-node cluster with limited resources, running Elasticsearch alongside everything else didn't make sense. It also adds Kibana as a separate UI, which means maintaining two dashboards.

Loki takes a different approach — it indexes only metadata (labels like pod name, namespace, container), not the full log content. The logs themselves are stored compressed in object storage. This makes it much lighter to run and cheaper to store. Since it's built by Grafana Labs, it integrates directly into Grafana as a data source — same dashboard for metrics and logs.
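In practice that means a LogQL query selects streams by their indexed labels first, then filters the raw content at query time. A sketch, with hypothetical `namespace` and `app` label values:

```logql
# Stream selector uses indexed labels; |= greps the (unindexed)
# log content at query time
{namespace="production", app="api"} |= "error"
```

Because only the labels are indexed, adding a content filter costs query time rather than ingest-time index space — which is exactly the trade-off that makes Loki cheap to run.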

Fluent Bit runs as a DaemonSet on every node, tails container log files, and ships them to Loki. It's lightweight by design, built for high-throughput log forwarding without consuming much memory.

Setting Up Loki

Loki stores logs in S3-compatible object storage — I use Cloudflare R2.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --create-namespace \
  -f values.yaml

The important parts of values.yaml:

deploymentMode: SingleBinary

loki:
  commonConfig:
    replication_factor: 1
  ingester:
    chunk_encoding: snappy
  querier:
    max_concurrent: 2
  schemaConfig:
    configs:
      - from: "2024-06-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    bucketNames:
      admin: <bucket_name>
      chunks: <bucket_name>
      ruler: <bucket_name>
    s3:
      accessKeyId: <access_key>
      secretAccessKey: <secret_key>
      s3: s3://<access_key>:<secret_key>@<r2_endpoint>/<bucket_name>
      s3ForcePathStyle: false
      insecure: false
    type: s3

singleBinary:
  replicas: 1
  resources:
    limits:
      cpu: 3
      memory: 3Gi
    requests:
      cpu: 2
      memory: 1Gi
  extraEnv:
    - name: GOMEMLIMIT
      value: 2750MiB

minio:
  enabled: false

SingleBinary mode runs all of Loki's components in a single pod — suitable for a small cluster, though it doesn't scale horizontally. chunk_encoding: snappy compresses chunks before they're written to R2, which reduces storage costs. GOMEMLIMIT sets a soft memory limit for the Go runtime, so the garbage collector works harder as the process approaches 2750MiB instead of blowing past the pod's 3Gi limit and getting OOM-killed — the same class of problem as GOMAXPROCS, but for memory.

Setting Up Fluent Bit

Fluent Bit runs as a DaemonSet — one pod per node, tailing all container logs at /var/log/containers/*.log and forwarding to Loki.

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  -f values.yaml

The important parts of values.yaml:

args:
  - -e
  - /fluent-bit/bin/out_grafana_loki.so
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.conf

config:
  inputs: |
    [INPUT]
        Name tail
        Tag kube.*
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On

  outputs: |
    [OUTPUT]
        Name grafana-loki
        Match kube.*
        Url ${FLUENT_LOKI_URL}
        TenantID foo
        Labels {job="fluent-bit"}
        LabelKeys level,app
        BatchWait 1
        BatchSize 1001024
        LineFormat json
        LogLevel info
        AutoKubernetesLabels true

env:
  - name: FLUENT_LOKI_URL
    value: http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push

image:
  repository: grafana/fluent-bit-plugin-loki
  tag: main-e2ed1c0

AutoKubernetesLabels true automatically attaches Kubernetes metadata (pod name, namespace, container name) as Loki labels, which makes filtering logs in Grafana much more useful. LabelKeys level,app promotes those two specific fields from each log record into Loki stream labels; everything else becomes structured metadata.
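The distinction shows up directly in how you query. Promoted fields go in the stream selector; fields left as structured metadata are matched after the pipe. A sketch, with hypothetical values:

```logql
# level and app are stream labels, so they can be used in the selector
{app="api", level="error"}

# fields kept as structured metadata are filtered after the pipe,
# without creating extra streams (trace_id is a hypothetical field)
{job="fluent-bit"} | trace_id="abc123"
```

Keeping high-cardinality fields like trace IDs out of the stream labels matters: every unique label combination becomes a separate stream in Loki, and too many streams hurts both ingestion and query performance.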

Connecting Loki to Grafana

In Grafana, add Loki as a data source:

  1. Go to Configuration → Data Sources → Add data source
  2. Select Loki
  3. Set URL to http://loki-gateway.logging.svc.cluster.local
  4. Click Save & Test

Now logs are queryable in the Explore tab using LogQL, and we can build dashboards that combine metrics from Prometheus and logs from Loki in the same view.

Current State

The full observability stack running on the cluster:

  • Prometheus — scraping metrics from all services and Kubernetes components
  • Grafana — dashboards for cluster health, pod performance, and logs
  • Alertmanager — firing alerts to Telegram on pod crashes, high memory, and disk usage
  • Loki — storing logs in Cloudflare R2
  • Fluent Bit — collecting and forwarding logs from every node

I actively use Grafana for both metrics and logs. When something goes wrong, Telegram fires first, then I open Grafana to dig into what happened.
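The Telegram side is handled by Alertmanager's built-in telegram receiver (available since Alertmanager 0.24). A sketch of the relevant Alertmanager configuration — the bot token and chat ID are placeholders for your own bot's credentials:

```yaml
route:
  receiver: telegram

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: <telegram_bot_token>
        chat_id: -1001234567890   # placeholder group chat ID
        parse_mode: HTML
```

With kube-prometheus-stack, this goes under the alertmanager.config section of the chart's values.yaml, and the bundled Kubernetes alerting rules (pod crash loops, memory pressure, disk usage) fire into it without further setup.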
