Metrics tell us something is wrong. Logs tell us why. We need both. This post covers how I set up the full observability stack for ASTRING — Prometheus and Grafana for metrics, Fluent Bit and Loki for logs, and Alertmanager for notifications.
Metrics: Prometheus and Grafana
Why Prometheus
Prometheus is the standard for Kubernetes monitoring. It scrapes /metrics endpoints from our services and stores everything as time series data. PromQL lets us query and aggregate across that data. It also handles alerting rules, which I'll get to later.
The easiest way to get the full stack running on Kubernetes is kube-prometheus-stack — it bundles Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alerting rules for Kubernetes components.
Installing kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --values values.yaml
A minimal values.yaml to get started:
grafana:
  adminUser: admin
  adminPassword: your_password
  service:
    type: NodePort
    nodePort: 30000
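To get our own services scraped, the stack's Prometheus Operator watches for ServiceMonitor resources rather than static scrape configs. A minimal sketch — the service name, namespace, and port name here are hypothetical, and the `release: prometheus` label must match your Helm release name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: astring-api              # hypothetical service name
  namespace: prometheus
  labels:
    release: prometheus          # must match the Helm release so the operator picks it up
spec:
  selector:
    matchLabels:
      app: astring-api           # selects the Service to scrape
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http                 # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```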
Once it's running, Grafana comes with pre-built dashboards for cluster health, node resource usage, pod performance, and Kubernetes component metrics. I actively use these — mostly for checking memory and CPU trends across the cluster.
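For that kind of trend-watching, two PromQL queries I treat as starting points — the namespace label is a placeholder:

```promql
# Per-pod CPU usage, averaged over the last 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="astring"}[5m]))

# Per-pod working-set memory (what the OOM killer cares about)
sum by (pod) (container_memory_working_set_bytes{namespace="astring"})
```

Both metrics come from cAdvisor, which kube-prometheus-stack scrapes out of the box.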
Logs: Why Not ELK
The standard alternative to what I'm using is the ELK stack — Elasticsearch, Logstash, Kibana. It's powerful but heavy. Elasticsearch builds a full-text index over everything it ingests, which means significant memory and CPU overhead even at low log volumes. On a 3-node cluster with limited resources, running Elasticsearch alongside everything else didn't make sense. It also adds Kibana as a separate UI, which means maintaining a second dashboarding tool alongside Grafana.
Loki takes a different approach — it indexes only metadata (labels like pod name, namespace, container), not the full log content. The logs themselves are stored compressed in object storage. This makes it much lighter to run and cheaper to store. Since it's built by Grafana Labs, it integrates directly into Grafana as a data source — same dashboard for metrics and logs.
Fluent Bit runs as a DaemonSet on every node, tails container log files, and ships them to Loki. It's lightweight by design, built for high-throughput log forwarding without consuming much memory.
Setting Up Loki
Loki stores logs in S3-compatible object storage — I use Cloudflare R2.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --namespace logging \
  --create-namespace \
  -f values.yaml
The important parts of values.yaml:
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  ingester:
    chunk_encoding: snappy
  querier:
    max_concurrent: 2
  schemaConfig:
    configs:
      - from: "2024-06-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    bucketNames:
      admin: <bucket_name>
      chunks: <bucket_name>
      ruler: <bucket_name>
    s3:
      accessKeyId: <access_key>
      secretAccessKey: <secret_key>
      s3: s3://<access_key>:<secret_key>@<r2_endpoint>/<bucket_name>
      s3ForcePathStyle: false
      insecure: false
    type: s3
singleBinary:
  replicas: 1
  resources:
    limits:
      cpu: 3
      memory: 3Gi
    requests:
      cpu: 2
      memory: 1Gi
  extraEnv:
    - name: GOMEMLIMIT
      value: 2750MiB
minio:
  enabled: false
SingleBinary mode runs all Loki components in one pod — suitable for a small cluster. chunk_encoding: snappy compresses log chunks before they're written to R2, which reduces storage costs. GOMEMLIMIT caps the Go runtime's heap so it stays within the pod's memory limit — the same container-awareness problem GOMAXPROCS solves for CPU, but for memory.
Setting Up Fluent Bit
Fluent Bit runs as a DaemonSet — one pod per node, tailing all container logs at /var/log/containers/*.log and forwarding to Loki.
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  -f values.yaml
The important parts of values.yaml:
args:
  - -e
  - /fluent-bit/bin/out_grafana_loki.so
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.conf
config:
  inputs: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
  outputs: |
    [OUTPUT]
        Name                  grafana-loki
        Match                 kube.*
        Url                   ${FLUENT_LOKI_URL}
        TenantID              foo
        Labels                {job="fluent-bit"}
        LabelKeys             level,app
        BatchWait             1
        BatchSize             1001024
        LineFormat            json
        LogLevel              info
        AutoKubernetesLabels  true
env:
  - name: FLUENT_LOKI_URL
    value: http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push
image:
  repository: grafana/fluent-bit-plugin-loki
  tag: main-e2ed1c0
AutoKubernetesLabels true automatically attaches Kubernetes metadata (pod name, namespace, container name) as Loki labels, which makes filtering logs in Grafana much more useful. LabelKeys level,app promotes those two record fields into Loki stream labels; with LineFormat json, the rest of the record ships inside the JSON log line.
Connecting Loki to Grafana
In Grafana, add Loki as a data source:
- Go to Configuration → Data Sources → Add data source
- Select Loki
- Set URL to http://loki-gateway.logging.svc.cluster.local
- Click Save & Test
Now logs are queryable in the Explore tab using LogQL, and we can build dashboards that combine metrics from Prometheus and logs from Loki in the same view.
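A couple of LogQL queries I use as starting points — the namespace and app labels are placeholders for whatever your own streams carry:

```logql
# All lines containing "error" from one app
{namespace="astring", app="api"} |= "error"

# Error rate per pod over 5 minutes, parsing the JSON log line
sum by (pod) (rate({namespace="astring"} | json | level="error" [5m]))
```

The second query works because Fluent Bit ships the record as JSON (LineFormat json), so `| json` can pull out fields like level at query time.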
Current State
The full observability stack running on the cluster:
- Prometheus — scraping metrics from all services and Kubernetes components
- Grafana — dashboards for cluster health, pod performance, and logs
- Alertmanager — firing alerts to Telegram on pod crashes, high memory, and disk usage
- Loki — storing logs in Cloudflare R2
- Fluent Bit — collecting and forwarding logs from every node
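The Telegram wiring goes through the same kube-prometheus-stack values. A minimal sketch of the receiver — the bot token and chat ID are placeholders you get from Telegram's @BotFather and your group:

```yaml
alertmanager:
  config:
    route:
      receiver: telegram
      group_by: [alertname, namespace]
    receivers:
      - name: telegram
        telegram_configs:
          - bot_token: <bot_token>      # placeholder: token from @BotFather
            chat_id: -1001234567890     # placeholder: target chat/group ID
            parse_mode: HTML
```

Alertmanager has supported telegram_configs natively since v0.24, so no webhook bridge is needed.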
I actively use Grafana for both metrics and logs. When something goes wrong, Telegram fires first, then I open Grafana to dig into what happened.