Python-T Point

Posted on May 19 • Originally published at pythontpoint.in

⚙️ Monitoring MinIO with Prometheus and Grafana — the right way for production

#cloud #devops #kubernetes #tutorial

A full monitoring setup can still produce zero actionable alerts when its metrics track only resource usage instead of system invariants. The issue isn’t the dashboard; it’s that CPU and memory alone can’t tell you whether your object storage is actually working.

📑 Table of Contents

🔧 Prerequisites — What You Need
📊 Prometheus Setup — Scraping Metrics
🔐 Securing the Scrape
🧠 Understanding Metric Cardinality
🎨 Grafana Dashboard — Turning Data into Insight
📈 Key Visualizations to Add
⚠️ Avoiding Dashboard Overload
🚦 Alerting — Preventing Outages
🟩 Final Thoughts
❓ Frequently Asked Questions
Can I monitor standalone MinIO instances?
How often does MinIO emit metrics?
Does monitoring impact MinIO performance?
📚 References & Further Reading

🔧 Prerequisites — What You Need

You need four components to monitor MinIO with Prometheus and Grafana: a running MinIO tenant, a Prometheus server, a Grafana instance, and network connectivity between them. MinIO exposes metrics via its built-in Prometheus endpoint at /minio/v2/metrics/cluster. This endpoint emits service-level indicators (SLIs) like minio_bucket_objects_total, minio_disk_usage, and minio_s3_requests_duration_seconds. These are not host-level metrics; they reflect object storage behavior across the entire tenant. Exact metric names differ between MinIO releases, so check the names used in this article against what your endpoint actually returns.

Ensure your MinIO deployment is in distributed mode (at least 4 nodes) and running a recent version (RELEASE.-xx-xx or later). Older versions lack critical instrumentation for cluster-wide metrics.

Verify the metrics endpoint is accessible:

$ curl -s http://minio-tenant:9000/minio/v2/metrics/cluster | head -5
# HELP minio_bucket_objects_total Total number of objects in a bucket
# TYPE minio_bucket_objects_total gauge
minio_bucket_objects_total{bucket="logs"} 24892
minio_bucket_objects_total{bucket="backups"} 512
# HELP minio_disk_usage Total disk usage in bytes
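The format of those lines is simple enough to read programmatically. As a minimal sketch, assuming only the sample lines shown above, the following Python parses them into name, labels, and value. The helper name parse_metrics is hypothetical, and the parser is deliberately simplified: it skips lines that carry timestamps and does not handle escaped quotes in label values.

```python
import re

# Sample copied from the curl output above; a real scrape returns many more lines.
SAMPLE = """\
# HELP minio_bucket_objects_total Total number of objects in a bucket
# TYPE minio_bucket_objects_total gauge
minio_bucket_objects_total{bucket="logs"} 24892
minio_bucket_objects_total{bucket="backups"} 512
"""

# Metric name, optional {labels}, then a single value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no sample
        match = LINE_RE.match(line)
        if match is None:
            continue  # a line shape this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

For anything beyond a quick check, a maintained parser is a better choice than hand-written regular expressions.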

If you see metric lines, the endpoint is live. If you get a 401 or 403 instead, the request was not authenticated. By default the endpoint requires a bearer token (a JWT) rather than HTTP basic auth; running mc admin prometheus generate against your MinIO alias prints a scrape configuration that includes one. Prometheus must present that token in the scrape job.

📊 Prometheus Setup — Scraping Metrics

Prometheus must be configured to scrape MinIO’s cluster metrics endpoint every 30 seconds, using secure credentials and relabeling that sets stable instance and job labels. Here’s the scrape job configuration for prometheus.yml:

scrape_configs:
  - job_name: 'minio-cluster'
    scrape_interval: 30s
    metrics_path: /minio/v2/metrics/cluster
    static_configs:
      - targets: ['minio-tenant-1.example.com:9000']
    # Token printed by `mc admin prometheus generate <alias>`
    bearer_token: 'your-minio-prometheus-token'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - target_label: job
        replacement: minio_cluster

This job scrapes the /minio/v2/metrics/cluster path, which aggregates metrics across all nodes in the tenant. That’s key: you’re not scraping individual nodes, but the cluster view, avoiding duplication and gaps. Prometheus uses HTTP polling: every 30 seconds, it makes a GET request, receives the plain-text exposition format, and parses it into time series. Each sample gets a timestamp and is stored in Prometheus’s local TSDB using a write-optimized structure (a write-ahead log plus memory-mapped chunks) that is later compacted into blocks. This design minimizes disk seeks at write time at the cost of background compaction.

Reload or restart Prometheus so it picks up the new job:

$ sudo systemctl reload prometheus
# OR if using Docker:
$ docker restart prometheus

Verify the target is up in the Prometheus web UI at http://prometheus:9090/targets. You should see minio-cluster with state "UP". Query a sample metric:

$ curl -sG http://prometheus:9090/api/v1/query \
    --data-urlencode 'query=minio_bucket_objects_total' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "minio_bucket_objects_total",
          "bucket": "logs",
          "instance": "minio-tenant-1.example.com:9000",
          "job": "minio_cluster"
        },
        "value": [1700000000, "24892"]
      }
    ]
  }
}

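The value array contains [timestamp, string_value]. The timestamp is a Unix time in seconds. Prometheus stores all sample values as float64 internally and serializes every one of them as a JSON string, which keeps values such as NaN and +Inf representable.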
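Because sample values arrive as JSON strings, a client has to convert them before doing arithmetic. A minimal sketch of that step, using a response body shaped like the one above; the helper name extract_samples is hypothetical:

```python
import json

# Response body shaped like the instant-vector example above.
RESPONSE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "minio_bucket_objects_total", "bucket": "logs",
              "instance": "minio-tenant-1.example.com:9000", "job": "minio_cluster"},
   "value": [1700000000, "24892"]}
]}}
"""


def extract_samples(body):
    """Return (labels, timestamp, value) tuples from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        timestamp, raw_value = series["value"]
        # float() also accepts the "NaN", "+Inf" and "-Inf" strings Prometheus can return.
        rows.append((series["metric"], float(timestamp), float(raw_value)))
    return rows


rows = extract_samples(RESPONSE)
for labels, timestamp, value in rows:
    print(labels["bucket"], timestamp, value)
```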

🔐 Securing the Scrape

Never expose MinIO’s admin port publicly. Use either:

Mutual TLS (mTLS) between Prometheus and MinIO
Or a sidecar reverse proxy with IP filtering

For mTLS, generate client certs and update the scrape config:

# Inside the minio-cluster scrape job
scheme: https
tls_config:
  ca_file: /etc/prometheus/minio-ca.crt
  cert_file: /etc/prometheus/prom-client.crt
  key_file: /etc/prometheus/prom-client.key
  insecure_skip_verify: false

This ensures authentication and encryption at the transport layer — preventing credential leakage and tampering.
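The same settings apply to any other client that talks to the endpoint, for example a health-check script. As a rough sketch using Python's standard ssl module, the hypothetical helper below builds a context that verifies the server certificate and, when given a certificate and key, presents them as the client identity. The file paths are whatever you generated; none are assumed to exist here.

```python
import ssl


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS client context with verification on, optionally with a client cert."""
    # create_default_context turns on certificate verification and hostname checks,
    # the equivalent of insecure_skip_verify: false.
    ctx = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        # Present a client certificate so the server can authenticate this client.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


# With no paths given, the context uses the system CA store and no client certificate.
ctx = build_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Passing the context to urllib.request.urlopen(url, context=ctx) would then enforce the same checks on each request.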

🧠 Understanding Metric Cardinality

MinIO metrics include labels like bucket, node, and operation. High cardinality (e.g., thousands of buckets) can explode Prometheus memory usage. Monitor prometheus_tsdb_head_series — if it grows beyond 10M series, consider:

Aggregating metrics in Grafana (e.g., sum by (operation))
Or using recording rules to pre-aggregate

Example recording rule:

groups:
  - name: minio-aggregated
    rules:
      - record: job:minio_bucket_objects_total:sum
        expr: sum by (job) (minio_bucket_objects_total)

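This pre-sums object counts per job, so dashboards and alerts read one series instead of one per bucket, which lowers query load. The raw per-bucket series stay in the TSDB unless you also drop them at scrape time with metric_relabel_configs, so the rule on its own does not shrink prometheus_tsdb_head_series.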
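To make the aggregation concrete, here is a small Python sketch of what sum by (job) does to a set of labeled samples. The sample values reuse the bucket counts from the endpoint output; the helper name sum_by is hypothetical and only mimics the PromQL operator.

```python
from collections import defaultdict

# Instant-vector samples as (labels, value); values taken from the example output.
samples = [
    ({"job": "minio_cluster", "bucket": "logs"}, 24892.0),
    ({"job": "minio_cluster", "bucket": "backups"}, 512.0),
]


def sum_by(samples, keep):
    """Mimic PromQL `sum by (<keep>)`: drop all other labels, add values per group."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return dict(groups)


aggregated = sum_by(samples, keep=("job",))
print(len(samples), "series in,", len(aggregated), "series out")
print(aggregated)
```

Two per-bucket series collapse into one per-job series; with thousands of buckets the reduction on the query side is proportionally larger.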

“Monitoring MinIO with Prometheus and Grafana isn’t about collecting data — it’s about isolating failure modes before they isolate you.”

🎨 Grafana Dashboard — Turning Data into Insight

A Grafana dashboard should answer: Is my MinIO tenant healthy? Are objects being written and read reliably? Is erasure coding balanced? Start by adding Prometheus as a data source in Grafana. Then import MinIO’s official dashboard (ID: 18085) from Grafana.com:

$ curl -o minio-dashboard.json \
    https://grafana.com/api/dashboards/18085/revisions/1/download

Then import via UI or API. The dashboard shows:

Bucket object counts and growth rate
S3 request rates and error ratios
Disk usage and free space per node
Replication and healing queue depths

Under the hood, Grafana runs PromQL queries every 30 seconds. For example, object growth uses:

sum(deriv(minio_bucket_objects_total[5m]))

deriv() estimates the per-second change over a 5-minute window, then sum() aggregates across all buckets. deriv() is the right function here because minio_bucket_objects_total is a gauge (see its # TYPE line in the endpoint output): it falls when objects are deleted. rate() is meant for counters, which only increase, and would misread every decrease as a counter reset.
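How a per-second growth figure comes out of raw samples can be sketched in a few lines of Python. This sketch assumes the gauge typing shown in the endpoint output, so it fits a least-squares slope through the window, which is the approach PromQL's deriv() takes for gauges. The sample values and the helper name per_second_slope are made up for illustration.

```python
def per_second_slope(samples):
    """Least-squares slope of (timestamp_seconds, value) pairs, in units per second."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share one timestamp")
    return covariance / variance


# Hypothetical 30-second scrapes: the bucket gains 60 objects between scrapes.
window = [(0, 24892), (30, 24952), (60, 25012), (90, 25072)]
growth = per_second_slope(window)
print(growth)
```

A negative result simply means objects are being deleted faster than they are written, which a counter-oriented function cannot express.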

📈 Key Visualizations to Add

The default dashboard is good, but production needs deeper insight. Add these panels:

1. Erasure Set Imbalance:

max by (set) (minio_erasure_set_drives_online) / on(set) group_left max by (set) (minio_erasure_set_drives_total)

This shows the ratio of online drives per erasure set. Below 1.0 means degraded performance due to missing or failed drives.

2. Healing Queue Lag:

max(minio_healing_queue_length)

If this is >0 for more than 10 minutes, background healing is falling behind, which could indicate disk failures or sustained I/O pressure.

3. S3 Error Rate:

sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m])) / sum(rate(minio_s3_requests_duration_seconds_count[5m]))

This computes the HTTP 5xx error ratio over a 5-minute sliding window. Values above 1% indicate potential service degradation.

⚠️ Avoiding Dashboard Overload

Don’t add every metric. Focus on SLO-relevant signals:

Object durability (replication/healing)
Read/write availability (error rates)
Capacity planning (growth trends)

Too many graphs create noise. A clean dashboard with 6-8 panels is better than 50.

🚦 Alerting — Preventing Outages

Alerts must be specific, actionable, and based on symptoms rather than raw resource thresholds. Monitoring MinIO with Prometheus and Grafana means alerting on what users experience, not just what the system reports. Use Prometheus alerting rules in a dedicated file:

groups:
  - name: minio-alerts
    rules:
      - alert: MinIOHighS3ErrorRate
        expr: |
          sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m]))
            /
          sum(rate(minio_s3_requests_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High S3 error rate on MinIO"
          description: "Error rate is {{ $value }} over 5m"
      - alert: MinIOErasureSetDegraded
        expr: minio_erasure_set_drives_online < minio_erasure_set_drives_total
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Erasure set partially offline"
          description: "One or more drives offline for over 10m"
      - alert: MinIODiskAlmostFull
        expr: minio_disk_usage / minio_disk_total > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "MinIO disk usage >85%"
          description: "Disk {{ $labels.instance }} is running out of space"

These alerts trigger only after sustained conditions (for:), preventing flapping. Prometheus sends alerts to Alertmanager, which deduplicates, groups, and routes them via email, Slack, or PagerDuty. Monitoring MinIO with Prometheus and Grafana turns reactive firefighting into proactive resilience.
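The effect of for: is easiest to see as a small state machine. The Python sketch below is a simplified model, not Prometheus's implementation: it assumes a 1-minute evaluation interval, expresses for: 5m as five evaluations, and uses made-up error ratios. An alert stays pending while the condition holds and only fires once it has held for the full duration; a single healthy evaluation resets it.

```python
def alert_states(condition_by_eval, for_evals):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    for_evals is the `for:` duration expressed in evaluation intervals.
    """
    states = []
    held = None  # evaluations elapsed since the condition first became true
    for condition in condition_by_eval:
        if not condition:
            held = None  # any healthy evaluation resets the pending timer
            states.append("inactive")
            continue
        held = 0 if held is None else held + 1
        states.append("firing" if held >= for_evals else "pending")
    return states


# Hypothetical error ratios, one per 1-minute evaluation: a short blip, then a sustained breach.
ratios = [0.02, 0.03, 0.004, 0.02, 0.02, 0.03, 0.02, 0.05, 0.02, 0.02]
history = [ratio > 0.01 for ratio in ratios]
states = alert_states(history, for_evals=5)
for minute, (ratio, state) in enumerate(zip(ratios, states)):
    print(minute, ratio, state)
```

The two-minute blip at the start never pages anyone, while the sustained breach fires five minutes after it begins.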

🟩 Final Thoughts

Monitoring MinIO with Prometheus and Grafana isn’t just a DevOps checkbox; it’s how you prove your object storage is reliable. Metrics like bucket growth, healing queues, and S3 error rates expose issues long before users notice. The system doesn’t just react; it anticipates.

Too many teams treat monitoring as a sidecar, something added after the fact. But in distributed systems, observability is part of the design. You wouldn’t deploy a database without backups; don’t deploy MinIO without instrumentation.

The real win isn’t the dashboard. It’s knowing, at any moment, whether your data is safe, accessible, and consistent, because the metrics say so.

❓ Frequently Asked Questions

Can I monitor standalone MinIO instances?

Yes. The v2 metrics API also serves single-node deployments: /minio/v2/metrics/cluster still responds, and /minio/v2/metrics/node returns per-node metrics. With a single node there is simply no tenant-wide aggregation to gain.

How often does MinIO emit metrics?

MinIO updates metrics every 5 seconds in memory. Prometheus typically scrapes every 30s, so there’s no data loss. The values are gauges and counters, not sampled.

Does monitoring impact MinIO performance?

Negligibly. The metrics endpoint reads from in-memory counters — no disk I/O or locking. Even under heavy load, response time is under 10ms. Scrape every 30s to minimize overhead.

📚 References & Further Reading

MinIO Monitoring Guide — official documentation on metrics, alerts, and dashboards: docs.min.io
Prometheus Configuration — detailed syntax for scrape jobs, relabeling, and TLS: prometheus.io
Grafana Dashboard Best Practices — how to build effective, maintainable dashboards: grafana.com
