Daniel Quackenbush

Originally published at danquack.dev

Multi-Cluster Prometheus: Scaling Metrics Across Kubernetes Clusters

Building on Bartłomiej Płotka's insightful blog post about Prometheus and its passthrough agent mode, this post dives into implementing multi-cluster Prometheus support. Notably, the widely used kube-prometheus-stack chart added official support for agent mode with its July 2023 release, making it easier to extend Prometheus monitoring across clusters.

Helm Configuration: Connecting Global and Edge Clusters

To deploy a Prometheus agent in your Kubernetes clusters, install the kube-prometheus-stack chart with Helm, which should already be installed and configured against each cluster.

Global Cluster Configuration

Update and apply global cluster values to enable remote metrics reception:



prometheus:
  prometheusSpec:
    # Accept samples pushed by edge clusters on the /api/v1/write endpoint
    enableRemoteWriteReceiver: true
    enableFeatures:
    - remote-write-receiver


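The ${hostname} referenced by the edge clusters below must resolve to this receiver, so the global Prometheus has to be reachable from outside its cluster. A minimal sketch of one way to do that through the chart's prometheus.ingress values, assuming an ingress controller is already running; the class, hostname, and TLS secret names are hypothetical placeholders:

prometheus:
  ingress:
    enabled: true
    ingressClassName: nginx                  # assumes an NGINX ingress controller
    hosts:
      - ${hostname}                          # DNS name the edge clusters will write to
    paths:
      - /api/v1/write                        # only the remote write endpoint needs exposure
    tls:
      - secretName: prometheus-global-tls    # hypothetical pre-existing TLS secret
        hosts:
          - ${hostname}

Securing this endpoint (for example with mTLS or basic auth at the ingress) is worth considering, since an unauthenticated receiver will accept any samples pushed to it.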

Edge Cluster(s) Configuration

Update and apply edge cluster values to enable remote metric writing:

  • Specify the global cluster's ${hostname} remote write endpoint.
  • Include a scrape configuration to identify pods annotated with prometheus.io/scrape (a sketch of an annotated pod follows the values below).
  • Assign a unique ${name} to distinguish each cluster's metrics.


prometheus:
  agentMode: true
  prometheusSpec:
    remoteWrite:
      # Forward scraped samples to the global cluster's receiver
      - name: ${name}
        url: https://${hostname}/api/v1/write
      # Add additional remote write configurations if needed
    additionalScrapeConfigs:
      - job_name: 'kubernetes'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Tag every sample with the originating cluster name
          - target_label: cluster
            replacement: ${name}


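For the keep rule above to pick up a workload, its pods just need the prometheus.io/scrape annotation on the pod template. A minimal sketch of an annotated deployment, with the name, image, and port invented for illustration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-web                     # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: apache-web
  template:
    metadata:
      labels:
        app: apache-web
      annotations:
        prometheus.io/scrape: "true"   # matched by the keep rule in additionalScrapeConfigs
    spec:
      containers:
        - name: exporter
          image: example.com/apache-exporter:latest   # hypothetical exporter image
          ports:
            - containerPort: 9117      # declared port Prometheus will scrape for this pod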

The resulting metrics show up in the global Prometheus instance; the example below features an Apache web service handling ingress traffic in a cluster labeled "edge."

Prometheus Output

Troubleshooting

Prometheus exposes internal metrics that make remote write issues easier to troubleshoot. For in-depth troubleshooting steps, refer to the detailed Grafana Labs blog post on the subject.

Key metrics to monitor:

  • Queue Tracking: the gap between prometheus_remote_storage_highest_timestamp_in_seconds (newest sample appended locally) and prometheus_remote_storage_queue_highest_sent_timestamp_seconds (newest sample successfully sent) shows how far the remote write queue is falling behind.

  • Shard Considerations: prometheus_remote_storage_shards_desired should stay below prometheus_remote_storage_shards_max. If desired shards exceed the maximum, consider raising max_shards (sketched below) - see Prometheus Parameter Tuning.
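If desired shards do exceed the maximum, the ceiling can be raised on the edge side through the remote write queue settings exposed by the operator. A hedged sketch against the earlier edge values; the numbers are illustrative starting points, not recommendations:

prometheus:
  agentMode: true
  prometheusSpec:
    remoteWrite:
      - name: ${name}
        url: https://${hostname}/api/v1/write
        queueConfig:
          maxShards: 100               # raise only if shards_desired keeps hitting shards_max
          capacity: 10000              # samples buffered per shard before reads from the WAL block
          maxSamplesPerSend: 2000      # batch size per remote write request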
