Daniel Quackenbush

Posted on • Originally published at danquack.dev


Multi-Cluster Prometheus: Scaling Metrics Across Kubernetes Clusters

Building on Bartłomiej Płotka's insightful blog post about Prometheus and its passthrough agent mode, this post walks through implementing multi-cluster Prometheus support. Notably, official support landed in the widely used kube-prometheus-stack chart with its July 2023 release, making it easier to extend Prometheus monitoring across clusters.

Helm Configuration: Connecting Global and Edge Clusters

To deploy Prometheus in your Kubernetes clusters, use the kube-prometheus-stack chart; the steps below assume Helm is installed and configured against each cluster.
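For example, the chart can be installed from the prometheus-community repository; the release name monitoring and the values filename here are assumptions, so adjust them to your environment:

```shell
# Add the chart repository and refresh the index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install (or upgrade) the stack with the values for this cluster
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -f values.yaml
```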

Global Cluster Configuration

Update and apply global cluster values to enable remote metrics reception:



```yaml
prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true
    enableFeatures:
      - remote-write-receiver
```


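To confirm the receiver is actually enabled on the global Prometheus, you can port-forward to it and inspect its runtime flags; the service name and port below assume a default kube-prometheus-stack install with a release named monitoring:

```shell
# Forward the Prometheus web port locally
kubectl port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 &

# The flags endpoint reports whether the remote-write receiver is on
curl -s http://localhost:9090/api/v1/status/flags \
  | grep -o 'web.enable-remote-write-receiver[^,]*'
```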

Edge Cluster(s) Configuration

Update and apply edge cluster values to enable remote metric writing:

  • Specify the global cluster's ${hostname} remote write endpoint.
  • Include a scrape configuration to identify pods annotated with prometheus.io/scrape.
  • Assign a unique ${name} to distinguish metric results.


```yaml
prometheus:
  agentMode: true
  prometheusSpec:
    remoteWrite:
      - name: ${name}
        url: https://${hostname}/api/v1/write
      # Add additional remote write configurations if needed
    additionalScrapeConfigs:
      - job_name: 'kubernetes'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - target_label: cluster
            replacement: ${name}
```


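To illustrate what the keep rule in the scrape configuration does, here is a small Python sketch of Prometheus-style relabeling. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mirrors; the function name is my own, purely for illustration:

```python
import re

def keep_target(labels: dict) -> bool:
    """Mimic the relabel 'keep' action: retain only pods whose
    prometheus.io/scrape annotation matches the anchored regex 'true'."""
    value = labels.get("__meta_kubernetes_pod_annotation_prometheus_io_scrape", "")
    # Prometheus treats relabel regexes as fully anchored (^...$)
    return re.fullmatch("true", value) is not None

print(keep_target({"__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"}))   # True
print(keep_target({"__meta_kubernetes_pod_annotation_prometheus_io_scrape": "false"}))  # False
```

Pods without the annotation, or with any value other than exactly "true", are dropped from scraping.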

The screenshot below shows the resulting metrics in the global Prometheus instance: an Apache web service handling ingress traffic in a cluster labeled "edge."

[Image: Prometheus output]
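Once edge metrics start arriving, you can verify them from the global Prometheus UI with a simple query; the cluster label value comes from the ${name} you assigned in the relabeling rule above:

```promql
count by (cluster) (up)
```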

Troubleshooting

Prometheus exposes internal metrics that make remote-write problems easier to diagnose. For in-depth troubleshooting steps, refer to the detailed Grafana Labs blog post.

Key metrics to monitor:

  • Queue Tracking: compare prometheus_remote_storage_highest_timestamp_in_seconds (the newest sample received) against prometheus_remote_storage_queue_highest_sent_timestamp_seconds (the newest sample successfully sent); a growing gap between them means the remote-write queue is falling behind.

  • Shard Considerations: prometheus_remote_storage_shards_desired should stay below prometheus_remote_storage_shards_max. If desired exceeds the maximum, consider raising max_shards - see Prometheus Parameter Tuning.
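The gap between the two queue timestamps can also be turned into an alertable expression. The sketch below is adapted from the upstream prometheus-mixin's PrometheusRemoteWriteBehind alert; the 120-second threshold is an assumption you should tune to your latency tolerance:

```promql
(
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
- ignoring(remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
) > 120
```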
