Philippe Bürgisser for Camptocamp Infrastructure Solutions

Posted on Mar 3, 2021

TKGI: Observability challenge

#kubernetes #devops #monitoring #prometheus

Introduction

In this post we’re going to review the observability options for a multi-cluster Kubernetes environment managed by VMware TKGI (Tanzu Kubernetes Grid Integrated).

When deployed using the TKGI toolset, Kubernetes comes with the concept of metric sinks to collect data from the platform and/or from applications. Metric sinks are based on Telegraf, and the metrics are pushed to a destination that has to be set in the ClusterMetricSink CR object.

In our use case, TKGI is used to deploy one Kubernetes cluster per application/environment (dev, qa, prod) from which we need to collect metrics. At this customer, we also operate a Prometheus stack which is used to scrape data from traditional virtual machines and containers running on OpenShift, in order to handle alarms and to offer dashboards to end-users via Grafana.
We explored several architectures that fit our current monitoring system and our internal processes.

Architecture 1

In this scenario, we use the (Cluster)MetricSink provided by VMware, configured to push data into a central InfluxDB database. The data forwarded by Telegraf can either be pushed to the Telegraf agent or scraped by Telegraf itself. Telegraf runs as a pod on each node, deployed via a DaemonSet. Grafana has a database connector able to connect to InfluxDB.
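A sink for this architecture might look like the sketch below. It assumes that the sink outputs accept the same options as Telegraf's influxdb output plugin (`urls`, `database`); the sink name, the InfluxDB URL and the database name are placeholders, not values from our setup.

```yaml
apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: influxdb-sink                          # hypothetical name
spec:
  inputs:
    # Let Telegraf scrape pods that expose Prometheus metrics
    - type: prometheus
      monitor_kubernetes_pods: true
  outputs:
    # Assumption: options mirror Telegraf's influxdb output plugin
    - type: influxdb
      urls:
        - "http://influxdb.example.com:8086"   # placeholder URL
      database: tkgi_metrics                   # placeholder database name
```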

Pros

Easiest implementation
No extra software to deploy on Kubernetes
Multi-tenancy of data
RBAC for data access

Cons

InfluxDB cannot scale horizontally and offers no HA in the free version
Need to rewrite Grafana dashboards in order to match the InfluxDB query language
Integration with our current alarm flow is not straightforward

Architecture 2

Telegraf is able to expose data in the Prometheus format over an HTTP endpoint. This is configured using the MetricSink CR. Prometheus then scrapes the Telegraf service.
When Telegraf is deployed on each node using a DaemonSet, it comes with a Kubernetes service through which the pods are exposed. As Prometheus sits outside the targeted cluster, it cannot reach each Telegraf endpoint directly and has to go through that service. The main drawback of this architecture is that the service picks a backing pod for every request, so we cannot ensure that all endpoints are scraped evenly, which may create gaps in the metrics. We also noticed that when Telegraf is configured to expose Prometheus data over HTTP, the service isn’t updated to match the newly exposed port. One solution would have been to create another service in the namespace where Telegraf resides, but RBAC restrictions do not allow us to do so.

apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: my-demo-sink
spec:
  inputs:
  - monitor_kubernetes_pods: true
    type: prometheus
  outputs:
    - type: prometheus_client
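On the Prometheus side, the matching scrape job could be sketched as follows. It assumes that the Telegraf service is reachable from outside the cluster under a hypothetical hostname and that Telegraf listens on 9273, the default port of its prometheus_client output; the job name and the `cluster` label are illustrative.

```yaml
scrape_configs:
  - job_name: tkgi-telegraf                    # hypothetical job name
    metrics_path: /metrics
    static_configs:
      # A single target per cluster: the Kubernetes service, not the pods.
      # Each scrape is answered by whichever Telegraf pod the service selects.
      - targets: ['telegraf.cluster-dev.example.com:9273']   # placeholder host
        labels:
          cluster: dev                         # illustrative label
```

Because the target is the service rather than the individual pods, consecutive scrapes may return metrics from different nodes, which is the source of the gaps described above.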

Pros

We can leverage Telegraf and MetricSinks
Integration with our existing Prometheus stack
Prometheus ServiceDiscovery possible through Kubernetes API

Cons

No direct access to Telegraf endpoints
Depending on the number of targets to discover for each Kubernetes cluster, ServiceDiscovery performance can be impacted

Architecture 3

In this architecture, we configure Prometheus to directly scrape exporters running on each cluster. Unfortunately, each replica of a pod running an exporter exposes its endpoint through a Kubernetes service. As mentioned in architecture 2, Prometheus lives outside the cluster and cannot scrape an individual endpoint directly, so we can’t ensure that scraping is evenly distributed.

Pros

Integration with our Prometheus stack
Prometheus ServiceDiscovery possible through Kubernetes API

Cons

No direct access to exporter endpoints
Does not scale well

Architecture 4

This is a hybrid approach where we leverage the metric tooling provided by VMware. We push all the metrics into an InfluxDB exporter acting as a caching proxy, which is scraped by Prometheus.
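The Prometheus side of this setup could be sketched as below. It assumes the exporter is the Prometheus community influxdb_exporter listening on port 9122, which we believe is its default and should be verified; the hostname and job name are placeholders. The metric sinks of every cluster would push to this same exporter.

```yaml
scrape_configs:
  - job_name: influxdb-exporter                # hypothetical job name
    static_configs:
      # One static target for all clusters: the exporter that receives
      # the metrics pushed by every cluster's Telegraf agents.
      - targets: ['influxdb-exporter.example.com:9122']   # placeholder host
```

All clusters share this single target, so Prometheus cannot tell them apart through service discovery, and the exporter only serves samples it still holds in memory.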

Pros

Leveraging VMware tooling

Cons

InfluxDB exporter becomes an SPOF (Single Point of Failure)
Extra components to manage
No Prometheus ServiceDiscovery available
Data expiration has to be handled

Architecture 5

In this architecture we introduce PushProx, composed of a proxy running on the same cluster as Prometheus and agents that are running on each Kubernetes cluster. These agents initiate a connection towards the proxy to create a tunnel so Prometheus can directly scrape each endpoint through the tunnel.

Each scraping configuration will need to have a proxy referenced:

scrape_configs:
- job_name: node
  proxy_url: http://proxy:8080/
  static_configs:
    - targets: ['client:9100']

Pros

Bypass network segmentation
Integration with our Prometheus stack

Cons

No Prometheus ServiceDiscovery
Scaling issue
Extra component to manage

Architecture 6

In this architecture, a Prometheus instance is deployed on each cluster and scrapes targets residing in that same cluster. With this design, the data is stored on each instance. The major difference with this approach is that only Alertmanager and Grafana are shared across all clusters.
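With the Prometheus operator, a per-cluster instance could be declared roughly as follows. This is a minimal sketch: the resource name, namespace, retention, service account and the `cluster` external label are assumptions chosen for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-prometheus       # hypothetical name
  namespace: monitoring          # hypothetical namespace
spec:
  replicas: 2                    # two identical instances managed by the operator
  retention: 15d                 # illustrative local retention
  externalLabels:
    cluster: app1-dev            # identifies this cluster in shared Grafana and Alertmanager
  serviceAccountName: prometheus # assumed to exist with suitable RBAC
  serviceMonitorSelector: {}     # empty selector: match every ServiceMonitor in this namespace
```

The external label is what lets the shared Alertmanager and Grafana tell apart series coming from different clusters.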

Pros

Best integration with our Prometheus stack
Multi-tenancy
Federation possible
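Federation could be sketched as a scrape job on a central Prometheus that pulls from the `/federate` endpoint of a cluster-local instance. The target hostname and job name are placeholders, and the `match[]` selector below pulls every job, which would normally be narrowed down.

```yaml
scrape_configs:
  - job_name: federate-app1-dev                # hypothetical job name
    honor_labels: true                         # keep the labels set by the cluster-local Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'                        # illustrative: select series from all jobs
    static_configs:
      - targets: ['prometheus.app1-dev.example.com:9090']   # placeholder host
```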

Cons

Memory and CPU footprint due to the repetition of the same service on each cluster
Does not use any TKGI component
Multiple instances to manage

Conclusion

After testing almost all architectures, we came to the conclusion that architecture 6 is the best match for our current architecture and needs. We also favored Prometheus because it can be easily deployed using the operator, and features such as HA are managed automatically. We did, however, have to make some compromises, such as not using the TKGI metric components and “reinventing the wheel”, as we believe that monitoring and alerting should be done by pulling data rather than pushing it.

Disclaimer

This research was conducted on a TKGI environment that was not installed or operated by Camptocamp.
