In the previous chapter of this series we created an EC2 instance with a Prometheus remote-write collector. So today we're gonna push some metrics there with the Grafana Cloud Agent.
At the moment it's also possible to do this with Prometheus running in agent mode. However, I haven't tested that setup yet, so we're gonna stick with the well-supported Grafana Agent.
Kube state metrics
Personally, I think KSM provides the most important metrics for monitoring a cluster: it shows you restarting pods, pending pods and all the other states we're trying to avoid.
All you need to do is install KSM with the Helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics \
--set image.tag=v2.2.0 \
--namespace grafana-agent \
--create-namespace
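Before wiring up the agent, it's worth a quick sanity check that KSM actually serves metrics. This is just an optional verification step: the service name comes from the Helm release above, and 8080 is the chart's default metrics port - adjust it if you've overridden it.
# list the services created by the chart
kubectl -n grafana-agent get svc
# forward the KSM service locally and peek at a few metrics
kubectl -n grafana-agent port-forward svc/ksm-kube-state-metrics 8080:8080 &
curl -s localhost:8080/metrics | grep kube_pod_status_phase | head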
Grafana cloud agent configuration
Now it's time to prepare the configuration file for the scraper. The following snippet comes from the official documentation and basically covers KSM, cAdvisor and the kubelet.
kind: ConfigMap
metadata:
  name: grafana-agent
  namespace: grafana-agent
apiVersion: v1
data:
  agent.yaml: |
    server:
      http_listen_port: 12345
    metrics:
      wal_directory: /tmp/grafana-agent-wal
      global:
        scrape_interval: 60s
        external_labels:
          cluster: cloud
      configs:
        - name: integrations
          remote_write:
            - url: http://p01.prometheus.local:9090/api/v1/write
          scrape_configs:
            - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              job_name: integrations/kubernetes/cadvisor
              kubernetes_sd_configs:
                - role: node
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                  action: keep
              relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
            - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              job_name: integrations/kubernetes/kubelet
              kubernetes_sd_configs:
                - role: node
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                  action: keep
              relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
            - job_name: integrations/kubernetes/kube-state-metrics
              kubernetes_sd_configs:
                - role: service
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                  action: keep
              relabel_configs:
                - action: keep
                  regex: ksm-kube-state-metrics
                  source_labels:
                    - __meta_kubernetes_service_name
You can see that it follows pretty much the same schema as a regular Prometheus configuration. So if you need to adjust the set of metrics that get kept, check the official Prometheus documentation.
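For instance, if you also wanted to keep container restart counts, you would extend the keep regex in metric_relabel_configs. The snippet below is a shortened illustration of the pattern, not a replacement for the full list above; kube_pod_container_status_restarts_total is just an example of an extra KSM metric:
metric_relabel_configs:
  - source_labels: [__name__]
    # metric names are OR-ed together with |; only matching series are kept
    regex: kube_pod_status_phase|kube_pod_container_status_restarts_total
    action: keep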
Now you can just apply the configuration:
kubectl apply -f config.yaml
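If you want to double-check that the ConfigMap ended up where the agent expects it:
kubectl -n grafana-agent get configmap grafana-agent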
Grafana cloud agent deployment
With the configuration in place, we can proceed to the last bit - the Deployment with the Grafana Agent itself. Again, everything is described in the official documentation, but this is what you need to do:
MANIFEST_URL=https://raw.githubusercontent.com/grafana/agent/main/production/kubernetes/agent-bare.yaml \
NAMESPACE=grafana-agent \
/bin/sh -c "$(curl -fsSL https://raw.githubusercontent.com/grafana/agent/release/production/kubernetes/install-bare.sh)" > deploy.yaml
and apply the manifest:
kubectl apply -f deploy.yaml
As a result, we have two pods running in the grafana-agent namespace:
NAME READY STATUS RESTARTS AGE
grafana-agent-6f8b68fd6-hbnkr 1/1 Running 0 6d16h
ksm-kube-state-metrics-7d8f59c464-tb9st 1/1 Running 0 10d
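If the pods are running but nothing shows up on the Prometheus side, the agent logs are the first place to look - remote_write problems (DNS, connectivity, 4xx/5xx responses) surface there. The deployment name below assumes the default from agent-bare.yaml:
# look for remote_write or scrape errors in the agent logs
kubectl -n grafana-agent logs deploy/grafana-agent | grep -iE 'error|remote_write' | tail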
Result
And guess what: now we have metrics stored in Prometheus! If you're wondering why I'm talking to localhost:9090, check the previous chapter - it's the AWS SSM port-forwarding functionality, which is really useful for basic debugging.
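A quick way to confirm that the data really landed in Prometheus is to query its HTTP API over that same port-forward. The job label below assumes the job names from the agent config above:
# ask Prometheus for the 'up' series pushed by the agent
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="integrations/kubernetes/kube-state-metrics"}'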
Wrap
So now we have a complete monitoring stack. But do we really need some strange port-forwarding to get to it? Not really. The gold standard for viewing metrics is Grafana, and that's exactly what we're gonna set up in the next chapter. Stay tuned!