DEV Community: ii2day

How to quickly implement proactive, blind-spot-free network connectivity inspection in large-scale clusters

ii2day — Mon, 15 Jan 2024 09:28:09 +0000

01 What is inspection

Cluster inspection is the process of periodically checking and evaluating a cluster system. Its main purpose is to ensure the stability, performance, and security of the cluster. Its main uses are:

Troubleshooting and problem diagnosis: Inspection helps identify faults in the cluster and supports diagnosis and remediation. By checking the cluster's components, configuration, and operating state, potential sources of failure and performance bottlenecks can be found early and fixed.
Performance optimization: Inspection assesses the cluster's performance and resource utilization. Analyzing load, resource allocation, and configuration reveals performance bottlenecks and wasted resources, and yields optimization recommendations that improve the cluster's performance and efficiency.
Security audit and compliance: Inspection checks the cluster's security and compliance, including access control, authentication, and data protection. Auditing the security configuration, vulnerability management, and compliance rules reveals potential security risks and compliance issues so they can be remediated.
Capacity planning and scalability: Inspection assesses the cluster's capacity usage and scaling needs. This helps predict future resource requirements, plan scaling strategies, and ensure the cluster has enough capacity to keep up with business growth and change.
High availability and redundancy: Inspection assesses the cluster's high-availability and redundancy strategies. Examining failover, backup, and recovery mechanisms reveals potential single points of failure and availability issues, and yields recommendations for improving reliability and redundancy.

02 Pain Points of Traditional Active Network Inspection

Active inspection is mostly done by hand: CLI tools or scripts inject load into the cluster and observe how it responds. This approach has many shortcomings.

Running inspection commands by hand becomes impractical when the cluster is large, inspections are frequent, or the inspection procedure is complex.
Implementing inspections as shell scripts raises the skill threshold for O&M personnel, and scripting bugs undermine the accuracy of the inspection conclusions.
When multiple load generators are needed to raise the request volume and connection count, each one must be configured and tuned, which increases the cost of preparing the load-test environment.
Untuned test tools and inexperience with their configuration limit the load that can actually be generated, so tests fall short of their purpose and lead to wrong conclusions.
K8s applications mostly rely on each product's own inspection capability, which confirms cluster status from application metrics, logs, and state. The metrics an application exposes are limited, so they cannot support a complete inspection conclusion.
In a large-scale K8s cluster, operators want to confirm Pod network connectivity between all nodes, both to catch a network failure on any single node and to detect occasional packet loss. There are many communication channels to cover, including Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP, as well as Pods with multiple NICs and dual-stack IPs. Inspecting all of this by hand is inefficient and costly to maintain.
Different applications need different inspection tools (DNS services, business application services, disks, and so on), so O&M personnel need in-depth knowledge of each tool, which raises the skill threshold considerably.
Different inspection tools produce reports in different styles, and none of them presents a detailed inspection report in a cloud-native way.
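
To get a feel for why manual inspection does not scale, the number of connectivity checks can be estimated with a rough model. The sketch below assumes, purely for illustration, that every channel is checked once per directed node pair and per IP family; `count_probe_paths` and its parameters are hypothetical names, not part of any tool.

```python
def count_probe_paths(node_count: int, channels: list[str], ip_families: int = 1) -> int:
    """Estimate directed (source node, target node, channel, IP family) checks.

    Assumption: each channel is probed once per ordered node pair and IP family.
    Real clusters differ (a ClusterIP is not per node, for example), so treat
    the result as an order-of-magnitude figure.
    """
    directed_pairs = node_count * (node_count - 1)
    return directed_pairs * len(channels) * ip_families


# Channel names taken from the list of communication channels above.
CHANNELS = ["Pod IP", "ClusterIP", "NodePort", "LoadBalancer IP", "Ingress IP"]

if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        total = count_probe_paths(nodes, CHANNELS, ip_families=2)
        print(f"{nodes} nodes, dual-stack: {total} checks per inspection round")
```

Under these assumptions the count grows quadratically with the number of nodes, which is why checking by hand quickly becomes impractical.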

03 Solution: kdoctor

kdoctor is a Kubernetes data-plane testing component that uses active load injection to run functional and performance tests against a cluster. It abstracts the routine needs of O&M personnel so that network, storage, and application O&M tasks can be carried out in a cloud-native way. Its design is based on CRDs and can integrate with observability components.

kdoctor provides three main types of inspection:

kdoctor NetReach: based on the task configuration, checks connectivity to Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP addresses in the cluster, including Pods with multiple NICs and dual-stack IPs.
kdoctor AppHttpHealthy: based on the task configuration, checks connectivity to specified addresses inside or outside the cluster over HTTP or HTTPS, supporting PUT, GET, POST, and other request methods.
kdoctor NetDns: based on the task configuration, checks connectivity to specified DNS servers inside or outside the cluster, supporting the udp, tcp, and tcp-tls protocols.

kdoctor addresses the problems of traditional active inspection with the following design:

Inspection tasks are configured by applying a CRD, so the user only needs to specify the inspection target, inspection frequency, load parameters, and expected results.
kdoctor reads the task configuration and runs the load-generating agent as a Deployment or DaemonSet, which gives the effect of multiple load generators.
Depending on the task spec, kdoctor either uses the default agent or creates a new agent to run the task, which allows both resource reuse and per-task resource isolation.
kdoctor binds the corresponding target resources, such as an Ingress or Service. According to the task configuration, the agent pods access the bound resources across one another, and conclusions are drawn from the request results.
kdoctor's load-generating client is performance-tuned, which greatly reduces resource consumption while sending requests.
kdoctor outputs inspection reports through logs, an aggregated API, and files written to disk.

04 Installation and Usage

Install kdoctor by following the official kdoctor documentation.

This article uses NetReach as an example to run a cluster connectivity inspection.

Apply the following NetReach task. It starts immediately and runs a single round lasting 10s, in which the default agent on each node uses HTTP to access the IPv4 addresses of the targets enabled in the manifest: ClusterIP, Endpoint, and NodePort. LoadBalancer, Ingress, and Multus interfaces are disabled in this example.

cat <<EOF | kubectl apply -f -
apiVersion: kdoctor.io/v1beta1
kind: NetReach
metadata:
  name: reach-task
spec:
  expect:
    meanAccessDelayInMs: 1500
    successRate: 1
  request:
    durationInSecond: 10
    perRequestTimeoutInMS: 1500
    qps: 10
  schedule:
    roundNumber: 1
    roundTimeoutMinute: 1
    schedule: 0 1
  target:
    clusterIP: true
    endpoint: true
    ingress: false
    ipv4: true
    loadBalancer: false
    multusInterface: false
    nodePort: true
EOF
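
When several similar tasks are needed, the manifest can also be generated rather than written by hand. The following is a minimal sketch, not kdoctor tooling: `build_netreach_task` and the file name `build_task.py` are hypothetical, and the field names and values are copied from the manifest above.

```python
import json


def build_netreach_task(name: str, duration_s: int = 10, qps: int = 10,
                        timeout_ms: int = 1500, rounds: int = 1) -> dict:
    """Return a NetReach manifest mirroring the YAML shown above."""
    return {
        "apiVersion": "kdoctor.io/v1beta1",
        "kind": "NetReach",
        "metadata": {"name": name},
        "spec": {
            "expect": {"meanAccessDelayInMs": 1500, "successRate": 1},
            "request": {
                "durationInSecond": duration_s,
                "perRequestTimeoutInMS": timeout_ms,
                "qps": qps,
            },
            "schedule": {
                "roundNumber": rounds,
                "roundTimeoutMinute": 1,
                "schedule": "0 1",
            },
            "target": {
                "clusterIP": True, "endpoint": True, "ingress": False,
                "ipv4": True, "loadBalancer": False,
                "multusInterface": False, "nodePort": True,
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, for example:
    #   python build_task.py | kubectl apply -f -
    print(json.dumps(build_netreach_task("reach-task"), indent=2))
```

Emitting JSON keeps the sketch free of third-party dependencies; the resulting object is equivalent to the YAML form.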

View the inspection task

~# kubectl get netreach
NAME         FINISH   EXPECTEDROUND   DONEROUND   LASTROUNDSTATUS   SCHEDULE
reach-task   true     1               1           succeed           0 1

View the inspection task report

The kdoctor controller aggregates the inspection task reports and displays them via an aggregation API.

~# kubectl get kdoctorreport  reach-task -oyaml
apiVersion: system.kdoctor.io/v1beta1
kind: KdoctorReport
metadata:
  creationTimestamp: null
  name: reach-task
spec:
  FailedRoundNumber: null
  FinishedRoundNumber: 1
  Report:
  - EndTimeStamp: "2023-09-21T11:30:33Z"
    NetReachTask:
      Detail:
      - MeanDelay: 50.294117
        Metrics:
          Duration: 15.004307799s
          EndTime: "2023-09-21T11:30:33Z"
          Errors: {}
          Latencies:
            Max_inMx: 0
            Mean_inMs: 50.294117
            Min_inMs: 0
            P50_inMs: 0
            P90_inMs: 0
            P95_inMs: 0
            P99_inMs: 0
          RequestCounts: 102
          StartTime: "2023-09-21T11:30:18Z"
          StatusCodes:
            "200": 102
          SuccessCounts: 102
          TPS: 6.798047691796755
          TotalDataSize: 39295 byte
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentClusterV4IP_10.233.32.45:80
        TargetUrl: http://10.233.32.45:80
        ....
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentPodV4IP_kdoctor-netreach-reach-task-pmndx_10.233.74.96
        TargetUrl: http://10.233.74.96:80
    NodeName: worker-node-1
    PodName: kdoctor-netreach-reach-task-lwbtk
    ReportType: agent test report
    RoundDuration: 15.049239468s
    RoundNumber: 1
    RoundResult: succeed
    StartTimeStamp: "2023-09-21T11:30:18Z"
    TaskName: netreach.reach-task
    TaskType: NetReach
  ReportRoundNumber: 1
  RoundNumber: 1
  Status: Finished
  TaskName: reach-task
  TaskType: NetReach
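
The figures in a report entry can be cross-checked against the task's `expect` block. The sketch below is an illustration only: the pass rule (success rate at or above `successRate`, mean delay at or below `meanAccessDelayInMs`) is an assumption about how the thresholds are meant to be read, `evaluate_target` is a hypothetical helper, and the input keys `DurationSeconds` and `MeanDelayMs` are simplified stand-ins for the report's `Duration` and `MeanDelay` fields.

```python
def evaluate_target(metrics: dict, expect: dict) -> dict:
    """Recompute summary figures for one target and compare with `expect`.

    Assumption: a target passes when its success rate is at least
    expect["successRate"] and its mean latency is at most
    expect["meanAccessDelayInMs"]. kdoctor's real rules may differ.
    """
    requests = metrics["RequestCounts"]
    success_rate = metrics["SuccessCounts"] / requests if requests else 0.0
    throughput = requests / metrics["DurationSeconds"]
    passed = (success_rate >= expect["successRate"]
              and metrics["MeanDelayMs"] <= expect["meanAccessDelayInMs"])
    return {"success_rate": success_rate, "throughput": throughput, "passed": passed}


# Values copied from the first Detail entry of the report above.
REPORT_METRICS = {
    "RequestCounts": 102,
    "SuccessCounts": 102,
    "DurationSeconds": 15.004307799,
    "MeanDelayMs": 50.294117,
}
# Thresholds copied from spec.expect of the reach-task manifest.
EXPECT = {"meanAccessDelayInMs": 1500, "successRate": 1}

if __name__ == "__main__":
    print(evaluate_target(REPORT_METRICS, EXPECT))
```

Dividing the request count by the duration reproduces the TPS value shown in the report (about 6.798), which suggests TPS there is simply requests per second over the measured duration.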

05 Summary

kdoctor is not meant to replace traditional, professional testing tools, nor to be a complete inspection solution. It aims to provide a simple, fast, efficient, cloud-native O&M testing tool that fills functional gaps in current O&M testing, reduces the O&M burden, and feeds inspection results into the product's ecosystem.

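How to quickly implement proactive, blind-spot-free network connectivity inspection in large-scale clusters (translated from the Chinese edition)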

ii2day — Mon, 15 Jan 2024 09:11:52 +0000

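01 What is inspection

Cluster inspection is the process of periodically checking and evaluating a cluster system. Its main purpose is to ensure the stability, performance, and security of the cluster. Its main uses are:

Troubleshooting and problem diagnosis: Inspection helps identify faults in the cluster and supports diagnosis and remediation. By checking the cluster's components, configuration, and operating state, potential sources of failure and performance bottlenecks can be found early and fixed.
Performance optimization: Inspection assesses the cluster's performance and resource utilization. Analyzing load, resource allocation, and configuration reveals performance bottlenecks and wasted resources, and yields optimization recommendations that improve the cluster's performance and efficiency.
Security audit and compliance: Inspection checks the cluster's security and compliance, including access control, authentication, and data protection. Auditing the security configuration, vulnerability management, and compliance rules reveals potential security risks and compliance issues so they can be remediated.
Capacity planning and scalability: Inspection assesses the cluster's capacity usage and scaling needs. This helps predict future resource requirements, plan scaling strategies, and ensure the cluster has enough capacity to keep up with business growth and change.
High availability and redundancy: Inspection assesses the cluster's high-availability and redundancy strategies. Examining failover, backup, and recovery mechanisms reveals potential single points of failure and availability issues, and yields recommendations for improving reliability and redundancy.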

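02 Pain Points of Traditional Active Network Inspection

Active inspection is mostly done by hand: CLI tools or scripts inject load into the cluster and observe how it responds. This approach has many shortcomings.

Running inspection commands by hand becomes impractical when the cluster is large, inspections are frequent, or the inspection procedure is complex.
Implementing inspections as shell scripts raises the skill threshold for O&M personnel, and scripting bugs undermine the accuracy of the inspection conclusions.
When multiple load generators are needed to raise the request volume and connection count, each one must be configured and tuned, which increases the cost of preparing the load-test environment.
Untuned test tools and inexperience with their configuration limit the load that can actually be generated, so tests fall short of their purpose and lead to wrong conclusions.
K8s applications mostly rely on each product's own inspection capability, which confirms cluster status from application metrics, logs, and state. The metrics an application exposes are limited, so they cannot support a complete inspection conclusion.
In a large-scale K8s cluster, operators want to confirm Pod network connectivity between all nodes, both to catch a network failure on any single node and to detect occasional packet loss. There are many communication channels to cover, including Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP, as well as Pods with multiple NICs and dual-stack IPs. Inspecting all of this by hand is inefficient and costly to maintain.
Different applications need different inspection tools (DNS services, business application services, disks, and so on), so O&M personnel need in-depth knowledge of each tool, which raises the skill threshold considerably.
Different inspection tools produce reports in different styles, and none of them presents a detailed inspection report in a cloud-native way.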

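03 Solution: kdoctor

kdoctor is a Kubernetes data-plane testing component that uses active load injection to run functional and performance tests against a cluster. It abstracts the routine needs of O&M personnel so that network, storage, and application O&M tasks can be carried out in a cloud-native way. Its design is based on CRDs and can integrate with observability components.

kdoctor provides three main types of inspection:

kdoctor NetReach: based on the task configuration, checks connectivity to Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP addresses in the cluster, including Pods with multiple NICs and dual-stack IPs.
kdoctor AppHttpHealthy: based on the task configuration, checks connectivity to specified addresses inside or outside the cluster over HTTP or HTTPS, supporting PUT, GET, POST, and other request methods.
kdoctor NetDns: based on the task configuration, checks connectivity to specified DNS servers inside or outside the cluster, supporting the udp, tcp, and tcp-tls protocols.

kdoctor addresses the problems of traditional active inspection with the following design:

Inspection tasks are configured by applying a CRD, so the user only needs to specify the inspection target, inspection frequency, load parameters, and expected results.
kdoctor reads the task configuration and runs the load-generating agent as a Deployment or DaemonSet, which gives the effect of multiple load generators.
Depending on the task spec, kdoctor either uses the default agent or creates a new agent to run the task, which allows both resource reuse and per-task resource isolation.
kdoctor binds the corresponding target resources, such as an Ingress or Service. According to the task configuration, the agent pods access the bound resources across one another, and conclusions are drawn from the request results.
kdoctor's load-generating client is performance-tuned, which greatly reduces resource consumption while sending requests.
kdoctor outputs inspection reports through logs, an aggregated API, and files written to disk.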

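04 Installation and Usage

Install kdoctor by following the official kdoctor documentation.

This article uses NetReach as an example to run a cluster connectivity inspection.

Apply the following NetReach task. It starts immediately and runs a single round lasting 10s, in which the default agent on each node uses HTTP to access the IPv4 addresses of the targets enabled in the manifest: ClusterIP, Endpoint, and NodePort. LoadBalancer, Ingress, and Multus interfaces are disabled in this example.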

cat <<EOF | kubectl apply -f -
apiVersion: kdoctor.io/v1beta1
kind: NetReach
metadata:
  name: reach-task
spec:
  expect:
    meanAccessDelayInMs: 1500
    successRate: 1
  request:
    durationInSecond: 10
    perRequestTimeoutInMS: 1500
    qps: 10
  schedule:
    roundNumber: 1
    roundTimeoutMinute: 1
    schedule: 0 1
  target:
    clusterIP: true
    endpoint: true
    ingress: false
    ipv4: true
    loadBalancer: false
    multusInterface: false
    nodePort: true
EOF

View the inspection task

~# kubectl get netreach
NAME         FINISH   EXPECTEDROUND   DONEROUND   LASTROUNDSTATUS   SCHEDULE
reach-task   true     1               1           succeed           0 1

View the inspection task report

The kdoctor controller aggregates the inspection task reports and displays them via an aggregation API.

~# kubectl get kdoctorreport  reach-task -oyaml
apiVersion: system.kdoctor.io/v1beta1
kind: KdoctorReport
metadata:
  creationTimestamp: null
  name: reach-task
spec:
  FailedRoundNumber: null
  FinishedRoundNumber: 1
  Report:
  - EndTimeStamp: "2023-09-21T11:30:33Z"
    NetReachTask:
      Detail:
      - MeanDelay: 50.294117
        Metrics:
          Duration: 15.004307799s
          EndTime: "2023-09-21T11:30:33Z"
          Errors: {}
          Latencies:
            Max_inMx: 0
            Mean_inMs: 50.294117
            Min_inMs: 0
            P50_inMs: 0
            P90_inMs: 0
            P95_inMs: 0
            P99_inMs: 0
          RequestCounts: 102
          StartTime: "2023-09-21T11:30:18Z"
          StatusCodes:
            "200": 102
          SuccessCounts: 102
          TPS: 6.798047691796755
          TotalDataSize: 39295 byte
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentClusterV4IP_10.233.32.45:80
        TargetUrl: http://10.233.32.45:80
        ....
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentPodV4IP_kdoctor-netreach-reach-task-pmndx_10.233.74.96
        TargetUrl: http://10.233.74.96:80
    NodeName: worker-node-1
    PodName: kdoctor-netreach-reach-task-lwbtk
    ReportType: agent test report
    RoundDuration: 15.049239468s
    RoundNumber: 1
    RoundResult: succeed
    StartTimeStamp: "2023-09-21T11:30:18Z"
    TaskName: netreach.reach-task
    TaskType: NetReach
  ReportRoundNumber: 1
  RoundNumber: 1
  Status: Finished
  TaskName: reach-task
  TaskType: NetReach

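05 Summary

kdoctor is not meant to replace traditional, professional testing tools, nor to be a complete inspection solution. It aims to provide a simple, fast, efficient, cloud-native O&M testing tool that fills functional gaps in current O&M testing, reduces the O&M burden, and feeds inspection results into the product's ecosystem.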