cylon
Envoy: Outlier Detection

outlier detection

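In anomaly detection, you often need to decide whether a new observation belongs to the same distribution as the existing observations (in which case it is called an inlier) or should be treated as different (an outlier). An outlier is anomalous data, but it is not necessarily an erroneous data point.

In Envoy, outlier detection is the process of dynamically determining whether some hosts in an upstream cluster are behaving abnormally and then removing them from the healthy load-balancing set. Outlier detection can be enabled together with health checking or independently of it, and the two form the basis of the overall upstream health checking solution.

The concepts are not covered in depth here; for details, see the official documentation or search online.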

Detection types

  • Consecutive 5xx
  • Consecutive gateway failure
  • Consecutive local origin failure
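
The first two detection types can be pictured as per-host streak counters. The sketch below is a simplified model, not Envoy's implementation: the class `HostStreaks` and its method names are hypothetical, and local origin failures (connection-level errors rather than HTTP status codes) are not modeled. It assumes, as described in the Envoy documentation, that gateway errors are the 502, 503, and 504 status codes.

```python
from dataclasses import dataclass

# Status codes the Envoy documentation describes as "gateway errors".
GATEWAY_ERRORS = {502, 503, 504}


@dataclass
class HostStreaks:
    """Hypothetical per-host counters; not Envoy's real data structure."""

    consecutive_5xx: int = 0
    consecutive_gateway_failure: int = 0

    def record(self, status: int) -> None:
        # Any 5xx extends the 5xx streak; any other response resets it.
        self.consecutive_5xx = self.consecutive_5xx + 1 if 500 <= status <= 599 else 0
        # Only 502/503/504 extend the gateway-failure streak.
        self.consecutive_gateway_failure = (
            self.consecutive_gateway_failure + 1 if status in GATEWAY_ERRORS else 0
        )

    def should_eject(self, threshold_5xx: int, threshold_gateway: int) -> bool:
        # A host becomes an ejection candidate once either streak hits its threshold.
        return (
            self.consecutive_5xx >= threshold_5xx
            or self.consecutive_gateway_failure >= threshold_gateway
        )
```

In this model a single successful response resets both streaks, so a host that alternates between 502 and 200 never reaches a threshold of 2.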

For more details, see the official outlier detection documentation.

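Outlier detection test

Note: this test runs in a single-machine environment only; for anything beyond that, refer to your actual environment.

Environment setup

docker-compose simulating five backend nodes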

version: '3'
services:
  envoy:
    image: envoyproxy/envoy-alpine:v1.15-latest
    environment: 
    - ENVOY_UID=0
    ports:
    - 80:80
    - 443:443
    - 82:9901
    volumes:
    - ./envoy.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        aliases:
        - envoy
    depends_on:
    - webserver1
    - webserver2
    - webserver3
    - webserver4
    - webserver5

  webserver1:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
        - myservice
        - webservice
    expose:
    - 90
  webserver2:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
        - myservice
        - webservice
    expose:
    - 90
  webserver3:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
        - myservice
        - webservice
    expose:
    - 90
  webserver4:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
        - myservice
        - webservice
    expose:
    - 90
  webserver5:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
        - myservice
        - webservice
    expose:
    - 90
networks:
  envoymesh: {}

Envoy configuration file

admin:
  access_log_path: /dev/null
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router

  clusters:
  - name: local_service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webservice, port_value: 90 }
    health_checks:  # repeated field, so written as a list
    - timeout: 3s
      interval: 90s
      unhealthy_threshold: 5
      healthy_threshold: 5
      no_traffic_interval: 240s
      http_health_check:
        path: "/ping"
        expected_statuses:  # half-open range [start, end), i.e. only 200
        - start: 200
          end: 201
    outlier_detection:
      consecutive_5xx: 2
      base_ejection_time: 30s
      max_ejection_percent: 40
      interval: 20s
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 10

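Configuration notes

    outlier_detection:
      consecutive_5xx: 2 # number of consecutive 5xx responses before a host is ejected
      base_ejection_time: 30s # base ejection time; the actual time is the base time multiplied by the number of times the host has been ejected
      max_ejection_percent: 40 # maximum percentage of the cluster that can be ejected; the default is 10%, here 40%, i.e. 2 of the 5 nodes
      interval: 20s # interval between ejection analysis sweeps
      success_rate_minimum_hosts: 5 # minimum number of hosts in the cluster required for success-rate detection
      success_rate_request_volume: 10 # minimum number of requests a host must receive in one interval to be included in success-rate detection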

To make the effect easier to observe, both the active health check interval and the host ejection time have been increased.
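
The arithmetic behind these settings can be sketched in a few lines. This is only an illustration of the reading given in the configuration notes (ejection time = base time × ejection count, and 40% of 5 nodes = 2 nodes); the function names are hypothetical, and Envoy's own rounding and any upper bound on the ejection time may differ.

```python
def ejection_seconds(base_ejection_time_s: float, times_ejected: int) -> float:
    """Ejection duration as described above: base time times ejection count."""
    return base_ejection_time_s * times_ejected


def max_ejectable_hosts(cluster_size: int, max_ejection_percent: int) -> int:
    """Rough upper bound on ejected hosts, rounded down (an assumption)."""
    return cluster_size * max_ejection_percent // 100


# With the values used in this article:
first = ejection_seconds(30, 1)      # first ejection of a host
third = ejection_seconds(30, 3)      # third ejection of the same host
limit = max_ejectable_hosts(5, 40)   # five backends, 40% limit
```

With `base_ejection_time: 30s`, a host that keeps misbehaving stays out longer each time it is ejected, which is why repeated tests against the same host take progressively longer to recover.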

Routes

/502bad simulates a 502 error
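
One way to generate a mix of normal and failing requests is a small script using only the Python standard library. This is a sketch, not the method used for the test in this article: the helper names `request_plan` and `fetch_status` are hypothetical, and the base URL `http://localhost` assumes the `80:80` port mapping from the docker-compose file.

```python
import urllib.error
import urllib.request
from typing import List


def request_plan(base_url: str, ok_count: int, bad_count: int) -> List[str]:
    """Build a list of URLs: ok_count requests to / then bad_count to /502bad."""
    return [f"{base_url}/"] * ok_count + [f"{base_url}/502bad"] * bad_count


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, including for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


# Example (requires the docker-compose stack to be running):
#   for url in request_plan("http://localhost", ok_count=3, bad_count=7):
#       print(url, fetch_status(url))
```

Because the cluster uses ROUND_ROBIN, enough consecutive `/502bad` requests are needed for the same backend to receive two failures in a row.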

Results

Simulate a mix of 5xx and 200 requests

 workers
envoy_1       | [2020-09-13 06:10:01.093][1][warning][main] [source/server/server.cc:537] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
webserver2_1  | [GIN] 2020/09/13 - 06:10:08 | 200 |      63.272µs |      172.22.0.7 | GET      "/"
webserver5_1  | [GIN] 2020/09/13 - 06:10:10 | 200 |      46.732µs |      172.22.0.7 | GET      "/"
webserver1_1  | [GIN] 2020/09/13 - 06:10:11 | 200 |       45.43µs |      172.22.0.7 | GET      "/"
webserver3_1  | [GIN] 2020/09/13 - 06:10:13 | 502 |      43.858µs |      172.22.0.7 | GET      "/502bad"
webserver4_1  | [GIN] 2020/09/13 - 06:10:14 | 502 |      47.486µs |      172.22.0.7 | GET      "/502bad"
webserver2_1  | [GIN] 2020/09/13 - 06:10:15 | 200 |      15.691µs |      172.22.0.7 | GET      "/"
webserver5_1  | [GIN] 2020/09/13 - 06:10:16 | 200 |      14.719µs |      172.22.0.7 | GET      "/"
webserver1_1  | [GIN] 2020/09/13 - 06:10:16 | 200 |      15.758µs |      172.22.0.7 | GET      "/"
webserver3_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      15.697µs |      172.22.0.7 | GET      "/502bad"
webserver2_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      14.002µs |      172.22.0.7 | GET      "/502bad"
webserver5_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      14.913µs |      172.22.0.7 | GET      "/502bad"
webserver1_1  | [GIN] 2020/09/13 - 06:10:18 | 502 |      14.911µs |      172.22.0.7 | GET      "/502bad"
webserver4_1  | [GIN] 2020/09/13 - 06:10:18 | 502 |      30.429µs |      172.22.0.7 | GET      "/502bad"
webserver5_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      14.377µs |      172.22.0.7 | GET      "/"
webserver1_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      14.861µs |      172.22.0.7 | GET      "/"
webserver2_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      18.924µs |      172.22.0.7 | GET      "/"
webserver5_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      15.899µs |      172.22.0.7 | GET      "/"
webserver1_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      24.849µs |      172.22.0.7 | GET      "/"

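The cluster has ejected 40% of its nodes (two of five: in the log above, webserver3 and webserver4 each returned two consecutive 502s), and their health status is failed_outlier_check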
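
The ejected hosts can be listed by reading the `/clusters` output of the Envoy admin interface, which prints one `health_flags` line per host. The parser below is a minimal sketch: the function name `ejected_hosts` is hypothetical, and the sample text uses made-up addresses that only imitate the `cluster::address::health_flags::flags` line format.

```python
from typing import List

# Hypothetical sample imitating the admin /clusters output; addresses are made up.
SAMPLE = """\
local_service::172.22.0.2:90::health_flags::healthy
local_service::172.22.0.3:90::health_flags::/failed_outlier_check
local_service::172.22.0.3:90::cx_active::0
"""


def ejected_hosts(clusters_text: str) -> List[str]:
    """Return addresses whose health_flags include failed_outlier_check."""
    hosts = []
    for line in clusters_text.splitlines():
        left, sep, flags = line.partition("::health_flags::")
        if not sep or "failed_outlier_check" not in flags:
            continue  # not a health_flags line, or the host is not ejected
        # left looks like "<cluster>::<address>"; keep only the address part.
        _cluster, sep2, address = left.partition("::")
        if sep2:
            hosts.append(address)
    return hosts
```

With the docker-compose file in this article, the admin port 9901 is published on host port 82, so the text to parse would come from `http://localhost:82/clusters`.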

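Requests are now distributed across the remaining three nodes

After 30 seconds, the ejected hosts have returned to normal

Simulate the requests again

After 30 seconds, if no new requests arrive within the interval, the node still shows failed_outlier_check; it recovers once new requests arrive.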
