<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: spacewander</title>
    <description>The latest articles on DEV Community by spacewander (@spacewander).</description>
    <link>https://dev.to/spacewander</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F918919%2Ffd004655-6968-40b8-a85b-8cfd3daf8563.jpeg</url>
      <title>DEV Community: spacewander</title>
      <link>https://dev.to/spacewander</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/spacewander"/>
    <language>en</language>
    <item>
      <title>Uncounted Tokens: The Game of Attack and Defense in AI Gateway Rate Limiting</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Fri, 12 Dec 2025 11:28:50 +0000</pubDate>
      <link>https://dev.to/spacewander/uncounted-tokens-the-game-of-attack-and-defense-in-ai-gateway-rate-limiting-3mnk</link>
      <guid>https://dev.to/spacewander/uncounted-tokens-the-game-of-attack-and-defense-in-ai-gateway-rate-limiting-3mnk</guid>
      <description>&lt;h2&gt;
  
  
  Attack
&lt;/h2&gt;

&lt;p&gt;AI gateways typically feature a specific function: performing rate limiting based on token consumption. In some contexts, this is called &lt;code&gt;ai-rate-limiting&lt;/code&gt;, while in others, it is known as &lt;code&gt;ai-quota&lt;/code&gt;. Regardless of the name, the principle remains the same: it relies on the &lt;code&gt;token usage&lt;/code&gt; information returned at the end of an inference request.&lt;/p&gt;

&lt;p&gt;The method to bypass these restrictions is then evident: find a way to prevent the gateway from seeing the &lt;code&gt;usage&lt;/code&gt; information at the end of the inference request. Sometimes users bypass these limits unintentionally. For example, in the OpenAI Chat Completions API, &lt;code&gt;usage&lt;/code&gt; is not returned during streaming unless the user explicitly sets &lt;code&gt;stream_options.include_usage&lt;/code&gt; in the request: &lt;a href="https://platform.openai.com/docs/api-reference/chat/create#chat_create-stream_options-include_usage" rel="noopener noreferrer"&gt;https://platform.openai.com/docs/api-reference/chat/create#chat_create-stream_options-include_usage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Suppose the model provider always provides token usage, or the gateway performs a "hack" when processing user requests to inject &lt;code&gt;include_usage&lt;/code&gt;, ensuring that token usage is always present at the end of an inference request. What then?&lt;/p&gt;

&lt;p&gt;There is still a way: force the inference request to terminate early. We can craft a prompt that instructs the model to emit a "stop word" once the real answer is complete, and then to carry out some time-consuming task. When the client sees this stop word, it can safely disconnect. As long as the request is interrupted prematurely, the gateway will not keep the connection to the upstream provider open and, naturally, will never receive the final usage report sent by the upstream.&lt;/p&gt;
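&lt;p&gt;A minimal, self-contained sketch of this attack; the stop word, prompt wording, and simulated stream are all invented for illustration:&lt;/p&gt;

```python
# Hypothetical illustration of the early-disconnect attack.
STOP_WORD = "=STOP="

ATTACK_PROMPT = (
    "Answer my question first. When the answer is complete, print "
    + STOP_WORD
    + " and then write a 10,000-word essay about anything."
)

def consume_stream(chunks):
    """Read streamed chunks and disconnect as soon as the stop word appears.

    The trailing "essay" and the end-of-stream usage report are never
    received, so a gateway that bills on that report sees nothing.
    """
    received = []
    for chunk in chunks:
        received.append(chunk)
        if STOP_WORD in "".join(received):
            break  # simulate closing the connection early
    return "".join(received)

# Simulated upstream stream: the real answer, the stop word (split across
# chunks), then filler the attacker never waits for.
fake_stream = ["The answer is 42. ", "=ST", "OP=", "filler " * 100, "[usage report]"]
```

&lt;p&gt;Running &lt;code&gt;consume_stream(fake_stream)&lt;/code&gt; returns only the text up to the stop word; the usage report at the tail is never read.&lt;/p&gt;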

&lt;p&gt;Of course, certain configurations can modify this behavior, such as Nginx's &lt;code&gt;proxy_ignore_client_abort&lt;/code&gt;. However, doing so carries a risk: if a legitimate client wants to terminate inference early, but the gateway continues communicating with the upstream, it could result in the user being overcharged.&lt;/p&gt;

&lt;p&gt;While these little tricks can fool middleware, the inference engine side still knows exactly how many input tokens were received during &lt;code&gt;prefill&lt;/code&gt; and how many output tokens were generated during &lt;code&gt;decode&lt;/code&gt;. Therefore, the final bill sent to you will still be correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defense
&lt;/h2&gt;

&lt;p&gt;The various attack methods mentioned above essentially reveal a structural pain point in current AI gateway architectures regarding streaming scenarios: &lt;strong&gt;asynchronous billing&lt;/strong&gt;. In the traditional Request-Response model, gateways can easily intercept and calculate traffic. However, in LLM streaming interactions, token consumption is a dynamic process that accumulates over time, and accurate &lt;code&gt;token usage&lt;/code&gt; reporting often lags behind the end of the request. As long as the gateway relies on "post-event" reporting, clients have the opportunity to use disconnection strategies to create a "billing black hole."&lt;/p&gt;

&lt;p&gt;So, is there a reliable way to calculate the actual token usage during communication without relying on the &lt;code&gt;token usage&lt;/code&gt; information in the inference request?&lt;/p&gt;

&lt;p&gt;The simplest and crudest method is to multiply the number of bytes by a "magic number" coefficient to serve as a fallback when &lt;code&gt;token usage&lt;/code&gt; cannot be found. If you can turn a blind eye to accuracy, this is the solution with the lowest overhead.&lt;/p&gt;
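&lt;p&gt;As a sketch, this fallback can be a one-liner; the 4-bytes-per-token coefficient below is an assumed "magic number" that roughly holds for English text under BPE tokenizers, not a recommendation:&lt;/p&gt;

```python
def estimate_tokens(chunk: bytes, bytes_per_token: float = 4.0) -> int:
    """Crude fallback: estimate token count from byte count.

    The coefficient is a made-up heuristic; it drifts badly for
    non-English text, so treat the result as a rough floor only.
    """
    return max(1, round(len(chunk) / bytes_per_token))
```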

&lt;p&gt;The "official" approach is to call the model provider's own count token interface. For open-source inference engines like vLLM or TensorRT-LLM, there are corresponding tokenize interfaces. However, requiring the gateway to initiate multiple extra HTTP calls for every request is costly, especially when streaming responses.&lt;/p&gt;

&lt;p&gt;Some libraries provide local tokenization capabilities, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;huggingface/tokenizers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openai/tiktoken&lt;/code&gt; and its Go port: &lt;code&gt;pkoukk/tiktoken-go&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
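&lt;p&gt;For example, a hedged local counter might try &lt;code&gt;tiktoken&lt;/code&gt; and fall back to the byte heuristic when the library or encoding is unavailable; the encoding name and fallback coefficient here are assumptions:&lt;/p&gt;

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens locally with tiktoken when possible.

    Falls back to a rough ~4-bytes-per-token heuristic if tiktoken
    is not installed or the encoding name is unknown, so the gateway
    always gets some number to rate-limit on.
    """
    try:
        import tiktoken
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # tiktoken unavailable or unknown encoding: byte heuristic
        return max(1, len(text.encode("utf-8")) // 4)
```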

&lt;p&gt;However, these tokenizers need to know the model's tokenizer configuration to work, and model providers are highly unlikely to publish this data. That said, there are open-weight counterparts of these proprietary models on the market, such as Gemma for Gemini. It is unclear how far the tokenizer configurations of these open-weight versions diverge from the proprietary ones, or whether counts based on them approximate the official token-counting interface.&lt;/p&gt;

&lt;p&gt;If you are using a self-hosted model, theoretically, having the tokenizer configuration allows you to perform tokenization locally without relying on a remote tokenizer service.&lt;/p&gt;

&lt;p&gt;Assuming token usage is not provided locally but relies on remote results, for the sake of caution, it is best to add a rate limit based on &lt;strong&gt;request count&lt;/strong&gt; (or byte count, if available) alongside the token-based limit. This ensures that even if the remote end fails to return token usage, the system is not left completely undefended.&lt;/p&gt;
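&lt;p&gt;One possible shape for such a combined limiter, sketched with a simplified fixed window; the limits and window size are illustrative, not recommendations:&lt;/p&gt;

```python
import time

class DualRateLimiter:
    """Sketch: a token-based limit plus a request-count backstop.

    Even if the upstream never reports token usage (so tokens_used
    never grows), the request counter still advances, so the system
    is not left completely undefended.
    """

    def __init__(self, max_tokens=10000, max_requests=100, window=60.0):
        self.max_tokens = max_tokens
        self.max_requests = max_requests
        self.window = window
        self.window_start = time.monotonic()
        self.tokens_used = 0
        self.requests_used = 0

    def _maybe_reset(self):
        # Fixed-window reset; a sliding window would be smoother.
        if time.monotonic() - self.window_start >= self.window:
            self.window_start = time.monotonic()
            self.tokens_used = 0
            self.requests_used = 0

    def allow(self):
        """Called before proxying; the request-count limit always applies."""
        self._maybe_reset()
        if self.requests_used >= self.max_requests:
            return False
        if self.tokens_used >= self.max_tokens:
            return False
        self.requests_used += 1
        return True

    def record_usage(self, tokens):
        """Called after the response, only if usage was actually reported."""
        self.tokens_used += tokens
```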

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>api</category>
    </item>
    <item>
      <title>Beyond Simple Forwarding – Practical Content Safety in AI Gateways</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Wed, 03 Dec 2025 12:21:25 +0000</pubDate>
      <link>https://dev.to/spacewander/beyond-simple-forwarding-practical-content-safety-in-ai-gateways-545a</link>
      <guid>https://dev.to/spacewander/beyond-simple-forwarding-practical-content-safety-in-ai-gateways-545a</guid>
      <description>&lt;p&gt;(This article only discusses content safety for text generation, not multimodal.)&lt;/p&gt;

&lt;p&gt;Connecting AI inputs/outputs to a content‑safety filtering system is almost a must‑have feature for any AI gateway. For compliance, on the one hand, personal information in the context needs to be redacted; on the other hand, certain inappropriate statements need to be filtered out. Most content‑safety filtering systems on the market work similarly: they take a piece of text and return processing results (whether to filter, which rules were violated, what text needs to be replaced, etc.). In fact, an AI gateway can have a dedicated content‑safety subsystem placed on the proxy path. Different content‑safety vendors are just different providers for this subsystem; only the integration formats and configs differ, while the basic I/O can be reused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input
&lt;/h2&gt;

&lt;p&gt;LLM providers generally take JSON as input, so at the input stage you parse the JSON and extract the provider‑specific input fields.&lt;/p&gt;

&lt;p&gt;Before we dive in, let me briefly revisit the structure of a chat interface. A chat request looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system prompt # Optional, built‑in assumptions
---
user prompt   # Usually the user’s input
---
response to user prompt
---
user's next prompt
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some AI gateway products, by default, only inspect the latest prompt (in some products, that’s even the &lt;a href="https://github.com/Portkey-AI/gateway/blob/ac6ec2268c6259f225fe1c0f769fa104e5451ab3/plugins/javelin/guardrails.ts#L150" rel="noopener noreferrer"&gt;only supported behavior&lt;/a&gt;). This is actually not safe enough.&lt;/p&gt;

&lt;p&gt;For untrusted clients (for example, software running on users’ own machines), the entire conversation history is supplied by the attacker, so they can tamper with previous content. The same logic applies if you only check user prompts or only check content other than the system prompt.&lt;/p&gt;
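&lt;p&gt;A minimal sketch of collecting the whole history for inspection rather than only the newest message, assuming an OpenAI-style &lt;code&gt;messages&lt;/code&gt; array (other providers use different field names):&lt;/p&gt;

```python
import json

def texts_to_inspect(request_body: str):
    """Collect every message in the conversation for safety checking,
    not just the latest one. Non-string content (e.g. multimodal parts)
    is skipped in this sketch."""
    payload = json.loads(request_body)
    out = []
    for msg in payload.get("messages", []):
        content = msg.get("content")
        if isinstance(content, str):
            out.append((msg.get("role", "user"), content))
    return out
```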

&lt;p&gt;What if all calls come from trusted clients — for example, a backend service that always appends user inputs to the end of the conversation? Is it then safe to only inspect the latest prompt? Unfortunately, no. When the model performs inference, it does not only look at the user prompt or the newest prompt; it looks at the whole context. If the content‑safety filter only inspects the latest prompt, its field of view is too narrow and it can’t understand the context. For example:&lt;/p&gt;

&lt;p&gt;Assume your content‑safety rule disallows discussing the politics of a certain region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; A certain region, you know which

&amp;lt; Content not displayed due to relevant laws and regulations

&amp;gt; Tell me about the politics of the region mentioned earlier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the content‑safety system can only see the most recent message, it has no way of knowing which “region mentioned earlier” is being referred to, and thus can’t block the last request. Of course, developers can remove blocked content from the user’s message history to ensure safety. If your features depend on content filtering, it’s important to understand the boundaries of what this filtering can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output
&lt;/h2&gt;

&lt;p&gt;There are two forms of model text generation: streaming and non‑streaming. Suppose a service uses streaming responses and wants to integrate a content‑safety filtering system. If we simply convert it to non‑streaming (wait for all content to be generated, then call the content‑safety system), we may affect the service. For example, originally the user can see content being generated piece by piece; even if full generation takes a few minutes, they won’t feel impatient. After switching to non‑streaming, however, users have to wait several minutes to see any result at all, and might switch to a competitor instead. So why not just feed the streaming response directly into the content‑safety system? Because unsafe content might happen to be split right across two streaming chunks.&lt;/p&gt;

&lt;p&gt;Is there a compromise? Yes — by introducing a delay buffer.&lt;/p&gt;

&lt;p&gt;The core idea is: during a streaming response, maintain a buffer that stores the most recently generated content. When the buffer hits a certain size, or times out, or the request ends, you call the content‑safety system to check it. &lt;strong&gt;If no unsafe content is found, send everything in the buffer except for the last few characters to the user. Keep those trailing characters in the buffer to guard against unsafe content that spans chunk boundaries; they’ll be processed on the next check.&lt;/strong&gt; This approach preserves content safety while minimizing the impact on user experience. The underlying intuition is that unsafe content is typically local; it’s not like reading an O. Henry story where you only get a twist at the very end. As long as you retain and check a portion of the most recently generated content, you can effectively catch unsafe content. Specifically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive chunk 1: &lt;code&gt;xxx...xbad c&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run safety check on &lt;code&gt;xxx...xbad c&lt;/code&gt;, which passes
&lt;/li&gt;
&lt;li&gt;Send &lt;code&gt;xxx...&lt;/code&gt; to the user, keeping the trailing &lt;code&gt;xbad c&lt;/code&gt; in the buffer
&lt;/li&gt;
&lt;li&gt;Receive chunk 2: &lt;code&gt;ontent...yyy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Concatenate the buffer and chunk 2 to get &lt;code&gt;xbad content...yyy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run safety check on &lt;code&gt;xbad content...yyy&lt;/code&gt; and discover “bad content” is unsafe
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key is to choose an appropriate buffer size that can catch unsafe content that crosses chunk boundaries, without making the user wait too long. By adjusting buffer size and check frequency, you can strike a balance between content safety and user experience.&lt;/p&gt;
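&lt;p&gt;The scheme above can be sketched as a small generator; the &lt;code&gt;check&lt;/code&gt; callback stands in for the content‑safety system, and the sizes are illustrative only:&lt;/p&gt;

```python
def stream_with_buffer(chunks, check, tail=8, flush_size=32):
    """Sketch of the delay-buffer scheme:

    - accumulate incoming chunks in a buffer
    - when the buffer reaches flush_size (or the stream ends), run the
      safety check on the whole buffer
    - on success, release everything except the last `tail` characters,
      which stay behind to catch unsafe content spanning chunk boundaries

    `check(text)` returns True when the text is safe.
    """
    buf = ""
    for chunk in chunks:
        buf += chunk
        if len(buf) >= flush_size:
            if not check(buf):
                return  # stop the stream on unsafe content
            yield buf[:-tail]
            buf = buf[-tail:]
    # final flush at end of stream
    if buf:
        if not check(buf):
            return
        yield buf
```

&lt;p&gt;With a checker that rejects "bad content", the sketch reproduces the six-step walkthrough above: the phrase split across two chunks is reassembled in the buffer and caught on the second check.&lt;/p&gt;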

&lt;p&gt;Even if you ignore business impact, a buffer is still necessary. Content‑safety systems have an upper limit on the number of characters they can process per request. If the streaming response is too large and you send it directly to the content‑safety system, you might exceed its processing capacity. With a buffer, you can split a long streaming response into multiple smaller segments, send them for checking one by one, and avoid exceeding the system’s capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In content‑safety design for generative text, what looks like a simple “forward everything to a filter” actually hides quite a bit of nuance.&lt;/p&gt;

&lt;p&gt;On the input side: don’t only inspect the latest prompt — full context is what the model bases its decisions on.&lt;/p&gt;

&lt;p&gt;On the output side: introducing a buffer and retaining the last few characters for segmented checks is a pragmatic way to balance user experience and safety; at the same time you must tune buffer size, check frequency, and timeout strategy.&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>ai</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Agentgateway Review: A Feature-Rich New AI Gateway</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Tue, 02 Dec 2025 11:23:19 +0000</pubDate>
      <link>https://dev.to/spacewander/agentgateway-review-a-feature-rich-new-ai-gateway-53lm</link>
      <guid>https://dev.to/spacewander/agentgateway-review-a-feature-rich-new-ai-gateway-53lm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/agentgateway/agentgateway" rel="noopener noreferrer"&gt;agentgateway&lt;/a&gt; is a data plane developed by solo specifically for AI scenarios. The data plane is written in Rust and can be configured via xDS (a gRPC-based protocol) and YAML. Recently they decided to replace kgateway’s AI data plane from Envoy to agentgateway. I expect the enterprise version of Gloo will follow. Previously, most AI-related data-plane features were implemented in Envoy calling a Go sidecar via ext_proc, and I guess the real-world results were mediocre.&lt;/p&gt;

&lt;p&gt;This gateway supports four AI scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MCP
&lt;/li&gt;
&lt;li&gt;A2A
&lt;/li&gt;
&lt;li&gt;Proxying inference requests to LLM providers
&lt;/li&gt;
&lt;li&gt;Load balancing for inference services&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below I explain each of these scenarios. Note I’m discussing the open-source agentgateway — some features may exist only in the enterprise edition and are outside the scope of this doc.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP
&lt;/h2&gt;

&lt;p&gt;agentgateway was originally started to address the difficulty of handling stateful MCP requests in existing Envoy data planes. So its MCP support is the most complete.&lt;/p&gt;

&lt;p&gt;By default, agentgateway treats MCP as a stateful protocol. It has a SessionManager struct responsible for session creation and maintenance (&lt;a href="https://github.com/agentgateway/agentgateway/blob/1ca00e32d3f475539a20120f72c45c05aaf80377/crates/agentgateway/src/mcp/session.rs#L370" rel="noopener noreferrer"&gt;code link&lt;/a&gt;). But this SessionManager is a local in-process store, which means if you run multiple agentgateway instances there’s no guarantee a client will hit the same SessionManager each time. If you want sticky sessions toward upstreams, it’s actually simpler to consistent-hash on the &lt;code&gt;Mcp-Session-Id&lt;/code&gt; header so the same session ID routes to the same backend even if requests land on different agentgateway instances. Extending SessionManager to use a remote store is another solution, but it’s more expensive. To me, making MCP stateful by default is a mistake. I’m glad they plan to make MCP a &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1442" rel="noopener noreferrer"&gt;default stateless protocol&lt;/a&gt;.&lt;/p&gt;
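&lt;p&gt;For illustration, sticky routing on the session ID could be as simple as rendezvous hashing, which every gateway instance computes independently with no shared store; this is my sketch, not agentgateway’s implementation:&lt;/p&gt;

```python
import hashlib

def pick_backend(session_id: str, backends: list) -> str:
    """Rendezvous (highest-random-weight) hashing on the session ID.

    Any gateway instance computes the same answer for the same session,
    so the session consistently lands on one backend; removing a backend
    only remaps the sessions that were pinned to it.
    """
    def score(backend):
        h = hashlib.sha256((session_id + "|" + backend).encode()).hexdigest()
        return int(h, 16)
    return max(backends, key=score)
```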

&lt;p&gt;When there is more than one backend, agentgateway enables MCP multiplexing. For example, with tools: when listing tools, agentgateway sends tools/list to every backend, then rewrites tool names to the format &lt;code&gt;${backend_name}_${tool_name}&lt;/code&gt;. When a tool call comes in, agentgateway routes to the actual backend. For methods that can’t be multiplexed, it returns an invalid method error.&lt;/p&gt;
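&lt;p&gt;A toy sketch of this rewrite-and-route logic (not agentgateway’s actual code; it assumes backend names yield unambiguous prefixes):&lt;/p&gt;

```python
def merge_tool_lists(tool_lists):
    """Prefix each tool with its backend name, mirroring the
    ${backend_name}_${tool_name} format described above.
    tool_lists maps a backend name to its tools/list result."""
    merged = []
    for backend, tools in tool_lists.items():
        for tool in tools:
            merged.append(backend + "_" + tool)
    return merged

def route_tool_call(full_name, backends):
    """Find which backend owns a prefixed tool name and strip the prefix."""
    for backend in backends:
        if full_name.startswith(backend + "_"):
            return backend, full_name[len(backend) + 1:]
    raise ValueError("no backend for " + full_name)
```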

&lt;p&gt;Besides forwarding to MCP backends, agentgateway supports converting RESTful APIs to MCP tools using an OpenAPI spec. Impressively, it supports using an entire spec as a backend and includes a fair amount of schema-parsing code. agentgateway positions itself here as an MCP-to-RESTful-API forwarder; it does not itself manage the RESTful APIs described in the OpenAPI spec. Some details are still missing — for example, bodies only support application/json, HTTPS upstreams aren’t supported yet, structured output is not yet supported, etc. There are also finer points (e.g., handling of additionalProperties) I haven’t dug fully into.&lt;/p&gt;

&lt;p&gt;agentgateway implements OAuth-based MCP authentication. It exposes protected resource metadata at paths like &lt;code&gt;/.well-known/oauth-protected-resource/${resource}&lt;/code&gt;. However, if one host contains multiple resources, should each resource’s route-match config explicitly include that resource’s well-known path? Otherwise you can’t guarantee the request will route to the well-known path handler. One nice thing: agentgateway adds CORS headers to metadata responses, so when an MCP client runs in a browser (e.g., the MCP inspector) you don’t need to add a separate CORS middleware.&lt;/p&gt;

&lt;p&gt;agentgateway fetches public keys from a JWKS path to verify tokens were issued by the corresponding authorization server. There are two JWKS sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user supplies a URL or a file path.
&lt;/li&gt;
&lt;li&gt;The JWKS URL is derived from the issuer URL and issuer type.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code that gets public keys from JWKS appears to be called &lt;a href="https://github.com/agentgateway/agentgateway/blob/1ca00e32d3f475539a20120f72c45c05aaf80377/crates/agentgateway/src/types/local.rs#L1076" rel="noopener noreferrer"&gt;only when parsing configuration&lt;/a&gt;. So the JWKS does not seem to be periodically refreshed.&lt;/p&gt;

&lt;p&gt;Authorization is also implemented via OAuth. It uses a list of CEL expressions as filters, matching on JWT fields and MCP attributes. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcpAuthorization:
  rules:
  # Allow anyone to call 'echo'
  - 'mcp.tool.name == "echo"'
  # Only the test-user can call 'add'
  - 'jwt.sub == "test-user" &amp;amp;&amp;amp; mcp.tool.name == "add"'
  # Any authenticated user with the claim `nested.key == value` can access 'printEnv'
  - 'mcp.tool.name == "printEnv" &amp;amp;&amp;amp; jwt.nested.key == "value"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: in multiplexing scenarios, mcpAuthorization runs before the tool lists are merged, so the tool names here do not include the backend-name prefix.&lt;/p&gt;

&lt;p&gt;agentgateway provides surprisingly few MCP-related metrics — basically just an mcp_requests counter — so you can’t see details like which tools are taking the most time.&lt;/p&gt;

&lt;h2&gt;
  
  
  A2A
&lt;/h2&gt;

&lt;p&gt;For A2A protocol scenarios, agentgateway implements two main features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rewrites agent card URLs so they point to the gateway instead of the proxied backend.
&lt;/li&gt;
&lt;li&gt;Parses A2A JSON requests and records the request method for observability.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Proxying inference requests to LLM providers
&lt;/h2&gt;

&lt;p&gt;Like other AI gateways, agentgateway can proxy inference requests to LLM providers. This proxying is not just raw forwarding: it adds value such as token-based observability and rate-limiting.&lt;/p&gt;

&lt;p&gt;When proxying SSE traffic it collects token usage and TTFT metrics. For non-SSE streaming formats (e.g., Bedrock’s AWS event stream) it provides dedicated parsers.&lt;/p&gt;

&lt;p&gt;I’ll dive into rate limiting, prompt protection, and related features in a follow-up.&lt;/p&gt;

&lt;p&gt;Another common capability is to lift some LLM client features into the gateway to reduce integration work — for example, smoothing differences between providers and offering an OpenAI-compatible external API.&lt;/p&gt;

&lt;p&gt;agentgateway supports this to an extent. Its design is not a full generic "X provider to Y provider" converter; instead it implements conversions for specific routing types. Currently it supports two route types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenAI’s /v1/chat/completions
&lt;/li&gt;
&lt;li&gt;Anthropic’s /v1/messages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice both /v1/chat/completions and /v1/messages are chat-style routes: OpenAI’s /v1/chat/completions is functionally equivalent to Anthropic’s /v1/messages. They implemented both separately for quick business onboarding: many code agents only implement Anthropic’s /v1/messages, and special-casing that endpoint makes it easy to immediately accept such clients. Implementing a full Anthropic-to-any-provider converter would be a much larger effort.&lt;/p&gt;

&lt;p&gt;This area is currently roughly sufficient but incomplete. Putting aside support for embeddings, batching, etc., agentgateway does not fully support /v1/chat/completions yet — for example, &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;structured output&lt;/a&gt; is not supported at the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference Extension Support
&lt;/h2&gt;

&lt;p&gt;When the gateway API inference extension (&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener noreferrer"&gt;https://gateway-api-inference-extension.sigs.k8s.io/&lt;/a&gt;) first appeared I was skeptical. Distributed inference is a systems engineering problem; it feels presumptuous for a single scheduler implementation to try to become the standard. But with Red Hat driving the llm-d project and treating the inference extension as part of an out-of-the-box experience, the inference extension may gain traction. Red Hat has invested heavily in AI projects (e.g., vLLM) and has the resources to advance this work.&lt;/p&gt;

&lt;p&gt;Supporting the inference extension is actually not hard. The gateway needs to forward inference requests to a scheduler (called EPP in the inference extension) via Envoy’s gRPC ext_proc protocol. The scheduler’s response includes an x-gateway-destination-endpoint header that contains the target upstream address. The gateway then forwards the inference request to that address. Practically speaking the gateway is only doing forwarding here; the core logic lives in the scheduler. I’ve wondered: if the entire request is sent to the scheduler, why not let the scheduler process the request directly instead of having the gateway forward it? Is the scheduler only capable of handling input tokens and not output tokens?&lt;/p&gt;
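&lt;p&gt;The gateway’s side of this contract is small; a simplified sketch of resolving the scheduler-picked endpoint from the response headers might look like this:&lt;/p&gt;

```python
def resolve_upstream(scheduler_headers, default=None):
    """Extract the target the EPP scheduler picked from the
    x-gateway-destination-endpoint header (case-insensitive lookup),
    falling back to a default when the scheduler did not pick one.
    scheduler_headers is a list of (name, value) pairs."""
    for name, value in scheduler_headers:
        if name.lower() == "x-gateway-destination-endpoint":
            return value
    return default
```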

&lt;p&gt;What’s the value of a self-hosted inference system? I think it is to be reasonably cost-competitive with external LLM providers while satisfying data-security constraints. Large-model inference benefits greatly from economies of scale, so a self-hosted system is unlikely to beat cloud providers purely on price. To be more cost-effective you need scheduling innovations (e.g., better load balancing, more flexible disaggregated serving). If inference-extension support just means forwarding requests to the official scheduler, then the gateway isn’t adding meaningful value in that part of the chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, agentgateway is impressive for a project that’s been developed for only about half a year. Its feature richness stands out. It shows a clear focus on AI scenarios, and its ambition to rebuild the data plane in Rust (to replace the prior Envoy + Go external process approach) demonstrates strong intent and potential to address AI-specific protocols and performance needs.&lt;/p&gt;

&lt;p&gt;However, the documentation is incomplete: some implemented features (e.g., Anthropic /v1/messages support) &lt;a href="https://github.com/agentgateway/website/blob/02e25020b185ed34c66704d6274708a24ffe098d/content/docs/llm/about.md?plain=1#L30" rel="noopener noreferrer"&gt;aren’t documented&lt;/a&gt;, while some documented items don’t exist in the code (e.g., the MCP metric list_calls_total referenced in the docs: &lt;a href="https://github.com/agentgateway/website/blob/02e25020b185ed34c66704d6274708a24ffe098d/content/docs/mcp/mcp-observability.md?plain=1#L18" rel="noopener noreferrer"&gt;https://github.com/agentgateway/website/blob/02e25020b185ed34c66704d6274708a24ffe098d/content/docs/mcp/mcp-observability.md?plain=1#L18&lt;/a&gt;). Overall these are typical, understandable issues for a rapidly iterating early-stage open-source project and do not substantially detract from the project’s promise.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>architecture</category>
      <category>ai</category>
      <category>rust</category>
    </item>
    <item>
<title>eBPF Monthly Report - February 2023</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Mon, 27 Feb 2023 12:18:51 +0000</pubDate>
      <link>https://dev.to/spacewander/ebpf-yue-bao-2023-nian-2-yue-1gln</link>
      <guid>https://dev.to/spacewander/ebpf-yue-bao-2023-nian-2-yue-1gln</guid>
      <description>&lt;p&gt;本刊物旨在为中文用户提供及时、深入、有态度的 ebpf 资讯。&lt;/p&gt;

&lt;p&gt;If you enjoyed the egg and would like to meet the hen that laid it, you are welcome to follow&lt;br&gt;
the author’s Twitter: &lt;a href="https://twitter.com/spacewanderlzx" rel="noopener noreferrer"&gt;https://twitter.com/spacewanderlzx&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  bpftrace 0.17.0 released
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/iovisor/bpftrace/releases/tag/v0.17.0" rel="noopener noreferrer"&gt;https://github.com/iovisor/bpftrace/releases/tag/v0.17.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a gap of several months, bpftrace has released a new version, &lt;code&gt;0.17.0&lt;/code&gt;. This release allows &lt;a href="https://github.com/iovisor/bpftrace/pull/2387" rel="noopener noreferrer"&gt;integer arrays to be compared directly&lt;/a&gt; and adds support for the following architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LoongArch: &lt;a href="https://github.com/iovisor/bpftrace/pull/2466" rel="noopener noreferrer"&gt;https://github.com/iovisor/bpftrace/pull/2466&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ARM32: &lt;a href="https://github.com/iovisor/bpftrace/pull/2360" rel="noopener noreferrer"&gt;https://github.com/iovisor/bpftrace/pull/2360&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, one of the larger changes is support for kernel modules’ BTF files:&lt;br&gt;
&lt;a href="https://github.com/iovisor/bpftrace/pull/2315" rel="noopener noreferrer"&gt;https://github.com/iovisor/bpftrace/pull/2315&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;bpftrace already supported the kernel’s own BTF file; the new release extends this capability to kernel modules, taking the feature a step further.&lt;/p&gt;

&lt;p&gt;BTF is the debuginfo of the eBPF world. It bridges the gap between binaries and program source. For example, bpftool can dump the data in a BPF map. Without BTF annotating the data structures stored in the map, the dump would be just a pile of binary; with BTF, the information stored in the map becomes human-readable.&lt;/p&gt;

&lt;p&gt;As a tool in the tracing domain, bpftrace relies heavily on BTF. Without it, a bpftrace script sometimes has to define a kernel struct explicitly. For example, in &lt;a href="https://github.com/iovisor/bpftrace/blob/master/tools/dcsnoop.bt" rel="noopener noreferrer"&gt;https://github.com/iovisor/bpftrace/blob/master/tools/dcsnoop.bt&lt;/a&gt;, to make this code compile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $nd = (struct nameidata *)arg0;
    printf("%-8d %-6d %-16s R %s\n", elapsed / 1e6, pid, comm,
        str($nd-&amp;gt;last.name));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the relevant structs must be defined at the top of the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;linux/fs.h&amp;gt;
#include &amp;lt;linux/sched.h&amp;gt;

// from fs/namei.c:
struct nameidata {
        struct path     path;
        struct qstr     last;
        // [...]
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With BTF, the struct definitions from the kernel itself can be used naturally.&lt;/p&gt;

&lt;p&gt;Fortunately, recent kernels all ship with BTF. If yours unfortunately does not, try looking on &lt;a href="https://github.com/aquasecurity/btfhub" rel="noopener noreferrer"&gt;btfhub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wasm-bpf: Bridging Wasm and eBPF
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mp.weixin.qq.com/s/2InV7z1wcWic5ifmAXSiew" rel="noopener noreferrer"&gt;https://mp.weixin.qq.com/s/2InV7z1wcWic5ifmAXSiew&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wasm and eBPF are both technologies that have gained popularity in recent years. What sparks fly when the two are combined?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/eunomia-bpf/wasm-bpf" rel="noopener noreferrer"&gt;Wasm-bpf&lt;/a&gt; 这个项目给出了自己的答案。&lt;/p&gt;

&lt;p&gt;After a quick look and a discussion with the developers, I believe the project mainly pursues two goals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make the controller as cross-platform distributable as the eBPF program itself&lt;/li&gt;
&lt;li&gt;Allow the packaged Wasm code to serve as a plugin for a network proxy or an observability agent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my view, the future of Wasm-bpf depends largely on whether the Wasm ecosystem takes off. After all, of Wasm and eBPF, Wasm is the one that still lacks complex application scenarios. For example, if the packaged Wasm code wants to report data without relying on the Wasm host’s capabilities, it has to wait for features like WASI sockets, which are &lt;a href="https://github.com/bytecodealliance/wasmtime/issues/4276" rel="noopener noreferrer"&gt;still under development&lt;/a&gt;, to mature. So combining Wasm with eBPF today is still largely in a technology-accumulation stage.&lt;/p&gt;

&lt;p&gt;Honestly, even once Wasm support matures, the eBPF -&amp;gt; Wasm route is not the only option. For example, &lt;a href="https://github.com/cilium/ebpf/tree/master/cmd/bpf2go" rel="noopener noreferrer"&gt;bpf2go&lt;/a&gt; can embed eBPF programs into Go code, so users can already write and distribute eBPF plugins in Go, and in the future could take the eBPF -&amp;gt; Go -&amp;gt; Wasm route instead. (Let’s ignore for now the fact that Go does not support WASI; since our premise is that “Wasm support has matured,” we can allow ourselves some irresponsible daydreaming.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Exein Pulsar releases 0.5.0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Exein-io/pulsar/releases/tag/v0.5.0" rel="noopener noreferrer"&gt;https://github.com/Exein-io/pulsar/releases/tag/v0.5.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first glance I thought Apache Pulsar had crossed over into eBPF; a second look revealed two entirely unrelated projects that merely share a name. Exein’s Pulsar likewise uses the “pulsar” metaphor to describe an event stream, except its events are triggered by system calls in the deployment environment.&lt;/p&gt;

&lt;p&gt;Like many other eBPF-based observability tools, Pulsar adopts a “controller + eBPF module” architecture. Unlike most of its peers, however, Pulsar uses Rust as the controller language and loads eBPF with &lt;a href="https://github.com/aya-rs/aya" rel="noopener noreferrer"&gt;Aya&lt;/a&gt;. This choice is perhaps because the Exein team prefers Rust and their target environment is IoT.&lt;/p&gt;

&lt;p&gt;Pulsar uses a macro to wrap eBPF attach points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PULSAR_LSM_HOOK(path_mknod, struct path *, dir, struct dentry *, dentry,
                umode_t, mode, unsigned int, dev);
static __always_inline void on_path_mknod(void *ctx, struct path *dir,
                                          struct dentry *dentry, umode_t mode,
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The macro is defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#define PULSAR_LSM_HOOK(hook_point, args...)                                   \
  static __always_inline void on_##hook_point(void *ctx, TYPED_ARGS(args));    \
                                                                               \
  SEC("lsm/" #hook_point)                                                      \
  int BPF_PROG(hook_point, TYPED_ARGS(args), int ret) {                        \
    on_##hook_point(ctx, UNTYPED_ARGS(args));                                  \
    return ret;                                                                \
  }                                                                            \
                                                                               \
  SEC("kprobe/security_" #hook_point)                                          \
  int BPF_KPROBE(security_##hook_point, TYPED_ARGS(args)) {                    \
    on_##hook_point(ctx, UNTYPED_ARGS(args));                                  \
    return 0;                                                                  \
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the macro sets up two attach points for each function: one is the traditional BPF_PROG_TYPE_KPROBE, the other is the BPF_PROG_TYPE_LSM type introduced in Linux 5.7+.&lt;br&gt;
LSM (Linux Security Modules) is in fact a framework of hooks added to the relevant kernel functions; developers can use these hooks to add fine-grained security policies. The famous selinux and apparmor are both LSM implementations. The BPF_PROG_TYPE_LSM program type was designed to let developers write policy code in eBPF and attach it to the corresponding LSM hooks. Looking at the macro definition above, we can see that the function at the lsm attach point allows the eBPF code to return a &lt;code&gt;ret&lt;/code&gt; value. In a BPF_PROG_TYPE_LSM program, the developer can return an error code before the hooked function is invoked, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SEC("lsm/xxxxx")
int BPF_PROG(xxx, int ret)
{
  // the previous hook returned a non-zero value, meaning the call was already denied;
  // propagate the error code up
  if (ret) {
    return ret;
  }

  // apply some security policy
  if (!ok) {
    return -EPERM;
  }
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, as the macro definition above shows, Pulsar never actually sets the ret value. It only reports events for the key calls and makes no policy decisions. That is also why it can fall back to plain BPF_PROG_TYPE_KPROBE on older Linux versions.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, LSM is a set of hooks added to the kernel. These hooks follow a naming convention: they all start with &lt;code&gt;security_&lt;/code&gt;. So a BPF_PROG_TYPE_LSM attach point xxx corresponds exactly to the kernel function &lt;code&gt;security_xxx&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speeding Up delve trace with eBPF
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developers.redhat.com/articles/2023/02/13/how-debugging-go-programs-delve-and-ebpf-faster#the_inefficiencies_of_ptrace" rel="noopener noreferrer"&gt;https://developers.redhat.com/articles/2023/02/13/how-debugging-go-programs-delve-and-ebpf-faster#the_inefficiencies_of_ptrace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;delve is a Go debugger. Like strace, delve can trace Go function calls, and this feature is likewise implemented on top of the &lt;code&gt;ptrace&lt;/code&gt; system call.&lt;/p&gt;

&lt;p&gt;The article explains how they used eBPF to make tracing dramatically faster than before. The principle is simple: replace the ptrace system call with eBPF uprobes. With the frequent system calls gone, performance naturally improves.&lt;/p&gt;

&lt;p&gt;In the article, the author notes that the eBPF backend is experimental. Indeed, my experience with the eBPF backend was worse than with the original ptrace implementation. For example, under ptrace you can print the call stacks of the traced functions like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./go/bin/dlv trace -s 3 '.*Printf.*' --exec ./go/bin/dlv
...
&amp;gt; goroutine(1): fmt.(*pp).doPrintf((*fmt.pp)(0xc0000a6a90), "%%-%ds", []interface {} len: 824635347800, cap: 824635347800, [...])
        Stack:
                0  0x00000000004f91af in fmt.(*pp).doPrintf
                     at /usr/local/go/src/fmt/print.go:1021
                1  0x00000000004f3719 in fmt.Sprintf
                     at /usr/local/go/src/fmt/print.go:239
                2  0x0000000000962e3f in github.com/spf13/cobra.rpad
                     at ./go/pkg/mod/github.com/spf13/cobra@v1.1.3/cobra.go:153
                3  0x00000000004675a9 in runtime.call32
                     at :0
                (truncated)
        Stack:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The eBPF backend, however, currently cannot print call stacks. Without stack information, it is hard to tell whether a function is being called at the right moment. Moreover, in non-production environments the ptrace implementation is already fast enough. So the eBPF backend is of limited use for now: it is only suitable for checking whether a function is called in production, it has demanding environment requirements, and it is not as universally applicable as strace.&lt;/p&gt;

&lt;p&gt;If you just want to know whether a function gets called, bpftrace achieves the same effect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ bpftrace -e 'uprobe:./go/bin/dlv:"fmt.(*pp).doPrintf" {printf("%s\n", ustack(3));}' -c './go/bin/dlv exec ./go/bin/dlv'
...
fmt.(*pp).doPrintf+0
        github.com/go-delve/delve/pkg/terminal.New+2103
        github.com/go-delve/delve/cmd/dlv/cmds.connect+528
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wildcard form below comes closer to the earlier &lt;code&gt;dlv trace&lt;/code&gt; output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bpftrace -e 'uprobe:./go/bin/dlv:*Printf* {printf("%s\n", ustack(3));}' -c './go/bin/dlv exec ./go/bin/dlv'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Careful readers may have noticed that the command I run here has changed to &lt;code&gt;./go/bin/dlv exec ./go/bin/dlv&lt;/code&gt;. This is because bpftrace has a &lt;a href="https://github.com/iovisor/bpftrace/issues/246" rel="noopener noreferrer"&gt;bug&lt;/a&gt;: if the traced process exits before bpftrace does, some functions in the stack trace are shown only as addresses.&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Rambling about load balancing algorithms</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Sun, 12 Feb 2023 13:16:49 +0000</pubDate>
      <link>https://dev.to/spacewander/rambling-about-load-balancing-algorithms-19fd</link>
      <guid>https://dev.to/spacewander/rambling-about-load-balancing-algorithms-19fd</guid>
      <description>&lt;p&gt;Load balancing is a big topic and we can talk about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow start: assigning lower weights to newly added nodes to avoid overloading&lt;/li&gt;
&lt;li&gt;priority: different availability zones (AZs) have different priorities; nodes outside the current AZ are added only as backups when the nodes in the current AZ are unavailable&lt;/li&gt;
&lt;li&gt;subset: grouped load balancing; the algorithm first selects a group, then selects a specific node within that group&lt;/li&gt;
&lt;li&gt;retry: when load balancing meets retries, there are extra scenarios to consider. On retry, we usually need to select a different node rather than the current one again, and after all nodes have been tried, we usually do not retry for more than one round&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article will focus on the cornerstone of each of these features - the load balancing algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Random
&lt;/h2&gt;

&lt;p&gt;Random load balancing means that a node is selected at random. Since this approach is stateless, it is the easiest load balancing algorithm to implement. But that is an advantage for developers, not for users. Random load balancing only guarantees balance in mathematical expectation, not at the microscopic scale: several requests in a row can hit the same node, just as bad luck tends to strike in streaks. The black swan event is a shadow that random load balancing cannot erase. The only scenario where I recommend random load balancing is when it is the only option.&lt;/p&gt;

&lt;h2&gt;
  
  
  RoundRobin
&lt;/h2&gt;

&lt;p&gt;RoundRobin means that each node will be selected in turn. For scenarios where all nodes have the same weight, it is not difficult to implement RoundRobin. You just need to record the currently selected node and then select its next node the next time.&lt;/p&gt;
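&lt;p&gt;As a concrete illustration (a minimal sketch, not any particular proxy's implementation), the equal-weight case fits in a few lines of Go:&lt;/p&gt;

```go
package main

import "fmt"

// roundRobin returns a picker over equally weighted nodes: it records
// the index of the last selected node and advances it on each call.
func roundRobin(nodes []string) func() string {
	i := -1
	return func() string {
		i = (i + 1) % len(nodes)
		return nodes[i]
	}
}

func main() {
	pick := roundRobin([]string{"A", "B", "C"})
	for j := 0; j < 6; j++ {
		fmt.Print(pick(), " ")
	}
	fmt.Println() // prints "A B C A B C"
}
```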

&lt;p&gt;For scenarios where the weights differ, we need to consider how to keep the selected nodes balanced enough. Suppose there are two nodes A and B with weights 5 and 2. If we reuse the simple RoundRobin implementation from the equal-weight case, we get the following result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A B
A B
A
A
A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node A will be selected at most 4 times in a row (3 times at the end of the current round plus 1 time in the next round). Considering that the weight ratio of nodes A and B is 2.5:1, this behavior of selecting node A 4 times in a row is not commensurate with the weight ratio of the two nodes.&lt;/p&gt;

&lt;p&gt;Therefore, for this scenario, we need to go beyond simple node-by-node polling and make the nodes with different weights as balanced as possible at the micro level. Using the previous example of nodes A and B, a micro-level equilibrium distribution should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A
A B
A
A
A B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node A is selected at most 3 times in a row, which is not very different from the weight ratio of 2.5:1.&lt;/p&gt;

&lt;p&gt;When implementing weighted RoundRobin, please do not invent a new algorithm if you can avoid it. Weighted RoundRobin implementations are error-prone: one may work fine in local development and run OK online for a while, until a user inputs a special set of values and the imbalance appears. Consult the mainstream implementations, and if you need to adjust one of them, it is best to provide a mathematical proof.&lt;/p&gt;

&lt;p&gt;Next, let's look at how the mainstream implementations -- Nginx and Envoy -- do it.&lt;/p&gt;

&lt;p&gt;The Nginx implementation looks roughly like this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;each node has its own current score. Each time you select a node, you iterate through the nodes and add a value to the score that is related to the weight of the node.&lt;/li&gt;
&lt;li&gt;the node with the highest score is selected each time.&lt;/li&gt;
&lt;li&gt;the sum of all weights is subtracted from the score when the node is selected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The higher the weight of the node, the faster it recovers after subtracting the score, and the more likely it is to continue to be selected. And there is a recovery process here, which ensures that the same node is unlikely to be selected the next time.&lt;/p&gt;

&lt;p&gt;The actual code is complicated by being coupled with passive health checks (there are multiple weights, and effective_weight needs to be adjusted according to max_fails). Since the specific implementation of Nginx is not the focus of this article, interested readers can check it out for themselves.&lt;/p&gt;
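&lt;p&gt;The three steps can be sketched in Go (a simplified model of the description above; the effective_weight adjustment for health checks is omitted):&lt;/p&gt;

```go
package main

import "fmt"

// node is a toy model of Nginx's smooth weighted round-robin state.
type node struct {
	name    string
	weight  int // configured weight
	current int // running score
}

// smoothWRR adds each node's weight to its score, picks the node with
// the highest score, then subtracts the sum of all weights from the
// winner so it has to "recover" before being picked again.
func smoothWRR(nodes []*node) *node {
	total := 0
	var best *node
	for _, n := range nodes {
		n.current += n.weight
		total += n.weight
		if best == nil || n.current > best.current {
			best = n
		}
	}
	best.current -= total
	return best
}

func main() {
	nodes := []*node{{name: "A", weight: 5}, {name: "B", weight: 2}}
	for i := 0; i < 7; i++ {
		fmt.Print(smoothWRR(nodes).name)
	}
	fmt.Println() // prints "ABAAABA": A never runs more than 3 times in a row
}
```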

&lt;p&gt;Envoy's implementation is a bit clearer. It is based on a simplified version of the &lt;a href="https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling" rel="noopener noreferrer"&gt;EDF algorithm&lt;/a&gt; to do node selection. In short, it uses a priority queue to select the current best node. For each node, we record two values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;deadline&lt;/code&gt;: the next time the node needs to be taken out&lt;/li&gt;
&lt;li&gt;&lt;code&gt;last_popped_time&lt;/code&gt;: the last time this node was taken out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Envoy's implementation code is a bit different from this. Here we use &lt;code&gt;last_popped_time&lt;/code&gt; instead of &lt;code&gt;offset_order&lt;/code&gt; in Envoy for the purpose of easy understanding)&lt;/p&gt;

&lt;p&gt;Again, take our A and B nodes as an example.&lt;/p&gt;

&lt;p&gt;Nodes A and B are given 1/weight as their respective scores. The algorithm runs as follows.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a priority queue is constructed, ordered by deadline first, with last_popped_time as the tie-breaker when deadlines are equal; each node's initial deadline is its score&lt;/li&gt;
&lt;li&gt;each time a node is selected, the front entry is popped from the priority queue&lt;/li&gt;
&lt;li&gt;the popped node's last_popped_time is updated to its deadline at the time of selection, its score is added to its deadline, and it is reinserted into the queue&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each selection is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;round&lt;/th&gt;
&lt;th&gt;A deadline&lt;/th&gt;
&lt;th&gt;B deadline&lt;/th&gt;
&lt;th&gt;A last_popped_time&lt;/th&gt;
&lt;th&gt;B last_popped_time&lt;/th&gt;
&lt;th&gt;Selected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;1/5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;2/5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3/5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;6/5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, with the EDF algorithm, node A is selected at most 3 times in a row (1 at the end of the current loop plus 2 at the next loop), which is not much different from the weight ratio of 2.5:1. In addition, compared to Nginx's algorithm, the time complexity of node selection under EDF is mainly the &lt;code&gt;O(log n)&lt;/code&gt; reinsertion, which is faster than comparing scores node by node when there are a large number of nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Least Request
&lt;/h2&gt;

&lt;p&gt;The Least Request algorithm, also known as the Least Connection algorithm, comes from the early days when each request often corresponded to one connection and was often used for load balancing long connections. If the load of the service is closely related to the number of current requests, for example, in a push service where the number of connections managed by each node is expected to be balanced, then an ideal choice would be to use the least request algorithm. Alternatively, if the requests take a long time and vary in length, using the least request algorithm also ensures that the number of requests to be prepared for processing on each node is balanced to avoid long queues. For this case, the EWMA algorithm mentioned later is also suitable.&lt;/p&gt;

&lt;p&gt;To implement the least request algorithm, we need to keep track of the current number of requests on each node: the count is incremented when a request comes in and decremented when it ends. When all nodes have the same weight, an &lt;code&gt;O(n)&lt;/code&gt; traversal finds the least-requested node. We can optimize further: with the &lt;a href="https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf" rel="noopener noreferrer"&gt;P2C&lt;/a&gt; algorithm, we randomly select two nodes at a time and achieve an effect close to the full &lt;code&gt;O(n)&lt;/code&gt; traversal, with only &lt;code&gt;O(1)&lt;/code&gt; time complexity. In fact, the P2C algorithm can optimize the time complexity for any case that satisfies the following conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;each node has a score&lt;/li&gt;
&lt;li&gt;all nodes have the same weight&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So &lt;a href="https://docs.rs/tower/latest/tower/balance/p2c/index.html" rel="noopener noreferrer"&gt;some frameworks&lt;/a&gt; will directly abstract a &lt;code&gt;p2c&lt;/code&gt; middleware as a generic capability.&lt;/p&gt;
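&lt;p&gt;A toy Go sketch of the unweighted P2C selection (the node struct, the simulation loop, and the "one request drains every third pick" rule are all invented for this demo; a real implementation would also force the two samples to be distinct):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
)

// node tracks the in-flight request count used by least-request.
type node struct {
	name     string
	inflight int
}

// p2c ("power of two choices"): sample two nodes at random and return
// the one with fewer in-flight requests -- O(1) instead of an O(n) scan.
func p2c(nodes []*node, rng *rand.Rand) *node {
	a := nodes[rng.Intn(len(nodes))]
	b := nodes[rng.Intn(len(nodes))]
	if b.inflight < a.inflight {
		return b
	}
	return a
}

// simulate runs iters selections; every third pick, one request
// "finishes" on each busy node, so the in-flight counts stay meaningful.
func simulate(iters int) map[string]int {
	rng := rand.New(rand.NewSource(1))
	nodes := []*node{{name: "A"}, {name: "B"}, {name: "C"}}
	counts := map[string]int{}
	for i := 0; i < iters; i++ {
		n := p2c(nodes, rng)
		n.inflight++ // a request starts on the chosen node
		counts[n.name]++
		if i%3 == 2 { // drain one request from every busy node
			for _, m := range nodes {
				if m.inflight > 0 {
					m.inflight--
				}
			}
		}
	}
	return counts
}

func main() {
	// the three counts come out close to each other
	fmt.Println(simulate(9000))
}
```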

&lt;p&gt;The P2C algorithm cannot be used when nodes have different weights. Instead, we can adjust the weight according to the current number of requests, making it &lt;code&gt;weight / (1 + number of requests)&lt;/code&gt;. The more requests a node is handling, the more its current weight is reduced. For example, if a node with weight 2 has 3 requests, its adjusted weight is 1/2. If a new request comes in, the weight becomes 2/5. By dynamically adjusting the weights, we turn weighted least request into weighted RoundRobin, which can then be handled with a traversal or a priority queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hash
&lt;/h2&gt;

&lt;p&gt;There are times when a client needs guaranteed access to a fixed server: for example, requests from clients of the same session must be proxied to the same node, or requests must be routed to a fixed node based on the client IP. In this case we need a Hash algorithm to map the client's features to a node. However, a simple Hash has a problem: if the number of nodes changes, the number of affected requests is amplified.&lt;/p&gt;

&lt;p&gt;Suppose this simple Hash takes the request number modulo the number of nodes, and the requests are numbered 0 to 9. The number of nodes starts at 4 and then becomes 3. The result is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1: 0 1 2 3 0 1 2 3 0 1
2: 0 1 2 0 1 2 0 1 2 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that 70% of the requests correspond to nodes that have changed, much more than the 25% change in the number of nodes.&lt;/p&gt;
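&lt;p&gt;This churn is easy to verify with a few lines of Go (keys 0 through 9, reproducing the table above):&lt;/p&gt;

```go
package main

import "fmt"

// churn counts how many keys map to a different node after the node
// count changes, under simple modulo hashing.
func churn(keys []int, oldN, newN int) int {
	moved := 0
	for _, k := range keys {
		if k%oldN != k%newN {
			moved++
		}
	}
	return moved
}

func main() {
	keys := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9} // the ten requests from the table
	fmt.Printf("%d of %d keys moved\n", churn(keys, 4, 3), len(keys))
	// prints "7 of 10 keys moved"
}
```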

&lt;p&gt;So in practice, we use Consistent Hash more often, and only consider the general Hash algorithm if it is not available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistent Hash
&lt;/h2&gt;

&lt;p&gt;Consistent Hash is an algorithm designed to reduce the number of significant changes in the result when re-hashing. In the previous Hash algorithm, since the result of the Hash is strongly correlated with the number of nodes, once the number of nodes changes, the Hash result changes drastically. So can we make the Hash result independent of the number of nodes? Consistent Hash provides us with a new idea.&lt;/p&gt;

&lt;p&gt;The most common consistent Hash algorithm is ring hash. The entire Hash space is treated as a ring, each node is mapped to a point on the ring by the Hash algorithm, and for each request a Hash value is calculated and the nearest node clockwise from that value is chosen. This way, a request's Hash value has no relationship to the number of nodes: when the nodes on the ring change, the request's Hash value stays the same, and only the nearest node may differ.&lt;/p&gt;

&lt;p&gt;The reader may ask: if a node's position depends on its Hash value, how can we ensure a balanced distribution? Hash algorithms are designed to reduce collisions, and a high-quality algorithm should spread its outputs as evenly as possible. Of course, if the Hash is computed over only a handful of nodes, the results will inevitably not be spread out enough. Therefore, Consistent Hash introduces the concept of virtual nodes. Each real node corresponds to N virtual nodes, say 100, and the Hash value of each virtual node is obtained by something like &lt;code&gt;Hash(node + "_" + virtual_node_id)&lt;/code&gt;. Thus one real node corresponds to N virtual nodes on the Hash ring. Statistically, as long as N is large enough, the standard deviation of the distances between nodes shrinks, and the distribution of nodes on the ring becomes more balanced.&lt;/p&gt;

&lt;p&gt;However, N cannot grow without bound. Even virtual nodes on the ring need real memory to record their locations: the larger N is, the more balanced the nodes are, but the more memory is consumed. The &lt;a href="https://blog.memcachier.com/2017/09/01/maglev-our-new-consistent-hashing-scheme/" rel="noopener noreferrer"&gt;Maglev algorithm&lt;/a&gt; is another consistent Hash algorithm, designed to optimize memory consumption. Thanks to a different data structure, it uses less memory while guaranteeing the same balance (or provides better balance with the same memory, depending on which side you hold constant).&lt;/p&gt;

&lt;h2&gt;
  
  
  EWMA
&lt;/h2&gt;

&lt;p&gt;The EWMA (Exponential Weighted Moving Average) algorithm is an algorithm that uses response time for load balancing. As the name suggests, it is calculated as an "exponentially weighted moving average".&lt;/p&gt;

&lt;p&gt;Suppose the current response time is R, the time since the last visit is delta_time, and the score at the last visit was S1. Then the current score &lt;code&gt;S2&lt;/code&gt; is:&lt;br&gt;
&lt;code&gt;S2 = S1 * weight + R * (1.0 - weight)&lt;/code&gt;, where &lt;code&gt;weight = e^(-delta_time/k)&lt;/code&gt;. k is a constant fixed in advance by the algorithm.&lt;/p&gt;

&lt;p&gt;It is Exponential Weighted: the longer the last visit is from the present, the less it affects the current score.&lt;br&gt;
It is Moving: the current score is adjusted from the last score.&lt;br&gt;
It is Average: if delta_time is large enough, weight is small enough and the score is close to the current response time; if delta_time is small enough, weight is large enough and the score is close to the last score. Overall, the score is the result of adjusting the response time over time.&lt;/p&gt;
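&lt;p&gt;The score update can be sketched in Go (a minimal model of the formula above; the constant k = 10 and the timestamps are arbitrary demo values):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// ewma holds one exponentially weighted moving average score per node.
// k is the pre-fixed constant from the formula.
type ewma struct {
	score    float64
	lastSeen float64 // time of the previous observation
	k        float64
}

// observe folds a response time r observed at time now into the score:
// S2 = S1*weight + R*(1-weight), weight = e^(-delta_time/k).
// The longer since the last observation, the less the old score counts.
func (e *ewma) observe(r, now float64) float64 {
	w := math.Exp(-(now - e.lastSeen) / e.k)
	e.score = e.score*w + r*(1.0-w)
	e.lastSeen = now
	return e.score
}

func main() {
	e := &ewma{k: 10}
	e.observe(100, 1) // a slow response
	e.observe(100, 2)
	// After a long quiet period the old score has decayed almost
	// entirely, so the next observation dominates: close to 10.
	fmt.Println(e.observe(10, 200))
}
```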

&lt;p&gt;Careful readers will ask: since the weight is calculated from delta_time, where does the user-configured weight fit in? EWMA is an adaptive algorithm that dynamically adjusts to the upstream state. If you find that you need to configure weights, your scenario is not suitable for EWMA. In fact, since the EWMA algorithm does not concern itself with weights, many people consider it a replacement for the slow start feature.&lt;/p&gt;

&lt;p&gt;But EWMA is not a panacea. Since EWMA is a response time based algorithm, it does not work if the upstream response time has little to do with the upstream node, such as the push scenario mentioned earlier in the introduction of the least request algorithm, where the response time depends on the push policy, which is not a good match for EWMA.&lt;/p&gt;

&lt;p&gt;In addition, the EWMA algorithm has an inherent flaw: response time does not necessarily reflect the full picture. Imagine a scenario where an upstream node keeps failing fast with 500 status codes. In the eyes of the EWMA algorithm, this node is excellent: after all, it has unparalleled response times. As a result, the majority of the traffic will hit this node. So when you use EWMA, be sure to also turn on health checks and take problematic nodes offline in time. Sometimes a 4xx status code can also cause traffic imbalance. For example, during a canary upgrade, an incorrect check added to the new version rejects some correct requests on the production environment (returning a 400 status code). Since EWMA favors more responsive nodes, more requests will fall on this faulty version.&lt;/p&gt;

</description>
      <category>loadbalancer</category>
    </item>
    <item>
      <title>Turning Rainbow into Bridge - How Nginx Proxies UDP "Connections"</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Sun, 29 Jan 2023 12:42:24 +0000</pubDate>
      <link>https://dev.to/spacewander/turning-rainbow-into-bridge-how-nginx-proxies-udp-connections-54af</link>
      <guid>https://dev.to/spacewander/turning-rainbow-into-bridge-how-nginx-proxies-udp-connections-54af</guid>
      <description>&lt;p&gt;As you know, UDP is not connection-based like TCP. However, there are times when we need to send multiple UDPs to a fixed address to complete a UDP request. In order to ensure that the server knows that these UDP packets constitute the same session, we need to bind a port when sending UDP packets so that those UDP packets can be separated together when the network stack is differentiated by a five-tuple (protocol, client IP, client port, server IP, server port). Normally we would call this phenomenon a UDP connection.&lt;/p&gt;

&lt;p&gt;But then there is a new problem. Unlike TCP, with its handshake to open and wave to close, a UDP connection simply means using a fixed client port. As the server, you know where a UDP connection should terminate, because you and the client agreed on a protocol in advance. But when a proxy server sits in the middle, how does the proxy tell that certain UDP packets belong to a certain UDP connection? After all, without the handshake and the wave as separators, an intermediary does not know where to put a period on a session.&lt;/p&gt;

&lt;p&gt;We'll see how Nginx handles this problem in the following experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiments
&lt;/h2&gt;

&lt;p&gt;For the next few experiments, I'll be using a fixed client. This client will establish a UDP 'connection' to the address Nginx is listening to, and then send 100 UDP packets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// save it as main.go, and run it like `go run main.go`&lt;/span&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"net"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"udp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"127.0.0.1:1994"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Dial err %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"H"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Write err %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic configuration
&lt;/h3&gt;

&lt;p&gt;The following is the basic Nginx configuration used in the experiments. Subsequent experiments will build on this base.&lt;/p&gt;

&lt;p&gt;In this configuration, Nginx will have four worker processes listening on port 1994 and proxying to port 1995. Error logs will be sent to stderr, and access logs will be sent to stdout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;worker_processes&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;daemon&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;error_log&lt;/span&gt; &lt;span class="n"&gt;/dev/stderr&lt;/span&gt; &lt;span class="s"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kn"&gt;worker_connections&lt;/span&gt; &lt;span class="mi"&gt;10240&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;stream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;log_format&lt;/span&gt; &lt;span class="s"&gt;basic&lt;/span&gt; &lt;span class="s"&gt;'[&lt;/span&gt;&lt;span class="nv"&gt;$time_local&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;
                 &lt;span class="s"&gt;'received:&lt;/span&gt; &lt;span class="nv"&gt;$bytes_received&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;
                 &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="nv"&gt;$session_time&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;1994&lt;/span&gt; &lt;span class="s"&gt;udp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;access_log&lt;/span&gt; &lt;span class="n"&gt;/dev/stdout&lt;/span&gt; &lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;preread_by_lua_block&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;ngx.log&lt;/span&gt;&lt;span class="s"&gt;(ngx.ERR,&lt;/span&gt; &lt;span class="s"&gt;ngx.worker.id(),&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="s"&gt;",&lt;/span&gt; &lt;span class="s"&gt;ngx.var.remote_port)&lt;/span&gt;
        &lt;span class="err"&gt;}&lt;/span&gt;
        &lt;span class="s"&gt;proxy_pass&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1995&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_timeout&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;1995&lt;/span&gt; &lt;span class="s"&gt;udp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023/01/27 18:00:59 [error] 6996#6996: *2 stream [lua] preread_by_lua(nginx.conf:48):2: 1 51933 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
2023/01/27 18:00:59 [error] 6995#6995: *4 stream [lua] preread_by_lua(nginx.conf:48):2: 0 51933 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
2023/01/27 18:00:59 [error] 6997#6997: *1 stream [lua] preread_by_lua(nginx.conf:48):2: 2 51933 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
2023/01/27 18:00:59 [error] 6998#6998: *3 stream [lua] preread_by_lua(nginx.conf:48):2: 3 51933 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:01:09 +0800] received: 28 10.010
[27/Jan/2023:18:01:09 +0800] received: 27 10.010
[27/Jan/2023:18:01:09 +0800] received: 23 10.010
[27/Jan/2023:18:01:09 +0800] received: 22 10.010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the 100 UDP packets are spread across the worker processes. Nginx does not treat 100 packets from the same address as a single session; after all, every worker process reads UDP data.&lt;/p&gt;

&lt;h3&gt;
  
  
  reuseport
&lt;/h3&gt;

&lt;p&gt;To have Nginx proxy UDP packets as whole sessions, you need to specify reuseport when you listen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;    &lt;span class="k"&gt;...&lt;/span&gt;
    &lt;span class="s"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;1994&lt;/span&gt; &lt;span class="s"&gt;udp&lt;/span&gt; &lt;span class="s"&gt;reuseport&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;access_log&lt;/span&gt; &lt;span class="n"&gt;/dev/stdout&lt;/span&gt; &lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all UDP packets will fall on the same process and be counted as one session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023/01/27 18:02:39 [error] 7191#7191: *1 stream [lua] preread_by_lua(nginx.conf:48):2: 3 55453 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:02:49 +0800] received: 100 10.010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When multiple processes listen on the same address with reuseport set, Linux decides which process a packet goes to based on a hash of the five-tuple. This way, all packets belonging to the same UDP connection land on the same process.&lt;/p&gt;

&lt;p&gt;By the way, if you print the client address of the accepted UDP connection on the server on port 1995 (i.e., the address where Nginx communicates with the upstream), you will see that the address is the same for the same session. That is, when Nginx proxies to an upstream, it uses a UDP connection to pass the entire session by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  proxy_xxx directives
&lt;/h3&gt;

&lt;p&gt;As the reader may have noticed, the start time of the UDP access recorded in the error log and the end time recorded in the access log are exactly 10 seconds apart. This interval corresponds to the &lt;code&gt;proxy_timeout 10s;&lt;/code&gt; in the configuration. Since UDP has no closing handshake, Nginx by default decides that a session has terminated when it times out. The default session duration is 10 minutes; I set it to 10 seconds only because of my lack of patience.&lt;/p&gt;

&lt;p&gt;Besides the timeout, what other conditions does Nginx rely on to determine session termination? Please read on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;        &lt;span class="k"&gt;...&lt;/span&gt;
        &lt;span class="s"&gt;proxy_timeout&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;proxy_responses&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding &lt;code&gt;proxy_responses 1&lt;/code&gt;, the output looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023/01/27 18:07:35 [error] 7552#7552: *1 stream [lua] preread_by_lua(nginx.conf:48):2: 2 36308 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:07:35 +0800] received: 62 0.003
2023/01/27 18:07:35 [error] 7552#7552: *65 stream [lua] preread_by_lua(nginx.conf:48):2: 2 36308 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:07:35 +0800] received: 9 0.000
2023/01/27 18:07:35 [error] 7552#7552: *76 stream [lua] preread_by_lua(nginx.conf:48):2: 2 36308 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:07:35 +0800] received: 7 0.000
2023/01/27 18:07:35 [error] 7552#7552: *85 stream [lua] preread_by_lua(nginx.conf:48):2: 2 36308 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:07:35 +0800] received: 3 0.000
2023/01/27 18:07:35 [error] 7552#7552: *90 stream [lua] preread_by_lua(nginx.conf:48):2: 2 36308 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:07:35 +0800] received: 19 0.000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that Nginx no longer passively waits for the timeout, but terminates the session as soon as it receives the expected packet from the upstream. &lt;code&gt;proxy_timeout&lt;/code&gt; and &lt;code&gt;proxy_responses&lt;/code&gt; are combined with "or" semantics: whichever condition is met first ends the session.&lt;/p&gt;

&lt;p&gt;The counterpart of &lt;code&gt;proxy_responses&lt;/code&gt; is &lt;code&gt;proxy_requests&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;        &lt;span class="k"&gt;...&lt;/span&gt;
        &lt;span class="s"&gt;proxy_timeout&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;proxy_responses&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;proxy_requests&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After configuring &lt;code&gt;proxy_requests 50&lt;/code&gt;, we see that each session consistently covers 50 UDP packets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023/01/27 18:08:55 [error] 7730#7730: *1 stream [lua] preread_by_lua(nginx.conf:48):2: 0 49881 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
2023/01/27 18:08:55 [error] 7730#7730: *11 stream [lua] preread_by_lua(nginx.conf:48):2: 0 49881 while prereading client data, udp client: 127.0. 0.0.1, server: 0.0.0.0:1994
[27/Jan/2023:18:08:55 +0800] received: 50 0.002
[27/Jan/2023:18:08:55 +0800] received: 50 0.001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the number of upstream UDP responses needed for the session to terminate is &lt;code&gt;proxy_requests * proxy_responses&lt;/code&gt;. In the example above, if we change &lt;code&gt;proxy_responses&lt;/code&gt; to 2, it takes 10 seconds before the session terminates: every 50 request packets now need 100 response packets to end the session, but each request packet gets only one reply, so we have to wait for the timeout.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Proxy
&lt;/h3&gt;

&lt;p&gt;Most of the time, the number of packets in a UDP request is not fixed, and we may have to determine the number of packets in a session based on a length field at the beginning, or determine when to end the current session by whether a packet has an eof flag in the header. Several of Nginx's &lt;code&gt;proxy_*&lt;/code&gt; directives currently only support fixed values, and do not support dynamic settings with variables.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;proxy_requests&lt;/code&gt; and &lt;code&gt;proxy_responses&lt;/code&gt; actually just set the corresponding counters on the UDP session. So theoretically, we could modify Nginx to expose an API to dynamically adjust the value of the current UDP session's counters, enabling contextual determination of UDP request boundaries. Is there a solution to this problem without modifying Nginx?&lt;/p&gt;

&lt;p&gt;Let's think about it another way. Can we read out all the client-side data via Lua and send it to the upstream from a cosocket at the Lua level? The idea of implementing an upstream proxy via Lua is really imaginative, but unfortunately it doesn't work at the moment.&lt;/p&gt;

&lt;p&gt;Instead of the previous &lt;code&gt;preread_by_lua_block&lt;/code&gt;, use the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;        &lt;span class="k"&gt;content_by_lua_block&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;local&lt;/span&gt; &lt;span class="s"&gt;sock&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;ngx.req.socket()&lt;/span&gt;
            &lt;span class="s"&gt;while&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt; &lt;span class="s"&gt;do&lt;/span&gt;
                &lt;span class="s"&gt;local&lt;/span&gt; &lt;span class="s"&gt;data,&lt;/span&gt; &lt;span class="s"&gt;err&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;sock:receive()&lt;/span&gt;

                &lt;span class="s"&gt;if&lt;/span&gt; &lt;span class="s"&gt;not&lt;/span&gt; &lt;span class="s"&gt;data&lt;/span&gt; &lt;span class="s"&gt;then&lt;/span&gt;
                    &lt;span class="s"&gt;if&lt;/span&gt; &lt;span class="s"&gt;err&lt;/span&gt; &lt;span class="s"&gt;and&lt;/span&gt; &lt;span class="s"&gt;err&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"no&lt;/span&gt; &lt;span class="s"&gt;more&lt;/span&gt; &lt;span class="s"&gt;data"&lt;/span&gt; &lt;span class="s"&gt;then&lt;/span&gt;
                        &lt;span class="s"&gt;ngx.log(ngx.ERR,&lt;/span&gt; &lt;span class="s"&gt;err)&lt;/span&gt;
                    &lt;span class="s"&gt;end&lt;/span&gt;
                    &lt;span class="s"&gt;return&lt;/span&gt;
                &lt;span class="s"&gt;end&lt;/span&gt;
                &lt;span class="s"&gt;ngx.log(ngx.WARN,&lt;/span&gt; &lt;span class="s"&gt;"message&lt;/span&gt; &lt;span class="s"&gt;received:&lt;/span&gt; &lt;span class="s"&gt;",&lt;/span&gt; &lt;span class="s"&gt;data)&lt;/span&gt;
            &lt;span class="s"&gt;end&lt;/span&gt;
        &lt;span class="err"&gt;}&lt;/span&gt;
        &lt;span class="s"&gt;proxy_timeout&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_responses&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_requests&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will see output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023/01/27 18:17:56 [warn] 8645#8645: *1 stream [lua] content_by_lua(nginx.conf:59):12: message received: H, udp client: 127.0.0.1, server: 0.0. 0.0.0:1994
[27/Jan/2023:18:17:56 +0800] received: 1 0.000
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under UDP, &lt;code&gt;ngx.req.socket:receive&lt;/code&gt; currently only supports reading the first packet, so even with a &lt;code&gt;while true&lt;/code&gt; loop we cannot get all the client's packets. Moreover, since &lt;code&gt;content_by_lua&lt;/code&gt; overrides the &lt;code&gt;proxy_*&lt;/code&gt; directives, Nginx skips the proxy logic and assumes the current request consists of a single packet. If we change &lt;code&gt;content_by_lua&lt;/code&gt; to &lt;code&gt;preread_by_lua&lt;/code&gt;, the &lt;code&gt;proxy_*&lt;/code&gt; directives take effect again, but proxying at the Lua level is still impossible, because we still cannot read all the client's packets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If Nginx is proxying a single-packet UDP-based protocol like DNS, then using &lt;code&gt;listen udp&lt;/code&gt; is sufficient. However, to proxy UDP-based protocols that contain multiple packets per session, you also need to add &lt;code&gt;reuseport&lt;/code&gt;. In addition, Nginx does not yet support dynamically setting the size of each UDP session, so there is no way to accurately delimit different UDP sessions. The features Nginx offers when proxying UDP protocols are therefore better suited to tasks that do not require attention to individual UDP sessions, such as rate limiting.&lt;/p&gt;

</description>
      <category>nginx</category>
      <category>udp</category>
    </item>
    <item>
      <title>eBPF Monthly Report - January 2023</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Thu, 26 Jan 2023 12:08:26 +0000</pubDate>
      <link>https://dev.to/spacewander/ebpf-yue-bao-2023-nian-1-yue-1ggj</link>
      <guid>https://dev.to/spacewander/ebpf-yue-bao-2023-nian-1-yue-1ggj</guid>
      <description>&lt;p&gt;This publication aims to bring Chinese-speaking users timely, in-depth, and opinionated ebpf news.&lt;/p&gt;

&lt;p&gt;If you enjoyed the eggs and would like to meet the hen that laid them, you are welcome to follow:&lt;br&gt;
My Twitter: &lt;a href="https://twitter.com/spacewanderlzx" rel="noopener noreferrer"&gt;https://twitter.com/spacewanderlzx&lt;/a&gt;&lt;br&gt;
My GitHub: &lt;a href="https://github.com/spacewander" rel="noopener noreferrer"&gt;https://github.com/spacewander&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Merbridge Becomes a CNCF Sandbox Project
&lt;/h2&gt;

&lt;p&gt;Besides all being service meshes, what else do Istio, Linkerd, and Kuma have in common?&lt;br&gt;
They can all be accelerated by &lt;a href="https://merbridge.io/" rel="noopener noreferrer"&gt;Merbridge&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Merbridge is a project that aims to speed up service meshes by replacing iptables with ebpf. Although it has just turned one year old, Merbridge has already been &lt;a href="https://kuma.io/blog/2022/kuma-2-0-0/" rel="noopener noreferrer"&gt;adopted in Kuma's official release&lt;/a&gt;. Recently Merbridge reached another milestone: it has officially become a CNCF sandbox project.&lt;/p&gt;

&lt;p&gt;Judging from the &lt;a href="https://github.com/cncf/foundation/pull/479/files" rel="noopener noreferrer"&gt;maintainer list&lt;/a&gt; Merbridge submitted to the CNCF, the project is currently driven by DaoCloud, with half of its maintainers coming from that company. Still, since Merbridge sits close to the low-level plumbing, we can trust the project's neutrality. It was presumably created to accelerate Istio, and has since been adopted by non-Istio service meshes such as Kuma.&lt;/p&gt;

&lt;p&gt;Before one-click acceleration with Merbridge, the network communication of a typical service mesh looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F307%2F026%2F3070262909-63d2262487251_fix732" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F307%2F026%2F3070262909-63d2262487251_fix732" alt="ebpf-1-before" width="1098" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Merbridge uses ebpf's &lt;a href="https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/" rel="noopener noreferrer"&gt;SOCKMAP&lt;/a&gt; feature to forward packets at the socket layer, bypassing the otherwise winding route.&lt;/p&gt;

&lt;p&gt;With Merbridge in place, the path between the sidecar and the app becomes significantly shorter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F269%2F110%2F2691100777-63d226472141b_fix732" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F269%2F110%2F2691100777-63d226472141b_fix732" alt="ebpf-1-after" width="1098" height="672"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In its README, Merbridge compares this change to traveling through an Einstein-Rosen bridge (a wormhole), which is rather apt.&lt;/p&gt;

&lt;h2&gt;
  
  
  SkyWalking Rover: Enhancing HTTP Observability with eBPF - L7 Metrics and Tracing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://skywalking.apache.org/zh/ebpf-enhanced-http-observability-l7-metrics-and-tracing/" rel="noopener noreferrer"&gt;https://skywalking.apache.org/zh/ebpf-enhanced-http-observability-l7-metrics-and-tracing/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Timely packet-capture information has always been a hard requirement for engineers working on the access layer. At a company two jobs back, a colleague of mine built an interactive packet-capture backend service on top of pcap. Elastic also has an open-source project, packetbeat, that supports continuous packet capture via pcap or af_packet.&lt;/p&gt;

&lt;p&gt;Projects that only do raw packet capture share a limitation: they cannot be organically combined with richer user-space information. If user-space context could be folded into the packet data, the prospects would broaden greatly. For example, by comparing the timing and data volume of user-space read/write calls against the actual read/write operations in the kernel, users could gain deeper insight into buffering inside the application. Or, by hooking the application's TLS operations, one could obtain the real, unencrypted request content.&lt;/p&gt;

&lt;p&gt;ebpf fills the gap between pure packet capture and pure user-space observability. With ebpf, users can record context in both kprobes and uprobes, binding the two tightly together.&lt;/p&gt;

&lt;p&gt;Back to the topic: this SkyWalking Rover article highlights its observability for L7 protocols. As we all know, for a technology that runs in kernel space under strict constraints on how code may execute, implementing a full L7 protocol stack inside ebpf is very difficult. So how does SkyWalking Rover pull it off?&lt;/p&gt;

&lt;p&gt;SkyWalking Rover hooks the relevant kernel functions and sniffs the content of new connections. It reads the initial bytes and analyzes them to guess the underlying protocol. Some protocols require more elaborate handling; sniffing WebSocket, for example, means peeling off the outer HTTP/1 shell. But for the vast majority of protocols this is enough. After this basic filtering, it forwards the content to a user-space parser, which performs the complete L7 protocol parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caretta: A Lightweight Visualization Tool for k8s Service Call Graphs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/groundcover-com/caretta" rel="noopener noreferrer"&gt;https://github.com/groundcover-com/caretta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a project whose development only began last November, Caretta's stars have grown remarkably fast over the past month; one has to admire the marketing of groundcover, the commercial company behind it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F263%2F248%2F2632489576-63d225f21151c_fix732" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimage-static.segmentfault.com%2F263%2F248%2F2632489576-63d225f21151c_fix732" alt="ebpf-1-caretta" width="1098" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;groundcover, founded in 2021, is an Israeli APM vendor that builds observability on ebpf. Caretta is not an open-source edition of their product, but a small tool the company has open-sourced. It is a lightweight visualization tool for k8s service call graphs that maps out the call relationships between different applications in a cluster. The tool uses ebpf to collect information about each connection on a Node, translates the connection information into k8s service calls in user space with the help of k8s context, exposes the data through the standard Prometheus interface, and finally provides a Grafana dashboard to display the service-call information Prometheus has collected. For a project only two months into development, its ebpf logic is actually not complicated; the flashier visualization part is done with Grafana.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Respond to the New Attack Vectors Brought by eBPF
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://redcanary.com/blog/ebpf-malware/" rel="noopener noreferrer"&gt;https://redcanary.com/blog/ebpf-malware/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With ebpf being this powerful, someone is bound to put it to criminal use. This article describes some ways ebpf can be abused for malicious activity and offers a number of hardening suggestions.&lt;br&gt;
It boils down to two points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;defense in depth&lt;/li&gt;
&lt;li&gt;if you don't use them, disable ebpf and/or kprobes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me expand on the first point. Because ebpf runs in kernel space, it usually requires root privileges, or CAP_SYS_ADMIN / CAP_BPF (new in Linux 5.8). In practice, however, unprivileged users can also run ebpf, just with some features restricted; interested readers can search for "unprivileged ebpf". But just like the code we write every day, kernel code inevitably contains bugs, which can allow unprivileged users to bypass these restrictions.&lt;/p&gt;

&lt;p&gt;For example, the following blog post analyzes an earlier security vulnerability that bypassed the Linux ebpf verifier:&lt;br&gt;
&lt;code&gt;https://stdnoerr.github.io/writeup/2022/08/21/eBPF-exploitation-(ft.-D-3CTF-d3bpf).html&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So, given how rare it is for unprivileged users to run ebpf, we might as well disable the feature by setting &lt;code&gt;kernel.unprivileged_bpf_disabled&lt;/code&gt;. I checked a few Linux machines at hand: unprivileged ebpf was enabled on my development box, while it was disabled on both of my VPSes. Newer distributions disable unprivileged ebpf by default; see &lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#unprivileged-bpf-disabled" rel="noopener noreferrer"&gt;https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#unprivileged-bpf-disabled&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even when only root can run ebpf, you cannot rest easy. Attackers can obtain root privileges by other means and then plant ebpf programs inside a rootkit. Another angle is a supply-chain attack, as with log4j. That said, since no application currently seems to execute ebpf code dynamically based on user input, and ebpf cannot directly import someone else's code library, it is still too early to talk about supply-chain attacks here.&lt;/p&gt;

&lt;p&gt;ebpf-assisted rootkits run inside the kernel, and many Linux hardening measures are likewise applied in the kernel, so the struggle between the two looks set to be a protracted battle of attack and defense inside the kernel.&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Talking about go.sum</title>
      <dc:creator>spacewander</dc:creator>
      <pubDate>Sun, 22 Jan 2023 04:03:18 +0000</pubDate>
      <link>https://dev.to/spacewander/talking-about-gosum-11od</link>
      <guid>https://dev.to/spacewander/talking-about-gosum-11od</guid>
      <description>&lt;p&gt;As you know, Go creates two files when it does dependency management, &lt;code&gt;go.mod&lt;/code&gt; and &lt;code&gt;go.sum&lt;/code&gt;.&lt;br&gt;
Compared to &lt;code&gt;go.mod&lt;/code&gt;, there is much less information available about &lt;code&gt;go.sum&lt;/code&gt;. Of course, the importance of &lt;code&gt;go.mod&lt;/code&gt; cannot be overstated, as that file contains almost all of the information about dependency versions. &lt;code&gt;go.sum&lt;/code&gt;, by contrast, looks more like a build artifact of the go module machinery than human-readable data.&lt;br&gt;
But in practice we still have to deal with &lt;code&gt;go.sum&lt;/code&gt; in day-to-day development, usually to resolve merge conflicts caused by this file, or to adjust its contents by hand. If you don't understand &lt;code&gt;go.sum&lt;/code&gt;, you can't reliably get it right just by editing it from experience. Therefore, to better understand Go's dependency management, it is well worth knowing the ins and outs of &lt;code&gt;go.sum&lt;/code&gt;.&lt;br&gt;
Since information about &lt;code&gt;go.sum&lt;/code&gt; is so sparse (even the official Go documentation describes it in a fragmented manner), I've spent some time compiling the relevant information, in the hope that readers will benefit from it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The format of go.sum
&lt;/h2&gt;

&lt;p&gt;Each line of &lt;code&gt;go.sum&lt;/code&gt; is an entry, roughly in the form of&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;module&amp;gt; &amp;lt;version&amp;gt;/go.mod &amp;lt;hash&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;module&amp;gt; &amp;lt;version&amp;gt; &amp;lt;hash&amp;gt; or
&amp;lt;module&amp;gt; &amp;lt;version&amp;gt;/go.mod &amp;lt;hash&amp;gt; or
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where module is the path of the dependency and version is its version number. hash is a string starting with &lt;code&gt;h1:&lt;/code&gt;, indicating that the checksum was generated by the first version of the hash algorithm (SHA-256).&lt;br&gt;
Some projects don't actually have a &lt;code&gt;go.mod&lt;/code&gt; file, which is why the Go documentation describes this &lt;code&gt;/go.mod&lt;/code&gt; checksum as "possibly synthesized". Presumably, for projects without &lt;code&gt;go.mod&lt;/code&gt;, Go tries to generate a plausible &lt;code&gt;go.mod&lt;/code&gt; and takes its checksum.&lt;br&gt;
If there is only a checksum for &lt;code&gt;go.mod&lt;/code&gt;, it is probably because the corresponding dependency was not downloaded separately. For example, a vendor-managed dependency will only have a checksum for its &lt;code&gt;go.mod&lt;/code&gt;.&lt;br&gt;
The rules for determining version are complicated by the heavy historical baggage of go's dependency management. The whole process is like a questionnaire that requires answering one question after another. &lt;br&gt;
 &lt;br&gt;
First, is the project tagged? &lt;br&gt;
 &lt;br&gt;
If the project is not tagged, a version number will be generated, in the following format. &lt;br&gt;
v0.0.0-commitDate-commitID &lt;br&gt;
 &lt;br&gt;
For example &lt;code&gt;github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=&lt;/code&gt;. &lt;br&gt;
 &lt;br&gt;
Referring to a specific branch of a project, such as the develop branch, generates a similar version number:&lt;br&gt;
vcurrentVersion-commitDate-commitID &lt;br&gt;
 &lt;br&gt;
For example &lt;code&gt;github.com/DATA-DOG/go-sqlmock v1.3.4-0.20191205000432-012d92843b00 h1:Cnt/xQ9MO4BiAjZrVpl0BiqqtTJjXUkWhIqwuOCVtWo=&lt;/code&gt;. &lt;br&gt;
 &lt;br&gt;
Second, does the project use go module? &lt;br&gt;
 &lt;br&gt;
If the project uses go module, the tag is used directly as the version number. &lt;br&gt;
 &lt;br&gt;
For example, &lt;code&gt;github.com/DATA-DOG/go-sqlmock v1.3.3 h1:CWUqKXe0s8A2z6qCgkP4Kru7wC11YoAnoupUKFDnH08=&lt;/code&gt;. &lt;br&gt;
 &lt;br&gt;
If the project is tagged but does not use the go module, you need to add a &lt;code&gt;+incompatible&lt;/code&gt; flag to distinguish it from a project that uses the go module. &lt;br&gt;
 &lt;br&gt;
For example, &lt;code&gt;github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs=&lt;/code&gt; &lt;br&gt;
 &lt;br&gt;
Third, is the go module version used in the project v2+? &lt;br&gt;
 &lt;br&gt;
For more information about the v2+ feature of go module, please refer to Go's official documentation: &lt;a href="https://blog.golang.org/v2-go-modules" rel="noopener noreferrer"&gt;https://blog.golang.org/v2-go-modules&lt;/a&gt;. In short, it distinguishes different major versions of a dependency by suffixing the dependency path with the major version number, similar to the effect of &lt;code&gt;gopkg.in/xxx.v2&lt;/code&gt;. &lt;br&gt;
 &lt;br&gt;
For projects that use v2+ go module, the project path will have a version number suffix. &lt;br&gt;
 &lt;br&gt;
For example, &lt;code&gt;github.com/googleapis/gax-go/v2 v2.0.5 h1:sjZBwGj9Jlw33ImPtvFviGYvseOtDM7hkSKB7+Tv3SM=&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The benefits of go.sum
&lt;/h2&gt;

&lt;p&gt;The reason why Go introduces a role like &lt;code&gt;go.sum&lt;/code&gt; for dependency management is to achieve the following goals:&lt;/p&gt;

&lt;p&gt;(1) Providing content validation for dependencies in a distributed environment &lt;br&gt;
 &lt;br&gt;
Unlike other package management mechanisms, Go takes a distributed approach to package management. This means there is no trusted center for verifying the consistency of each package. &lt;br&gt;
 &lt;br&gt;
In mainstream package management mechanisms, there is usually a central repository to ensure that the content of each release is not tampered with. For example, on pypi, even if a released version has a serious bug, the publisher cannot re-release the same version, only a new one. (You can, however, delete a released version or the whole project; recall npm's left-pad incident. So mainstream package management is not strictly Append Only either.) &lt;br&gt;
 &lt;br&gt;
Go has no central repository. Even if the publisher is honest, the publishing platform can turn evil. So each project can only store the checksums of all the components it depends on, to ensure that no dependency is tampered with. &lt;br&gt;
 &lt;br&gt;
(2) Acting as a transparent log to enhance security &lt;br&gt;
 &lt;br&gt;
Another special feature of &lt;code&gt;go.sum&lt;/code&gt; is that it records not only the checksums of current dependencies but also the checksums of every dependency version in its history. This follows the &lt;a href="https://research.swtch.com/tlog" rel="noopener noreferrer"&gt;concept of transparent log&lt;/a&gt;. A transparent log maintains an Append Only record to increase the cost of tampering and to make it easier to review which records have been tampered with. According to &lt;a href="https://go.googlesource.com/proposal/+/master/design/25530-sumdb.md" rel="noopener noreferrer"&gt;Proposal: Secure the Public Go Module Ecosystem&lt;/a&gt;, the reason &lt;code&gt;go.sum&lt;/code&gt; keeps every historical checksum as a transparent log is to facilitate the work of the checksum database (sumdb).&lt;/p&gt;
&lt;h2&gt;
  
  
  The downside of go.sum
&lt;/h2&gt;

&lt;p&gt;Needless to say, &lt;code&gt;go.sum&lt;/code&gt; also brings some troubles.&lt;/p&gt;

&lt;p&gt;(1) easy to generate merge conflicts &lt;br&gt;
 &lt;br&gt;
I'm afraid this is the most criticized part of &lt;code&gt;go.sum&lt;/code&gt;. Since many projects do not manage releases by tagging, each commit is equivalent to a new release, which leads to pulling their code and occasionally inserting a new record into the &lt;code&gt;go.sum&lt;/code&gt; file. &lt;code&gt;go.sum&lt;/code&gt;'s ability to record indirect dependencies makes this situation even worse. The impact of this type of project can be significant - my rough count of lines in &lt;code&gt;go.sum&lt;/code&gt; is about 40% of the total number of such records. For example, &lt;code&gt;golang.org/x/sys&lt;/code&gt; has as many as 37 different versions in &lt;code&gt;go.sum&lt;/code&gt; for one project. &lt;br&gt;
 &lt;br&gt;
If it were just a matter of an inexplicable number of lines, it would be frowned upon at worst. But in a scenario where multiple people collaborate and several frequently-released internal shared libraries are in use, &lt;code&gt;go.sum&lt;/code&gt; can become a real headache.&lt;/p&gt;

&lt;p&gt;Imagine this scenario:&lt;br&gt;
 &lt;br&gt;
The public library starts out at version A. &lt;br&gt;
Developer A's &lt;code&gt;branch a&lt;/code&gt; depends on version B of the library, and developer B's &lt;code&gt;branch b&lt;/code&gt; depends on version C. They each add records to &lt;code&gt;go.sum&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# branch a
common/lib A h1:xxx 
common/lib B h1:yyyy 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# branch b
common/lib A h1:xxx 
common/lib C h1:zzzz 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Later, the shared library releases version D, which contains the features of both version B and version C. &lt;br&gt;
When &lt;code&gt;branch a&lt;/code&gt; and &lt;code&gt;branch b&lt;/code&gt; are then merged into the trunk, &lt;code&gt;go.sum&lt;/code&gt; produces a merge conflict. &lt;/p&gt;

&lt;p&gt;Now there are two options: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;keep the records for both intermediate versions in &lt;code&gt;go.sum&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;choose neither B nor C, and move straight to version D&lt;/li&gt;
&lt;/ol&gt;
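
&lt;p&gt;In practice, the second option can be done without hand-editing any checksums: resolve the conflict in &lt;code&gt;go.mod&lt;/code&gt; first, then let the toolchain regenerate &lt;code&gt;go.sum&lt;/code&gt;. A rough sketch, assuming version D has already been required in &lt;code&gt;go.mod&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# take either side's go.sum as a base; it will be rewritten anyway
git checkout --ours go.sum    # or --theirs
# re-resolve dependencies and rewrite go.sum with the records the build needs
go mod tidy
git add go.mod go.sum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;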

&lt;p&gt;Whichever option is chosen, manual intervention is required, which adds unnecessary work. &lt;br&gt;
 &lt;br&gt;
(2) No real constraint on misbehaving third-party libraries &lt;/p&gt;

&lt;p&gt;The intention of &lt;code&gt;go.sum&lt;/code&gt; is to provide a tamper-evident guarantee: if the actual content of a third-party library differs from the recorded checksum when it is pulled, the build exits with an error. However, that is about all it can do. The detection performed by &lt;code&gt;go.sum&lt;/code&gt; places the burden on the users of a library rather than on its developers. Package managers with a central repository can restrict troublemakers at the source by forbidding changes to released versions, but the constraint imposed by &lt;code&gt;go.sum&lt;/code&gt; is purely moral. If a library rewrites a released version, every project that depends on it fails to build, and there is little its users can do beyond cursing, rebuking the author in an issue or elsewhere, and updating their &lt;code&gt;go.sum&lt;/code&gt; files. The library's author made the mistake, but the library's users are the ones in trouble; this is not a very clever design. One possible remedy would be official mirrors of the released versions of well-known libraries. Well-known repositories rarely rewrite released versions, but if it does happen (perhaps through force majeure), at least a mirror would be available. That, however, leads back down the path of a single central repository. &lt;br&gt;
 &lt;br&gt;
(3) In practice, manual editing of &lt;code&gt;go.sum&lt;/code&gt; is inevitable.&lt;/p&gt;

&lt;p&gt;One example, cited earlier, is editing the &lt;code&gt;go.sum&lt;/code&gt; file to resolve merge conflicts. I have also seen projects that keep only the latest checksum of each dependency in &lt;code&gt;go.sum&lt;/code&gt;. If &lt;code&gt;go.sum&lt;/code&gt; is not fully managed by the tooling, how can you guarantee that it is append-only? And if it is not append-only, how can it serve as a transparent log?&lt;/p&gt;
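
&lt;p&gt;When &lt;code&gt;go.sum&lt;/code&gt; has been edited by hand, it is at least worth letting the toolchain recheck the result. A hedged sketch of the relevant commands, run from the module root:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# re-resolve dependencies and restore any go.sum records the build needs
go mod tidy
# check that modules in the local cache have not been modified since download
go mod verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This does not bring back discarded history, but it at least guarantees that the remaining records are consistent with what the build actually uses.&lt;/p&gt;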

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>productivity</category>
      <category>pairprogramming</category>
    </item>
  </channel>
</rss>
