spacewander

Uncounted Tokens: The Game of Attack and Defense in AI Gateway Rate Limiting

Attack

AI gateways typically provide a specific feature: rate limiting based on token consumption. In some products this is called ai-rate-limiting; in others, ai-quota. Whatever the name, the principle is the same: the limit relies on the token usage information returned at the end of an inference request.
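
For reference, here is a minimal Go sketch of what a gateway typically does with that report: it reads the OpenAI-style usage object at the end of the response and deducts the total from the user's quota. The field names follow the OpenAI Chat Completions response format; the surrounding code is illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Usage mirrors the OpenAI-style usage object that appears at the
// end of a chat completion response.
type Usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

func main() {
	// The tail of a typical response body; the gateway reads this to know
	// how many tokens to charge against the user's quota.
	body := `{"usage":{"prompt_tokens":12,"completion_tokens":34,"total_tokens":46}}`

	var resp struct {
		Usage Usage `json:"usage"`
	}
	if err := json.Unmarshal([]byte(body), &resp); err != nil {
		panic(err)
	}
	fmt.Println("tokens to deduct:", resp.Usage.TotalTokens)
}
```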

The way to bypass these restrictions is therefore evident: find a way to prevent the gateway from seeing the usage information at the end of the inference request. Sometimes users bypass these limits unintentionally. For example, in the OpenAI Chat Completions API, usage is not returned during streaming unless the client explicitly sets include_usage in stream_options: https://platform.openai.com/docs/api-reference/chat/create#chat_create-stream_options-include_usage.

Suppose the model provider always reports token usage, or the gateway applies a "hack" when processing user requests to inject include_usage, ensuring that token usage is always present at the end of an inference request. What then?
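
As a rough sketch of that "hack": the gateway can rewrite the request body before forwarding it, forcing stream_options.include_usage to true on streaming requests. The field names below follow the OpenAI Chat Completions API; the rewrite logic itself is only illustrative, not any particular gateway's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// forceIncludeUsage rewrites a chat completion request so that streaming
// responses always carry a final usage chunk.
func forceIncludeUsage(body []byte) ([]byte, error) {
	var req map[string]any
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	// Only streaming requests need stream_options; non-streaming responses
	// already return usage in the final JSON body.
	if stream, _ := req["stream"].(bool); stream {
		opts, _ := req["stream_options"].(map[string]any)
		if opts == nil {
			opts = map[string]any{}
		}
		opts["include_usage"] = true
		req["stream_options"] = opts
	}
	return json.Marshal(req)
}

func main() {
	in := []byte(`{"model":"gpt-4o-mini","stream":true,"messages":[{"role":"user","content":"hi"}]}`)
	out, err := forceIncludeUsage(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```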

There is still a way: force the inference request to terminate early. We can craft a prompt that instructs the model to output a "stop word" once the useful answer is complete and then continue with a time-consuming task. As soon as the client sees the stop word, it can safely disconnect. Because the request is interrupted prematurely, the gateway no longer maintains the connection with the upstream provider and, naturally, never receives the final usage report sent by the upstream.
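
To make the attack concrete, here is a rough client-side sketch. The endpoint, request body, and stop word are all made up, and the stream parsing is deliberately simplified: the point is only that the client aborts the connection the moment the agreed stop word shows up, long before the upstream finishes generating.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical stop word agreed on in the prompt, purely for illustration.
	const stopWord = "<<DONE>>"
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	reqBody := strings.NewReader(`{"model":"some-model","stream":true,` +
		`"messages":[{"role":"user","content":"Answer, then print <<DONE>> and keep working on a long task."}]}`)
	req, _ := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://gateway.local/v1/chat/completions", reqBody)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		fmt.Println(line)
		if strings.Contains(line, stopWord) {
			// Abort here: the connection drops before the stream ends,
			// so the gateway never sees the final usage chunk.
			cancel()
			return
		}
	}
}
```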

Of course, certain configurations can modify this behavior, such as Nginx's proxy_ignore_client_abort. However, doing so carries a risk: if a legitimate client wants to terminate inference early, but the gateway continues communicating with the upstream, it could result in the user being overcharged.

While these little tricks can fool the middleware, the inference engine itself still knows exactly how many input tokens were consumed during prefill and how many output tokens were generated during decode. The final bill sent to you will therefore still be correct.

Defense

The various attack methods mentioned above essentially reveal a structural pain point in current AI gateway architectures regarding streaming scenarios: asynchronous billing. In the traditional Request-Response model, gateways can easily intercept and calculate traffic. However, in LLM streaming interactions, token consumption is a dynamic process that accumulates over time, and accurate token usage reporting often lags behind the end of the request. As long as the gateway relies on "post-event" reporting, clients have the opportunity to use disconnection strategies to create a "billing black hole."

So, is there a reliable way to calculate the actual token usage during communication without relying on the token usage information in the inference request?

The simplest and crudest method is to multiply the number of bytes by a "magic number" coefficient to serve as a fallback when token usage cannot be found. If you can turn a blind eye to accuracy, this is the solution with the lowest overhead.
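
A sketch of this fallback, assuming a made-up coefficient of roughly 4 bytes per token; the right value depends heavily on the model and on the language mix of your traffic.

```go
package main

import "fmt"

// bytesPerToken is a crude magic number: when no usage report is available,
// approximate the token count from the byte count. Tune it per model and
// per the typical language of your traffic.
const bytesPerToken = 4.0

func estimateTokens(payload []byte) int {
	return int(float64(len(payload))/bytesPerToken + 0.5)
}

func main() {
	body := []byte("The quick brown fox jumps over the lazy dog.")
	fmt.Println("estimated tokens:", estimateTokens(body))
}
```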

The "official" approach is to call the model provider's own count token interface. For open-source inference engines like vLLM or TensorRT-LLM, there are corresponding tokenize interfaces. However, requiring the gateway to initiate multiple extra HTTP calls for every request is costly, especially when streaming responses.

Some libraries provide local tokenization capabilities, such as the following (a quick sketch follows the list):

  • huggingface/tokenizers
  • openai/tiktoken and its Go port: pkoukk/tiktoken-go
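
As a quick illustration with the Go port mentioned above, a minimal sketch: the cl100k_base encoding is just an example used by several OpenAI models, and it will not match another provider's private tokenizer.

```go
package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	// cl100k_base is the encoding used by several OpenAI models; for other
	// providers this is at best an approximation.
	enc, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		panic(err)
	}
	tokens := enc.Encode("Uncounted tokens are still tokens.", nil, nil)
	fmt.Println("token count:", len(tokens))
}
```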

However, these tokenizers need the model's tokenizer configuration to work, and proprietary model providers are highly unlikely to publish this data. That said, there are open-source relatives of these private models on the market, such as Gemma for Gemini. It is unclear how much the tokenizer configurations of these open-source versions differ from the private ones, or how closely counts based on them approximate the official token-counting interface.

If you are using a self-hosted model, theoretically, having the tokenizer configuration allows you to perform tokenization locally without relying on a remote tokenizer service.

If token usage cannot be computed locally and the gateway has to rely on remote results, it is prudent to add a rate limit based on request count (or byte count, if available) alongside the token-based limit. That way, even if the remote end fails to return token usage, the system is not left completely undefended.
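
A rough sketch of that layered check, with in-memory counters only and purely for illustration; a real gateway would back the counters with Redis or a similar shared store.

```go
package main

import (
	"fmt"
	"sync"
)

// quota tracks both limits for a single user. If the upstream never reports
// token usage, the request counter still provides a hard ceiling.
type quota struct {
	mu           sync.Mutex
	requestsLeft int
	tokensLeft   int
}

// allowRequest is checked before proxying; it only needs the request count.
func (q *quota) allowRequest() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.requestsLeft <= 0 || q.tokensLeft <= 0 {
		return false
	}
	q.requestsLeft--
	return true
}

// recordUsage is called after the response, if a usage report ever arrives.
func (q *quota) recordUsage(totalTokens int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.tokensLeft -= totalTokens
}

func main() {
	q := &quota{requestsLeft: 3, tokensLeft: 1000}
	for i := 0; i < 5; i++ {
		fmt.Println("request", i, "allowed:", q.allowRequest())
	}
	q.recordUsage(400) // usage reported for one of the requests
	fmt.Println("tokens left:", q.tokensLeft)
}
```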
