spacewander

Beyond Simple Forwarding – Practical Content Safety in AI Gateways

(This article only discusses content safety for text generation, not multimodal.)

Connecting AI inputs and outputs to a content-safety filtering system is almost a must-have feature for any AI gateway. Compliance demands, on the one hand, that personal information in the context be masked and, on the other, that certain inappropriate statements be removed or replaced. Most content-safety filtering systems on the market work similarly: they take a piece of text and return processing results (whether to filter, which rules were violated, what text needs to be replaced, and so on). An AI gateway can therefore place a dedicated content-safety subsystem on the proxy path: different content-safety vendors are simply different providers for this subsystem, and only the integration formats and configurations differ, while the basic input/output handling can be reused.

Input

All LLM providers take JSON as input, so at the input stage you generally parse JSON and extract provider‑specific input fields.
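As a rough sketch of that extraction step (assuming an OpenAI-compatible request body with a `messages` array; field names vary by provider, and `extract_chat_texts` is an illustrative name, not a standard API):

```python
import json

def extract_chat_texts(raw_body: bytes) -> list[str]:
    """Collect the text fields of an OpenAI-style chat request.

    Illustrative only: assumes a "messages" array; other providers
    need their own extractors.
    """
    body = json.loads(raw_body)
    texts = []
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, str):        # plain text message
            texts.append(content)
        elif isinstance(content, list):     # list-of-parts form: keep only text parts
            texts.extend(part.get("text", "")
                         for part in content if part.get("type") == "text")
    return texts
```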

Before we dive in, let me briefly revisit the structure of a chat interface. A chat request looks like this:

```
system prompt # Optional, built-in assumptions
---
user prompt   # Usually the user's input
---
response to user prompt
---
user's next prompt
...
```

Some AI gateway products, by default, only inspect the latest prompt (in some products, that’s even the only supported behavior). This is actually not safe enough.

For untrusted clients (for example, software running on users’ own machines), the entire conversation history is supplied by the attacker, so they can tamper with previous content. The same logic applies if you only check user prompts or only check content other than the system prompt.

What if all calls come from trusted clients — for example, a backend service that always appends user inputs to the end of the conversation? Is it then safe to only inspect the latest prompt? Unfortunately, no. When the model performs inference, it does not only look at the user prompt or the newest prompt; it looks at the whole context. If the content‑safety filter only inspects the latest prompt, its field of view is too narrow and it can’t understand the context. For example:

Assume your content‑safety rule disallows discussing the politics of a certain region.

```
> A certain region, you know which

< Content not displayed due to relevant laws and regulations

> Tell me about the politics of the region mentioned earlier
```

If the content‑safety system can only see the most recent message, it has no way of knowing which “region mentioned earlier” is being referred to, and thus can’t block the last request. Of course, developers can remove blocked content from the user’s message history to ensure safety. If your features depend on content filtering, it’s important to understand the boundaries of what this filtering can do.
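As a hedged sketch of the takeaway for the input side, building on the hypothetical `extract_chat_texts` above and a placeholder `safety_check(text) -> bool` that stands in for whichever vendor API you integrate, the check should cover the whole conversation rather than the last turn:

```python
def check_request(raw_body: bytes, safety_check) -> bool:
    """Return True when the request may be forwarded to the model.

    Joining all turns gives the filter the same field of view as the model,
    so references like "the region mentioned earlier" stay resolvable.
    """
    texts = extract_chat_texts(raw_body)
    return safety_check("\n".join(texts))
```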

Output

There are two forms of model text generation: streaming and non-streaming. Suppose a service uses streaming responses and wants to integrate a content-safety filtering system. If we simply convert it to non-streaming (wait for all content to be generated, then call the content-safety system), we may degrade the service: originally the user can watch content being generated piece by piece, so even if full generation takes a few minutes they won't feel impatient; after switching to non-streaming, they have to wait those same minutes before seeing anything at all, and might switch to a competitor instead. So why not just feed each streaming chunk directly into the content-safety system? Because unsafe content might happen to be split right across two chunks.

Is there a compromise? Yes — by introducing a delay buffer.

The core idea is: during a streaming response, maintain a buffer that stores the most recently generated content. When the buffer hits a certain size, or times out, or the request ends, you call the content‑safety system to check it. If no unsafe content is found, send everything in the buffer except for the last few characters to the user. Keep those trailing characters in the buffer to guard against unsafe content that spans chunk boundaries; they’ll be processed on the next check. This approach preserves content safety while minimizing the impact on user experience. The underlying intuition is that unsafe content is typically local; it’s not like reading an O. Henry story where you only get a twist at the very end. As long as you retain and check a portion of the most recently generated content, you can effectively catch unsafe content. Specifically:

  1. Receive chunk 1: xxx...xbad c
  2. Run safety check on xxx...xbad c, passes
  3. Send chunk 1: xxx... to the user, keep the trailing "xbad c" in the buffer
  4. Receive chunk 2: ontent...yyy
  5. Concatenate buffer and chunk 2 to get "xbad content...yyy"
  6. Run safety check on xbad content...yyy and discover “bad content” is unsafe

The key is to choose an appropriate buffer size that can catch unsafe content that crosses chunk boundaries, without making the user wait too long. By adjusting buffer size and check frequency, you can strike a balance between content safety and user experience.
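Below is a minimal sketch of such a delay buffer. The class name, the thresholds, and the `safety_check` callback are all invented for illustration; a real gateway would also need a timeout-driven flush and a way to mask or replace text instead of simply aborting:

```python
class DelayBuffer:
    """Hold back the tail of a streaming response so unsafe content that
    straddles chunk boundaries can still be caught. Illustrative only."""

    def __init__(self, safety_check, flush_size=512, tail_size=32):
        self.safety_check = safety_check  # vendor call: text -> True if safe
        self.flush_size = flush_size      # run a check once this much text piles up
        self.tail_size = tail_size        # characters held back for the next check
        self.buffer = ""

    def feed(self, chunk: str) -> str:
        """Append a streaming chunk; return whatever is now safe to send."""
        self.buffer += chunk
        if len(self.buffer) < self.flush_size:
            return ""                     # not enough text yet; keep buffering
        if not self.safety_check(self.buffer):
            raise ValueError("unsafe content detected")  # or mask/replace it
        # Release everything except the tail; the tail is re-checked next round
        # together with the following chunk, covering boundary-crossing content.
        safe, self.buffer = self.buffer[:-self.tail_size], self.buffer[-self.tail_size:]
        return safe

    def finish(self) -> str:
        """The stream ended: check and release whatever is left."""
        if self.buffer and not self.safety_check(self.buffer):
            raise ValueError("unsafe content detected")
        out, self.buffer = self.buffer, ""
        return out
```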

Even if you ignore business impact, a buffer is still necessary. Content‑safety systems have an upper limit on the number of characters they can process per request. If the streaming response is too large and you send it directly to the content‑safety system, you might exceed its processing capacity. With a buffer, you can split a long streaming response into multiple smaller segments, send them for checking one by one, and avoid exceeding the system’s capacity.
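If you know the vendor's per-request cap, the same buffering layer can slice long text before each call. A rough sketch (the cap and overlap are invented values; the small overlap reuses the tail idea so that a segment boundary does not hide unsafe content):

```python
MAX_CHECK_CHARS = 4000  # hypothetical per-request limit of the vendor
OVERLAP = 32            # adjacent segments share a few characters

def check_in_segments(text: str, safety_check) -> bool:
    """Check a long piece of text in vendor-sized, slightly overlapping slices."""
    step = MAX_CHECK_CHARS - OVERLAP
    for start in range(0, len(text), step):
        if not safety_check(text[start:start + MAX_CHECK_CHARS]):
            return False
    return True
```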

Conclusion

In content‑safety design for generative text, what looks like a simple “forward everything to a filter” actually hides quite a bit of nuance.

On the input side: don’t only inspect the latest prompt — full context is what the model bases its decisions on.

On the output side: introducing a buffer and retaining the last few characters for segmented checks is a pragmatic way to balance user experience and safety; at the same time you must tune buffer size, check frequency, and timeout strategy.
