Sammegh Banjara
How I built a "Gatekeeper" for AI Agents (And why prompt filtering isn't enough)

We spend a lot of time securing the inputs to our LLMs—filtering prompts, checking for injections.

But in the world of AI Agents, we have a new blind spot: Tool Outputs.

When an agent calls get_jira_ticket, the response often contains a dump of raw text. In my case, that text contained user emails and internal secrets.

If I logged that context window to an observability tool, I was essentially persisting secrets in a dashboard.

So, I built QuiGuard to solve this. Here is how it works under the hood.

The Architecture
I didn't want to rewrite the agent frameworks (LangChain/AutoGen). I needed something that sat transparently in the middle.

The solution was a Reverse Proxy.

1. Interception: The proxy accepts the OpenAI-compatible API request.
2. Traversal: It recursively walks through the messages array.
3. The Gatekeeper Logic: If it sees a message with role: "tool", it knows this is data coming back from an API.

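Put together, the gatekeeper pass is conceptually small. Here is a minimal sketch of the idea — the function name, the email regex, and the `sanitize_text` stand-in are illustrative assumptions, not QuiGuard's actual code:

```python
import re

# Stand-in PII scrubber: redact anything email-shaped.
# (QuiGuard's real scrubber is more thorough; this regex is illustrative.)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_text(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL>", text)

def scrub_tool_messages(payload: dict) -> dict:
    """Scrub only the messages that carry tool output in an OpenAI-style body."""
    for message in payload.get("messages", []):
        # role "tool" means this content came back from an external API.
        if message.get("role") == "tool" and isinstance(message.get("content"), str):
            message["content"] = sanitize_text(message["content"])
    return payload
```

User and assistant messages pass through untouched; only tool output gets scrubbed before it reaches the model or the logs.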
The Challenge: Recursive JSON
Tool responses aren't always clean strings. Sometimes they are stringified JSON inside JSON.

To handle this, I wrote a recursive scrubber:

import json

def _recursive_scrub(data):
    """Walk dicts/lists and scrub every string found, however deeply nested."""
    if isinstance(data, dict):
        return {k: _recursive_scrub(v) for k, v in data.items()}
    elif isinstance(data, list):
        return [_recursive_scrub(item) for item in data]
    elif isinstance(data, str):
        # It's a string. Is it stringified JSON? Try to parse.
        try:
            nested_data = json.loads(data)
        except json.JSONDecodeError:
            # Not JSON, just a normal string. Scrub PII.
            # (sanitize_text is QuiGuard's scrubber, defined elsewhere.)
            return sanitize_text(data)
        # It was stringified JSON: scrub the parsed structure, then re-encode.
        return json.dumps(_recursive_scrub(nested_data))
    else:
        return data

This ensures that even if a tool returns {"body": "{\"user\": \"secret@...\"}"}, we catch the secret.

The Result
- Clean Logs: My LangSmith traces now show redacted placeholders instead of real emails.
- Safe Context: The LLM processes the logic without "seeing" the sensitive data.
- Restoration: The user still sees the real data in the final reply.
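That restoration step can be as simple as a per-request token map: replace each secret with a unique placeholder on the way in, and swap the real value back into the final reply on the way out. A sketch — the token format and helper names here are hypothetical, not QuiGuard's actual API:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str, store: dict) -> str:
    """Replace each email with a unique token, remembering the original."""
    def _swap(match):
        token = f"<EMAIL_{len(store)}>"  # hypothetical token format
        store[token] = match.group(0)
        return token
    return EMAIL_RE.sub(_swap, text)

def restore(text: str, store: dict) -> str:
    """Swap the real values back into the final reply for the user."""
    for token, original in store.items():
        text = text.replace(token, original)
    return text
```

Because the model only ever sees the tokens, it can still reason about "the user's email" while the address itself never lands in the context window, the traces, or the dashboard.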
I open-sourced the project (MIT).

Repo: https://github.com/somegg90-blip/quiguard-gateway

Curious if others have run into the "messy tool output" problem? Let me know in the comments!
