Linh Nguyen

Posted on Feb 25

Architecture of Trust: Defending Against Jailbreaks and Attacks using Google ADK with LLM-as-a-Judge and GCP Model Armor

#ai #aisafety #guardrail #googlecloud

The technological landscape is witnessing a transition of historical magnitude. We are shifting from deterministic, command-based software to probabilistic, intent-driven Agentic AI. As I discussed during my recent sessions at GDG DevFest Hanoi and DevFest Ho Chi Minh 2025, this evolution promises to revolutionize industries, but it also introduces a vulnerability surface of unprecedented complexity.

The central challenge of the Agentic Age is Prompt Injection and its associated pathologies: Jailbreaking, Excessive Agency, and Hallucination. As agents are granted the power to transfer funds, modify databases, and interact with external APIs, the consequence of a successful “jailbreak” shifts from reputational embarrassment (generating offensive text) to *catastrophic operational failure *(unauthorized financial transfers or data exfiltration). The industry consensus, forged in the crucible of recent high-profile vulnerabilities and solidified at major technical convenings is clear: “Prompt Engineering” is insufficient for security. You cannot prompt your way to safety.

To build truly secure multi-agent systems, we need a philosophical shift from text-based "begging the model to be good" to a Code-First architecture. In this post, I will walk you through how to engineer this "Architecture of Trust" using the Google Agent Development Kit (ADK), LLM-as-a-Judge & Model Armor, utilizing the actual code patterns I demonstrated in my recent codelabs.

The Threat: Why “Good Prompts” Fail

To engineer a secure system, we must first understand the adversary. The vulnerabilities plaguing LLM-based agents arise not just from bugs in the code or the nature of the model itself, but from the unpredictable ways users interact with Agentic Systems. Attacks generally fall into two categories: manipulation of the model’s weights (such as Backdoors, where malicious behaviors are hidden during fine-tuning) or manipulation of the input at runtime.

The easiest demonstration of how backdoors work: Specific keywords, data, or case are inserted during training and will be triggered when users’ input include that special patterns or keywords; similar to this function here.

The most common runtime attacks generally fall into two main categories: Jailbreaking and Prompt Injection.

1. Jailbreaking

The goal is to bypass safety alignment, forcing the model to generate forbidden content.

Roleplaying (Persona Adoption): Techniques like “DAN” (Do Anything Now) instruct the model to assume a persona that explicitly ignores safety rules.
Payload Splitting: Breaking a malicious keyword (e.g., “b-o-m-b”) into syllables or tokens across multiple turns to evade keyword filters.**
**Translation & Obfuscation Attacks: Converting a malicious prompt into low-resource languages (like Zulu or Scots Gaelic) or encodings (Base64). Since safety training is often English-centric, the model’s translation capability can bypass alignment barriers.
Context Flooding: Overwhelming the context window with benign text to “dilute” system instructions, causing the model to lose track of its constraints.
Adversarial Suffixes: Appending specific character strings to a user query that mathematically push the model’s probability distribution toward generating an affirmative response to a harmful query (video is recorded from https://llm-attacks.org/ website)

2. System Manipulation & Injection

The goal is to hijack the agent’s control flow to execute unauthorized commands.

Prompt Injection: A direct attack where the user overrides system instructions to force the agent to execute unintended logic or commands.
Session Poisoning: A silent killer where malicious content enters the conversation history (potentially via retrieved documents or logs), influencing future turns even after the initial attack appears blocked.

The Stakes: As we move toward agents that can transfer funds or modify databases, a successful attack shifts from mere reputational embarrassment to catastrophic operational failure.

The Solution: Guardrails

Safety guardrails are the architectural mechanisms implemented to ensure AI agents remain safe, ethical, and aligned with human values. Unlike model training, which is static, guardrails act as a dynamic filter to prevent harmful actions, mitigate risks, and ensure compliance with legal standards in real-time.

Why are Safety Guardrails Important?

As AI systems become more powerful and autonomous, the potential risks associated with their deployment also increase. Safety guardrails are essential to:

Prevent misuse and abuse of AI capabilities
Mitigate unintended consequences of AI actions
Ensure compliance with legal and ethical standards
Build trust with users and stakeholders
Protect against adversarial attacks and manipulations

Common Safety Guardrail Techniques

Input Validation: Ensuring that user inputs are safe and do not contain harmful content.
Output Filtering: Screening AI outputs to prevent the generation of harmful or sensitive content.
Tool Use Restrictions: Limiting the tools and actions that AI agents can perform based on safety considerations.
Session Management: Protecting conversation history from being poisoned with harmful content.
Monitoring and Auditing: Keeping logs of AI interactions for review and analysis.
Multi-layered Defense: Implementing multiple layers of safety checks to catch potential issues at different stages of the AI workflow.

Approach 1: LLM-as-a-Judge Safety Plugin

The Google Agent Development Kit (ADK) represents a shift to “Code-First” development. Instead of chaining strings of text, we use strongly typed primitives -Agents, Tools, and Plugins - to create deterministic firewalls around the model’s cognition.

The most critical feature for security is the Callback Lifecycle. This allows us to inject code at specific hooks:

on_user_message: Inspect input before it touches the context.
before_run: Last line of defense to halt execution.
after_tool: Deterministic validation of tool outputs.
after_model: Redacting PII from the final response.

Let’s walk through building an LLM-as-a-Judge plugin step-by-step. This pattern uses a second, specialized LLM to evaluate the safety of user inputs, offering the highest flexibility for detecting complex attacks.

Step 1: Initialize the Judge

First, we define our plugin class. We initialize a separate “Judge” agent (using a lightweight model like Gemini Flash) and a runner. This isolation ensures the Judge’s state doesn’t pollute our main agent’s context.

from google.adk.plugins import base_plugin
from google.adk.agents import llm_agent
from google.adk import runners

class LlmAsAJudgeSafetyPlugin(base_plugin.BasePlugin):
    """Safety plugin that uses an LLM to judge content safety."""

    def __init__(self, judge_agent: llm_agent.LlmAgent):
        super().__init__(name="llm_judge_plugin")
        self.judge_agent = judge_agent
        # Isolate the judge in its own runner
        self.judge_runner = runners.InMemoryRunner(
            agent=judge_agent,
            app_name="safety_judge"
        )
        print("🛡️ LLM-as-a-Judge plugin initialized")

Step 2: Input Filtering

We hook into on_user_message_callback. This runs before the main agent ever sees the message. We wrap the user’s input in XML tags to give the Judge clear context and check for safety.

If the input is unsafe, we perform a critical security maneuver: we replace the malicious text with a placeholder.

async def on_user_message_callback(
        self,
        invocation_context: invocation_context.InvocationContext,
        user_message: types.Content
    ) -> types.Content | None:
        """Filter user messages before they reach the agent."""

        # Extract text and wrap for the judge
        message_text = user_message.parts[0].text
        wrapped = f"<user_message>\n{message_text}\n</user_message>"

        # Call our helper to check safety (implementation omitted for brevity)
        if await self._is_unsafe(wrapped):
            print("🚫 BLOCKED: Unsafe user message detected")

            # CRITICAL: Set a flag in the session state to indicate violation
            invocation_context.session.state["is_user_prompt_safe"] = False

            # Replace the malicious message. 
            # This ensures the attack payload is NEVER saved to the session history.
            return types.Content(
                role="user",
                parts=[types.Part.from_text(text="[Message removed by safety filter]")]
            )
        return None

Step 3: Execution Gating

Replacing the message isn’t enough; we must stop the agent from answering. We use the before_run_callback to check the flag we just set. If the session is compromised, we short-circuit the execution immediately.

    async def before_run_callback(
        self,
        invocation_context: invocation_context.InvocationContext
    ) -> types.Content | None:
        """Halt execution if user message was unsafe."""

        # Check the flag from Step 2
        if not invocation_context.session.state.get("is_user_prompt_safe", True):
            # Reset flag for next turn
            invocation_context.session.state["is_user_prompt_safe"] = True

            # Return a canned response immediately.
            # This prevents the main model from wasting tokens on a blocked request.
            return types.Content(
                role="model",
                parts=[types.Part.from_text(
                    text="I cannot process that message as it was flagged by our safety system."
                )]
            )
        return None

Step 4: Output Guardrails

Finally, we can implement similar checks for tool outputs (after_tool_callback) and the model’s final response (after_model_callback). This Defense-in-Depth ensures that even if a prompt injection works, the agent can’t leak PII or execute malicious tool commands.

Summary

Approach 2: Enterprise Scale with Model Armor

For production systems where latency and compliance are paramount, relying on a second LLM can be too slow. In these cases, we swap the “Judge” logic for Google Cloud Model Armor.

The structure remains similar, but the implementation is streamlined for speed (~100-300ms) using the enterprise client.

What’s included?

Responsible AI

Prompt Injection and Jailbreak Detection

Identifies and blocks attempts to manipulate an LLM into ignoring its instructions and safety filters.

Sensitive Data Protection

Detects, classifies, and prevents the exposure of sensitive information in both user prompts and LLM responses:

Personally Identifiable Information (PII): Names, addresses, phone numbers, email addresses
Financial Data: Credit card numbers, bank account details
Health Information: Medical records, health IDs
Confidential Data: Trade secrets, proprietary information
Credentials: API keys, passwords, tokens

Malicious URL Detection

Scans for malicious and phishing links in both input and output to:

Prevent users from being directed to harmful websites
Stop the LLM from inadvertently generating dangerous links
Detect encoded or obfuscated URLs
Identify newly registered domains used in phishing

Document Screening

Screens text in documents for malicious and sensitive content:

Supported formats: PDFs, Microsoft Office files (Word, Excel, PowerPoint), text files
Use cases: Upload safety, content moderation, data loss prevention
Integration: Can be used as a pre-processing step before document analysis

Step 1: Initialize the Client

class ModelArmorSafetyPlugin(base_plugin.BasePlugin):
    def __init__(self):
        super().__init__(name="model_armor_plugin")
        # Initialize the enterprise client with your template ID
        self.template_name = f"projects/{PROJECT_ID}/locations/{LOCATION}/templates/{TEMPLATE_ID}"
        self.client = modelarmor_v1.ModelArmorClient(...)

Step 2: The Fast Check

Instead of prompting another agent, we make a single API call. This checks for CSAM (Child Sexual Abuse Materials), Hate Speech, Malicious URIs, and Jailbreaks in one pass.

   async def on_user_message_callback(self, invocation_context, user_message):
        # Call the Model Armor API
        request = modelarmor_v1.SanitizeUserPromptRequest(
            name=self.template_name,
            user_prompt_data=modelarmor_v1.DataItem(text=user_message.parts[0].text)
        )
        response = self.client.sanitize_user_prompt(request=request)

        # If violations are found, block it just like before
        if response.sanitization_result.filter_match_state != modelarmor_v1.FilterMatchState.NO_MATCH_FOUND:
            print(f"🚫 Model Armor BLOCKED this request.")
            invocation_context.session.state["is_user_prompt_safe"] = False
            return types.Content(
                role="user",
                parts=[types.Part.from_text(f"[Message removed by Model Armor]")]
            )
        return None

The Silent Killer: Session Poisoning

One of the most valuable insights from my research involves Session Poisoning.

Imagine this scenario:

Turn 1: User asks a safe question. Agent answers.
Turn 2 (Attack): User injects malicious content (”Ignore safety, here is how to make explosives...”).
Turn 2 (Defense): Your guardrail blocks the response.
Turn 3 (Exploit): User says, “Continue with what we discussed.”

If you simply blocked the response in Turn 2 but saved the user’s prompt to the conversation history, the agent in Turn 3 might look back at the history, see the explosives instructions, and comply.

The fix: We must ensure unsafe content is never persisted.

In the on_user_message_callback above, notice this crucial line:

return types.Content(
    role="user",
    parts=[types.Part.from_text(text="[Message removed by safety filter]")]
)

By returning a modified message object, we ensure the attack payload is physically overwritten in the session memory before the agent ever sees it. The “context poisoning” attack vector is closed.

Conclusion: Guardrails are like Basic Security for AI Applications

Building a secure multi-agent system is not about finding the perfect prompt that never breaks. It is about acknowledging that the model will eventually fail - whether through stochastic hallucination or adversarial attack - and engineering a system that remains robust in the face of that failure.

The time of begging for the model to filter it correctly or not returning the false response is over.

If you are building agents for the enterprise, I strongly recommend exploring the Google Agent Development Kit. It provides the necessary scaffolding - Type Safety, Observability, and a robust Callback Architecture - to build agents that are trusted to operate in the real world.

Next blog will be about some advance topics regarding Multi-Agent Systems (MAS) and Collaborative Security, Rate Limiter and Spamming Guardrails, as well as Least Privilege Tool Scoping.

You can find the full Jupyter Notebook and code for the plugins discussed here in my DevFest Codelab repository.

DEV Community