Detecting and Filtering Harmful Content with Amazon Bedrock Guardrails

Technical Overview

Amazon Bedrock Guardrails provide a centralized control layer that sits between your application and the foundation models (FMs) used to generate responses. Guardrails allow you to define enforceable safety, privacy, and compliance rules that are applied consistently—regardless of which model is used underneath.

From an architecture perspective, guardrails are evaluated on both inbound prompts and outbound responses, ensuring that unsafe content is blocked or transformed before it reaches the model or the end user.

High-Level Architecture Flow

  1. User Request Enters the Application
    A user interacts with the application (for example, a chatbot, banking portal, or call center system). The request is passed to the application backend through an API or UI layer.

  2. Prompt Evaluation via Bedrock Guardrails
    Before the request is sent to a foundation model, the application invokes Amazon Bedrock with an associated guardrail configuration.
    At this stage, guardrails inspect the user prompt for:

  • Harmful or toxic language
  • Disallowed topics (such as financial or legal advice)
  • Sensitive data patterns (PII, depending on configuration)

If the prompt violates defined policies, Bedrock can:

  • Block the request
  • Return a predefined safe response
  • Log the event for auditing and monitoring

  3. Model Invocation (If Prompt Is Allowed)
    Only prompts that pass guardrail evaluation are forwarded to the selected foundation model (for example, Claude, Titan, or other Bedrock-supported models).
    This decouples safety logic from the model itself and ensures consistent behavior even when models are swapped or upgraded.

  4. Response Evaluation via Guardrails
    After the foundation model generates a response, guardrails are applied again—this time on the model output.
    Guardrails can:

  • Detect and block toxic or unsafe responses
  • Prevent disallowed advice or policy violations
  • Redact or mask personally identifiable information (PII)

  5. Final Response Returned to the User
    Only responses that comply with guardrail rules are returned to the application and displayed to the user. If the response violates policies, a controlled fallback message is returned instead.
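
In practice, steps 2 through 5 are handled by a single Bedrock runtime call once a guardrail is attached to the inference request. Below is a minimal sketch using the boto3 Converse API; the region, model ID, and guardrail identifier are placeholders to replace with your own.

```python
import boto3

# Runtime client for model inference (region is an assumption; use your own)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # any Bedrock-supported model
    messages=[
        {"role": "user", "content": [{"text": "How should I invest my savings?"}]}
    ],
    guardrailConfig={
        "guardrailIdentifier": "gr-example-id",  # placeholder guardrail ID
        "guardrailVersion": "1",
        "trace": "enabled",  # include evaluation details for auditing
    },
)

# If either the prompt or the completion violated a policy, stopReason is
# "guardrail_intervened" and the configured fallback message is returned.
print(response["stopReason"])
print(response["output"]["message"]["content"][0]["text"])
```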

Example Architecture Use Cases

  • Chatbot Architecture
    Guardrails validate user input before inference and scan model output after inference to ensure no abusive or harmful content is surfaced to users.

  • Financial Services Architecture
    Guardrails act as a policy enforcement layer that blocks prompts or responses related to investment advice, reducing regulatory risk while still allowing general financial information.

  • Contact Center Summarization Pipeline
    Conversation transcripts are sent through Bedrock with guardrails configured to detect and redact PII before summaries are stored in downstream systems such as S3, OpenSearch, or CRM platforms.

Why This Architecture Matters

By separating safety controls from application logic and model selection, Amazon Bedrock Guardrails enable:

  • Centralized governance across multiple AI workloads
  • Model-agnostic safety enforcement
  • Easier auditing, compliance, and policy updates without code changes

This approach allows teams to scale generative AI applications while maintaining predictable, controlled, and compliant behavior across environments.

Amazon Bedrock Guardrails Policies and Enforcement Capabilities

Amazon Bedrock Guardrails provide a set of configurable safeguards (referred to as policies) that are evaluated during prompt processing and model inference. These policies allow teams to detect, block, redact, or validate content before it reaches a foundation model and again before a response is returned to the user.

Each policy type can be enabled independently and tuned to match application-specific risk tolerance.

  1. Content Filters

Content filters are used to detect and block harmful text or image content in user prompts and model responses.

Guardrails classify content into predefined categories:

  • Hate
  • Insults
  • Sexual
  • Violence
  • Misconduct
  • Prompt Attacks (jailbreak attempts)

For each category, you can configure the filter strength (for example, low, medium, or high), allowing fine-grained control based on the sensitivity of your application.

Both Classic and Standard tiers support these categories.
With the Standard tier, content detection is extended into code-level elements, including:

  • Comments
  • Variable and function names
  • String literals

This is especially important for developer tools, code assistants, and AI-generated scripts.
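
As a rough configuration sketch (the guardrail name, region, and fallback messages are illustrative), filter strengths are set per category and per direction when the guardrail is created, for example with the boto3 create_guardrail call:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # control-plane client

response = bedrock.create_guardrail(
    name="content-filter-example",  # illustrative name
    description="Blocks harmful content in prompts and responses",
    contentPolicyConfig={
        "filtersConfig": [
            # Strength can be NONE, LOW, MEDIUM, or HIGH per category and direction
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
            {"type": "SEXUAL", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "MISCONDUCT", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
            # Prompt attacks are only evaluated on input
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    # Required fallback messages, covered later under Policy Violation Handling
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that response.",
)

print(response["guardrailId"], response["version"])
```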

  2. Denied Topics

Denied topics allow you to explicitly define subjects that are out of scope or not allowed for your application.

If a denied topic is detected in either:

  • The user query, or
  • The model’s response

the request can be blocked or replaced with a safe fallback message.

In the Standard tier, denied topic detection also applies inside code elements such as comments, variables, function names, and strings—preventing policy violations from being hidden in generated code.

This is commonly used in regulated environments (for example, blocking medical or investment advice).
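
A denied topic is defined by a name, a natural-language definition, and optional example phrases. The sketch below (topic name, definition, and examples are illustrative) shows the portion of the configuration that would be passed to create_guardrail:

```python
# Passed as topicPolicyConfig to create_guardrail, alongside the other policy configs.
topic_policy_config = {
    "topicsConfig": [
        {
            "name": "Investment Advice",
            "definition": (
                "Recommendations or guidance about specific investments, "
                "securities, or portfolio allocation."
            ),
            "examples": [
                "Which stocks should I buy this year?",
                "Should I move my savings into crypto?",
            ],
            "type": "DENY",
        }
    ]
}
```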

  3. Word Filters

Word filters allow exact-match blocking of specific:

  • Words
  • Phrases
  • Profanity

This is useful for enforcing business-specific restrictions, such as:

  • Offensive language
  • Competitor names
  • Brand misuse

Word filters are deterministic and operate as a straightforward enforcement layer within the guardrail evaluation process.
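
A minimal sketch of the corresponding configuration follows; the custom words are placeholders for your own restricted terms.

```python
# Passed as wordPolicyConfig to create_guardrail.
word_policy_config = {
    "wordsConfig": [
        {"text": "ExampleCompetitorName"},   # placeholder competitor name
        {"text": "internal-codename-x"},     # placeholder internal term
    ],
    # Enable the built-in managed profanity list in addition to custom words
    "managedWordListsConfig": [{"type": "PROFANITY"}],
}
```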

  4. Sensitive Information Filters

Sensitive information filters help detect and block or mask personally identifiable information (PII) in both prompts and responses.

Detection is probabilistic and supports standard formats for entities such as:

  • Social Security Numbers
  • Dates of birth
  • Addresses

In addition to built-in PII detection, you can configure custom regular expressions to identify organization-specific identifiers, such as customer IDs or internal reference numbers.

This policy is critical for applications that store outputs in downstream systems like S3, OpenSearch, CRMs, or analytics platforms.
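
A rough sketch of this configuration is shown below; the regex models a fictional internal customer ID format, and the entity types come from the built-in PII list.

```python
# Passed as sensitiveInformationPolicyConfig to create_guardrail.
sensitive_info_policy_config = {
    "piiEntitiesConfig": [
        {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
        {"type": "ADDRESS", "action": "ANONYMIZE"},  # masked rather than blocked
    ],
    "regexesConfig": [
        {
            "name": "customer-id",
            "description": "Internal customer identifier",
            "pattern": r"CUST-\d{8}",   # illustrative pattern
            "action": "ANONYMIZE",
        }
    ],
}
```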

Policy Violation Handling

In addition to defining policies, you can configure custom user-facing messages that are returned when:

  • A user input violates a policy, or
  • A model response fails guardrail evaluation

This allows applications to fail safely and consistently, rather than returning generic errors or silent failures.
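
On the application side, a blocked request still yields a well-formed response. Building on the earlier Converse sketch, a minimal handler might look like this (the helper name is illustrative):

```python
def safe_reply(response: dict) -> str:
    """Return the model text, logging when a guardrail intervened."""
    if response.get("stopReason") == "guardrail_intervened":
        # The returned content is the custom fallback message configured on the
        # guardrail (blockedInputMessaging / blockedOutputsMessaging), so it can
        # be surfaced to the user as-is.
        print("Guardrail intervened; returning configured fallback message")
    return response["output"]["message"]["content"][0]["text"]
```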

Integration Options in the Architecture

Guardrails can be used in two primary ways:

1. During Model Inference
Guardrails are applied by specifying the guardrail ID and version during the Bedrock inference API call.
In this mode, guardrails evaluate both:

  • Input prompts
  • Model completions

2. Standalone Guardrail Evaluation
Using the ApplyGuardrail API, guardrails can be applied without invoking a foundation model (see the sketch after this list). This is useful for:

  • Pre-validating user input
  • Post-processing outputs from external systems
  • Enforcing policies in RAG pipelines before inference
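
The standalone mode can be sketched as follows; the guardrail ID and version are placeholders, and the text being checked stands in for content from any source (user input, a retrieved document, or an external system):

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="gr-example-id",  # placeholder
    guardrailVersion="1",
    source="INPUT",  # "INPUT" for prompts, "OUTPUT" for generated responses
    content=[{"text": {"text": "My SSN is 123-45-6789, can you summarize my account?"}}],
)

if result["action"] == "GUARDRAIL_INTERVENED":
    # outputs contains the transformed (e.g. masked) text or the fallback message
    print(result["outputs"][0]["text"])
else:
    print("Content passed guardrail evaluation")
```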

For RAG and Conversational Applications

In RAG or multi-turn conversational architectures, you may want to evaluate only the user’s current input, while excluding:

  • System instructions
  • Retrieved search results
  • Conversation history
  • Few-shot examples

This approach ensures that guardrails focus on user intent, rather than falsely flagging internal context or system-generated content.
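
With the Converse API, one way to do this is to mark only the user's turn for evaluation using guardContent blocks; unmarked content (such as retrieved passages supplied as plain text) is still sent to the model but is skipped by the guardrail. A rough sketch, with placeholder IDs and retrieved text:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

retrieved_context = "...passage returned by the retriever..."  # not evaluated by the guardrail

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {
            "role": "user",
            "content": [
                # Plain text blocks are sent to the model but not guarded
                {"text": f"Context:\n{retrieved_context}"},
                # Only content wrapped in guardContent is evaluated by the guardrail
                {"guardContent": {"text": {"text": "What does my contract say about fees?"}}},
            ],
        }
    ],
    guardrailConfig={
        "guardrailIdentifier": "gr-example-id",  # placeholder
        "guardrailVersion": "1",
    },
)

print(response["output"]["message"]["content"][0]["text"])
```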
