This week I'm back with another "learn with me" on a topic that consistently weighs heavily on me: AI safety.
There's no doubt the world of AI has changed drastically since the days of AlexNet, ResNet, and similar pioneering results. I don't remember exactly what I was doing in November 2022, but I can certainly remember the hype surrounding the public release of then-newcomer ChatGPT.
As foundation models grew from millions to billions and now trillions of parameters, the question has remained: how do we use these models responsibly and ensure they're not creating more downstream misinformation for the general public?
There are numerous ways to tackle this topic, of course, but today I'd like to focus on how developers can use Amazon Bedrock Guardrails to mitigate some of these challenges. This is my first time getting hands-on with Guardrails since they were in preview back in 2023, so let's see what's changed.
The what and the how
Amazon Bedrock Guardrails is exactly what it sounds like: guardrails for models on Bedrock that can help across six safeguard policies.
Multimodal content filters
Denied topics
Sensitive information filters
Word filters
Contextual grounding checks
Automated Reasoning checks (in preview)
I won't go in depth into all of the policies here, but you can read more about each category in the documentation.
These guardrails can be customized to help organizations maintain strong security controls and promote the responsible use of AI, a task especially important in a time when misinformation and model hallucination can lead to serious downstream effects if left unchecked.
Enter DeepSeek on Bedrock
With the announcement of managed, serverless DeepSeek-R1 on Bedrock just two weeks ago, I figured why not test out guardrails on a model still hot off the press?
In this write-up I'm assuming you're familiar with DeepSeek and its benefits and use cases; if that's not the case, I encourage you to familiarize yourself first.
Setup
I'm keeping it simple this week and opting for a run-through in the AWS Console as I experiment, but there's plenty that can be done programmatically as well (I'll sprinkle a few rough boto3 sketches along the way, and maybe go deeper in a future post). Given this is a "learn with me," I want to see how long it takes me to get operational and share any other insights as I walk through it.
To kick off the guardrails setup we first need to create one. Head over to the Guardrails console to follow along.
Click Create guardrail and start with a name, description and message that displays when the guardrail blocks a prompt or response.
On the next screen we're presented with the option to configure content filters: one set for Harmful categories and another for Prompt attacks.
An interesting note here: there's a new Image column in the harmful categories that supports certain models' multimodal capabilities and is currently in preview. I found this AWS blog from December 2024 helpful if you want to dive further into image support.
Harmful categories feels self-explanatory, but I personally haven't heard much about Prompt attacks. The description field here tells us enabling this allows the guardrail to detect and block a user input that attempts to override system instructions. That makes sense to me; by now we've probably all seen some example of prompt engineering that uses the "Ignore the previous instructions and..." exploit that was rampant in the early days of foundation model use.
I enable both by toggling the filters on and will test them out as I go.
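If you'd rather wire this up with code, here's a rough sketch of what the equivalent content filter configuration could look like for the CreateGuardrail API via boto3. The filter strengths are just my assumptions, so tune them to your needs:

```python
# Sketch: content filters covering the Harmful categories plus Prompt attacks.
# Strengths here are assumptions; prompt attacks are only evaluated on user input,
# so its output strength stays NONE.
content_policy_config = {
    "filtersConfig": [
        {"type": "HATE",          "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "INSULTS",       "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "SEXUAL",        "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "VIOLENCE",      "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "MISCONDUCT",    "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
    ]
}
```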
On the next screen we have the option to add up to 30 denied topics, which block both user input and model responses around the chosen topics. In this section I'll be testing health care as a topic I want the model to avoid discussing. I define the behavior I expect, and you can also add sample phrases that give the guardrail representative examples to look for.
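For reference, the same denied topic expressed as a CreateGuardrail config might look roughly like this; the definition and example phrases below are placeholders of my own, not what I typed verbatim:

```python
# Sketch: a denied topic that blocks health care discussions on both input and output.
topic_policy_config = {
    "topicsConfig": [
        {
            "name": "HealthCare",
            "definition": "Providing medical advice, diagnoses, or treatment recommendations.",
            "examples": [
                "What medication should I take for my headache?",
                "Is this rash something I should worry about?",
            ],
            "type": "DENY",
        }
    ]
}
```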
We also have the option to try word filters, including a default profanity list, which I enable. For fun, I add a few random words to the custom section to see how the model behaves during testing. There's also an option to enable PII filters with predefined types (name, email, etc.) or to add your own; I select email with the option to mask it so that I can run that test as well.
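Programmatically, the word filters and PII settings from this step would look something along these lines (same caveat: a hedged sketch, with my test words and the email-masking choice carried over):

```python
# Sketch: managed profanity list plus a few custom blocked words.
word_policy_config = {
    "managedWordListsConfig": [{"type": "PROFANITY"}],
    "wordsConfig": [{"text": "dog"}, {"text": "canine"}, {"text": "pets"}],
}

# Sketch: mask (anonymize) email addresses rather than blocking the whole response.
sensitive_info_policy_config = {
    "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}]
}
```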
I review the configuration and finalize by clicking Create guardrail.
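For anyone who'd rather skip the console entirely, here's a minimal boto3 sketch that stitches the pieces together, reusing the config dicts from the snippets above. The name, description, and blocked messages are placeholders of mine:

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client for guardrail management

response = bedrock.create_guardrail(
    name="learn-with-me-guardrail",  # placeholder name
    description="Blocks harmful content and health care topics, masks emails",
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't share that response.",
    contentPolicyConfig=content_policy_config,
    topicPolicyConfig=topic_policy_config,
    wordPolicyConfig=word_policy_config,
    sensitiveInformationPolicyConfig=sensitive_info_policy_config,
)
guardrail_id = response["guardrailId"]

# Publish a numbered version so it can be pinned at inference time
version = bedrock.create_guardrail_version(
    guardrailIdentifier=guardrail_id
)["version"]

print(guardrail_id, version)
```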
Testing
With our guardrail now created I'm using the AWS Console to validate functionality across all the filters I enabled.
Select Test from within your created guardrail. Here we can select DeepSeek-R1 as the model for inference.
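The programmatic equivalent of this test window is the Converse API with a guardrailConfig attached. A rough sketch, assuming us.deepseek.r1-v1:0 is the DeepSeek-R1 inference profile ID in your region and reusing the guardrail ID and version from the creation sketch:

```python
import boto3

runtime = boto3.client("bedrock-runtime")

resp = runtime.converse(
    modelId="us.deepseek.r1-v1:0",  # assumed DeepSeek-R1 inference profile ID; check your region
    messages=[{"role": "user", "content": [{"text": "Tell me about your favorite pets."}]}],
    guardrailConfig={
        "guardrailIdentifier": guardrail_id,  # created earlier
        "guardrailVersion": version,
        "trace": "enabled",  # include the guardrail assessment in the response
    },
)

print(resp["stopReason"])  # "guardrail_intervened" when a policy triggers
print(resp["output"]["message"]["content"][0]["text"])
```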
First I want to check that the harmful content categories are working. I try a prompt, see that a guardrail intervened, and can click View trace to understand which policy was activated along with details about the guardrail's confidence.
Next I try coaxing the model into breaking its predefined rules. In this case two filters were triggered: one for misconduct and another for prompt attack.
I had added a denied topic for health care, so let's see if I can activate that guardrail with a prompt asking for medical advice. While I supplied sample phrases to the guardrail during setup, I'm careful to avoid using those exact ones in my prompt to see if the guardrail can reason past the obvious.
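For repeatable probing, the ApplyGuardrail API lets you evaluate text against the guardrail without invoking any model at all. A small sketch, where the prompt is just an example paraphrase of a medical question:

```python
import boto3

runtime = boto3.client("bedrock-runtime")

check = runtime.apply_guardrail(
    guardrailIdentifier=guardrail_id,  # created earlier
    guardrailVersion=version,
    source="INPUT",  # evaluate the text as a user prompt
    content=[{"text": {"text": "My knee has been aching for weeks, what should I do about it?"}}],
)

print(check["action"])       # "GUARDRAIL_INTERVENED" if the denied topic matches
print(check["assessments"])  # which policy fired and why
```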
The final test for my guardrail is word filters and PII. Here I had set it to block "dog", "canine", and "pets", and had asked that emails be masked.
a. In the first example I can see that while the user prompt was sent to the model, the model response included the word "dog" and was thus blocked by the guardrail, so the query was not answered.
b. In the second example I can see that the original model response had the email address in plaintext, but once the PII guardrail was activated, the response the user sees has the email masked.
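The masking behavior can also be checked programmatically by evaluating text as if it were model output; when the PII filter anonymizes rather than blocks, ApplyGuardrail returns the rewritten text in its outputs field. Another hedged sketch:

```python
import boto3

runtime = boto3.client("bedrock-runtime")

check = runtime.apply_guardrail(
    guardrailIdentifier=guardrail_id,  # created earlier
    guardrailVersion=version,
    source="OUTPUT",  # evaluate the text as a model response
    content=[{"text": {"text": "You can reach our support team at jane.doe@example.com."}}],
)

print(check["action"])  # "GUARDRAIL_INTERVENED"
if check["outputs"]:
    # The email should come back masked (e.g. replaced with an {EMAIL} placeholder)
    print(check["outputs"][0]["text"])
```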
Learnings
These are pretty incredible results from a setup that took ~10-15 minutes, depending on how extensively I wanted to customize the guardrail. My big impressions were how the guardrail applied to both user input and model response based on my word filters, and how quickly I could get to testing the guardrail with a new model.
Let me know in the comments what your use cases for guardrails could be, and leave a reaction if you made it this far!
Top comments (5)
Amazing to read
Very cool and easy setup. DeepSeek has great serverless prices on Bedrock, but does Guardrails have costs? I feel like everything on Bedrock is either super cheap or expensive lol.
Thanks for the feedback! Guardrails do have a cost; you can take a look at the pricing specifics here. The charges for content filters, denied topics, and PII are based on every 1,000 text units (each text unit contains up to 1,000 characters). Word filters are free. For the setup I demonstrated here, my total Guardrails charge for a week of testing was $0.01.
Curious to see how well these filters hold up under real-world usage. Anyone tried pushing the limits on prompt attacks?
Happy to try this in a future post and see if we can get it to break!