🦄 Making great presentations more accessible.
This project aims to enhances multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - From principles to practice: Scaling AI responsibly with Indeed (AIM3323)
In this video, Mike Diamond from AWS and Lewis Baker from Indeed discuss scaling AI responsibly. Diamond introduces AWS's eight-dimensional Responsible AI framework and the three-line defense model, emphasizing "responsible by design" through baking, filtering, and guiding strategies. He demonstrates the newly published AWS Responsible AI Lens in the Well-Architected Tool and shows how agents can leverage best practices in Quiro. Baker shares Indeed's practical implementation, detailing their AI Constitution approach for Career Scout, which processes 10.6 million AI responses monthly with 17 active guardrails. Both speakers stress that responsible AI must be embedded from design phase, not treated as a policy checkpoint, with Baker noting Indeed's responsible AI team operates as infrastructure within R&D.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Inherent Responsible AI Posture of Every AI System
Good morning everybody. Great to see a full house here. My name is Mike Diamond. I'm the Principal Product Lead for Responsible AI for AWS, and I'll be joined today by Lewis Baker, who's the Senior Data Science Manager and Head of Responsible AI for Indeed. We're going to be talking about scaling AI responsibly.
Every AI system incorporates a specific posture towards responsible AI, by which I mean it's built upon a set of decisions and positions regarding how it should operate responsibly and minimize risks. I think what I'm saying will become clear if I walk through a couple of use case examples.
Let's take the first example: a real estate company building a system to output property descriptions of condominiums for buyers and sellers. They have to consider several questions. Does the generated text welcome all buyer demographic groups equally? Are the property features accurately captured, or are there hallucinations in the descriptions of the properties? Have any private details leaked in the descriptions from previous owners and occupants? And does the content they're using, the images they present, contain unsafe or illegal content?
Let's quickly think through another example that many folks I know are working on today: a shopping agent for an e-commerce site. As this e-commerce company is building the site, they have to consider several questions. Can the agent provide specific recommendations to individuals that are meaningful for that individual but also equitable across the different demographic users of that site? Is it susceptible to accumulating unauthorized charges or exceeding budget? Does it inappropriately share personal data across the payment rails? And is it vulnerable to manipulation attacks designed to trigger unauthorized refunds at scale?
Each of these questions across these two use cases represent inherent technical properties of AI systems, whether the builders of those AI systems address them intentionally or not. Of course, there are consequences of not addressing them intentionally.
Rising AI Incidents and AWS's Eight Dimensions of Responsible AI
The OECD, which is the Organization for Economic Cooperation and Development, has an AI monitor that they use to track incidents and hazards over a number of years. An incident is defined as a harmful event, an AI event that has harmful impact upon individuals, and a hazard is one in which something potentially could have caused harm to individuals but may have caused reputational damage. As you can see in the chart behind me, in October of 2025, it hit the highest point with 509 incidents, which is a 95 percent increase over the same time from last year. This rise during this time period corresponds to the expanded use of AI that we've seen over the past couple of years with generative AI.
At AWS, we define responsible AI across these eight categories that you see here. When I said earlier that there are technical properties that are inherent to an AI system, these are the types of properties I'm talking about. Traditionally, folks who are trained in machine learning are trained to optimize predictions towards accuracy against the loss function. Responsible AI is the art of maximizing the benefits of an AI system and minimizing the risks across these eight dimensions, including the trade-offs that are required between them.
Very quickly, these dimensions and properties that we track here at AWS are as follows. Controllability is the property around having mechanisms to monitor and steer the behavior of the AI system. Privacy and security is around appropriately obtaining and using the underlying data and models that power that system. Safety is the property around preventing harmful misuse around things like scams. Fairness considers the impact of the outputs of the system upon different stakeholders and groups.
Veracity and robustness are the properties around achieving correct outputs even with unexpected or adversarial inputs. Explainability is around understanding and evaluating the outputs. Transparency is similar to explainability, but it's around guiding stakeholders and enabling them to make informed choices about the use of that AI system. Governance incorporates best practices for these technical properties of AI systems, and I'll be talking more about that as I go.
Challenges in Scaling Responsible AI Across Organizations
In my role as a Principal Product Lead for Responsible AI, I get to meet with many of you folks and customers who share with me some of the challenges around addressing responsible AI at scale in their organizations.
One of them is that each of these technical properties requires a certain amount of expertise. There's a science to each of these properties. To address fairness, there are multiple ways that you can measure it. For looking at robustness, there are multiple ways you could do that as well. So there's expertise that's required.
The second challenge that I hear as I talk to customers is that there are many tools provided in the open source community by AWS, by third party vendors, but piecing these tooling together into a holistic solution that you could give to your builder teams across the AI lifecycle is much more challenging. In some organizations, investing in responsible AI is seen as a bottleneck to innovation. We're going to talk more about that.
I'm on the responsible AI team here at AWS and I've seen this personally, but many of the responsible AI experts within organizations feel overwhelmed as the number of AI use cases they need to support has increased so much over the last couple of years with those teams staying constant. One healthcare company I work with told me that they have a backlog for the responsible AI expert team of over 1,000 use cases to review.
Responsible AI also involves compliance, depending on the jurisdiction that you're operating in—within Europe, within Colorado, within California—as well as management standards like ISO 42001. Builder teams have to deal with outputting evidence in the right format to comply with those regulations. So how do we address these challenges?
Three Overarching Strategies: Baking, Filtering, and Guiding
The first thing I want to say, and maybe this is going to be controversial or maybe not, is that across all of the responsible AI technical properties or dimensions that we just discussed, when you look into mitigating risks, there are three overarching strategies you could use to address any of the risks. I'm calling them baking, filtering, and guiding.
Baking involves building into the AI system itself the desired behavior that you want. If you're looking to mitigate risks of unwanted bias, you may do this by building out datasets with distributions of demographic groupings according to how the system will be used. If you're looking to mitigate risks of hallucinations, going back to that condominium description use case we were discussing, you may have a RAG pipeline that would ground the answers in data.
Filtering is around blocking both inputs into the AI system as well as outputs that come out of it. Going back to our use case with the condominium description, we may want to put in guardrails that would filter or block out the personally identifiable information from entering into the AI system or the output. As that system integrates with APIs or returns to the user, you may be filtering out. That's filtering.
Guiding is around steering your users—the users of the system, which could be human users or they could be agents or other systems—around the proper use of that system and the intended use. It's about being transparent about the limitations of the system through things like data cards, model cards, and AI system cards. So those are the three overarching techniques that we use to address responsible AI.
Applying the Three Lines of Defense Model to AI Risk Management
Maybe by a show of hands here, how many folks are familiar with the three lines of defense model that's used in regulated industries to address risks? I see a good number. That's good. What I've done here in this slide is applied that model, which I'll describe, to AI risk management.
If you start all the way over on the left, that's your first line. Those are your builder teams and they're responsible for building the actual safeguards and controls into the AI system itself. Your second line is that group that I talked about—the group of AI experts that support the first line. There will be multiple second line teams in organizations. You may have a cloud security second line team. You may have compliance, anti-money laundering, or other forms of risk management that you're working with. That second line team will work closely and guide that first line team, help them, and set up practices.
Then all the way on the right you have the third line, which is ultimately your internal audit and your independent assurance roles. These teams interface with the external auditors on the right that you see there.
We work with different agencies and consulting agencies that provide internal security and overall assurance to the organization. When organizations talk about responsible AI being a bottleneck, what I often see is that responsible AI is being addressed in the handover between the first and second line to the third line. If you're addressing your responsible AI there, it's too late because now you have to rearchitect and start over again.
Just as in security we solve this by the notion of secure by design, which is a concept many folks are familiar with, we use a concept called responsible by design. It involves taking the policies ultimately set by the third line that relate to the various management standards and regulations, translating them into a set of best practices by the second line, and pushing those best practices to your builder team. This eliminates a lot of the bottlenecks and can accelerate your development practices.
That's what we've done here at AWS. We have defined what we call our responsible AI best practice framework, which is represented here on the slide. This is the framework that's used by our builder teams as they build the AI services that you'll hear about over the next couple of days or that you're using currently. There are a couple of benefits of doing it this way. The first benefit we just discussed is that the more you could shift left your policies into your builder team so that they're building correctly the first time, the more you could accelerate your innovation.
The second benefit is that if you think about it, on the right you have all of your management standards, your policies that your organization is setting, and your regulations that we discussed. Then all the way on the left, you have all your tools, and there are new tools that come out every week in the open source community by vendors that address different properties of responsible AI. Those are two very dynamic layers. If you put a best practice framework like I'm positioning here in the middle of it, it gives stability and robustness. You can map and onboard new tooling onto that best practice framework, and the outputs can be in a standardized format that can meet multiple regulations.
AWS Responsible AI Best Practice Framework and the Well-Architected Tool
What is the best practice framework? I'm going to go through it at a very high level. You can see it here on the slide. It spans across the AI and ML lifecycle: design, develop, and operate. The first set of best practices that we've defined for our builder teams is around narrowly defining the use case, the intended use case. This is a very common problem that I've seen with builder teams. If you define it very widely and broadly, you're increasing your risk exposure significantly. Narrow the use case properly and you narrow the risks right up front.
For that use case, we've defined a set of best practices to help identify the inherent risks. Consider the actual stakeholders individually and then go through a framework like the one I had with the eight dimensions and consider how for each of the stakeholders what are the potential risks. Then rank those risks according to their impact and according to the likelihood or frequency with which they'll happen so that you could address them. This sets up best practices around identifying risks for builder teams.
The third best practice is really a working backwards mentality. Initially upfront, even during the design phase, establish release criteria, metric-based criteria for how you're going to measure the highest risks that you're concerned about from the previous exercise. Define the metrics and the thresholds for those metrics as release criteria. This is represented intentionally as a circle because this will be iterative. As you learn more about your AI use case, you may want to adjust those as you start to work with the data. It's important to set those goals up front as design principles and then work backwards from the metrics.
The next set of best practices that we've defined are during the develop phase. For each of the release criteria that the teams establish and for each of the risks, you need a way to test it. Even before we design the system, we design the testing sets that will test the individual risks. Make sure you have, based on your understanding of the inputs and outputs, if you're testing for fairness, the right demographic groupings within your data that are in a statistically valid way. Develop data sets and you could use one data set across multiple risks, but make sure you have data sets for testing all the risks.
Then design and build the AI system, which is where the previous slide we discussed about baking, filtering, and guiding or steering users comes into play. Using those three overarching strategies, design the system to address those risks and then actually test it by running your evaluation with your evaluation suites. At the end of that, you may have risks that are still present, in which case this is why the process is circular and iterative, or you could accept those risks but document them. That's really where the next set of best practices comes in, which is that our teams will build out guidance tools, really data cards, model cards, and AI system cards which we provide to users of AWS. These guide them and are as transparent as possible around the limitations of the AI system itself.
Then when you get into the monitoring phase, define a set of metrics that relate to your specific release criteria and make sure you're monitoring the same set of metrics on an ongoing basis. Last week we actually published a version of the Responsible AI Framework that I just discussed, which we use internally. This is now inside the Well-Architected Tool. Previously within the Well-Architected Tool there was a Machine Learning Lens and a Generative AI Lens, and now there is a Responsible AI Lens that's based on the AWS Responsible AI Best Practices Framework that I just discussed.
The focus areas in this pie chart are the same focus areas for the lens that you see on the left, and for each of the focus areas, there are between one and five questions that you answer. This is showing one of those questions about how to define the specific problem you're trying to solve. For each of the questions, there's one to five sets of best practices for fulfilling that. On the right side there is a paragraph which tells you about that best practice with a link which opens up a guidance paper that is very valuable and will give you more information about the implementation steps for implementing that best practice. This framework is also available on GitHub. We published it there too.
Once you complete the Well-Architected Tool and answer all the questions, at the end of it you get an assessment, so you'll get a list of high, medium, and low risks to consider, and you'll get an improvement plan which is shown on the bottom. This improvement plan will be specific to how you've answered the questions. I encourage you to check that out. That's our first publication of the AWS Responsible AI Framework and we'll be updating that over time.
Agent-Driven Development: Implementing Best Practices in Quiro
The last point I want to make before handing it over to Lewis is that once you've defined your best practices in a framework, it's not just your human builders that benefit from this process, but agents as well. As we all are investing more in agent-driven development and specification-driven development, agents can read natural language and can read your best practices just as your builders can. Gartner talks about guardian agents and using them for trust and safety. This is that concept: defining your responsibility practices as a set of best practices for your agent.
Let me show you a little bit about what I mean by that. This is Quiro, which is an integrated development environment, and it has within it an agent. One of the novelties of Quiro is that it incorporates the idea of specification-driven development. I should say it does not come with the AWS Responsible AI Framework in Quiro. I've added it here, and it's not hard to do. I'll explain how I use the Quiro constructs to do that.
Quiro distinguishes between two types of specification files. One is steering files, and those are more of your organizational policies and best practices. So the Responsible AI best practices framework fits very neatly into that concept, and that's teaching the agent about how to guide the builders.
Once a builder starts a project, they describe the project to Quiro in natural language, and then Quiro outputs use case specification files, which are at the top. This is the same condominium description use case we discussed earlier. Quiro has read the best practice framework and starts creating specification files to address the different parts of it, which your builders should now review before Quiro starts creating code. Ultimately, they will be creating code according to the best practices.
Indeed's Scale and the Real-World Risks of AI Systems
Hi everybody, my name is Lewis Baker. I am a data scientist and manager and the head of responsible AI at Indeed. Mike and I have been talking for quite some time as mutual information sharing about product usage and responsible AI, and he asked me to come up here and tell you a little bit about a practical application of responsible AI at Indeed.
For those of you who are unfamiliar with Indeed, you might know us as the job website. What you might not know immediately is the sheer scale that Indeed operates. Indeed has 635 million job seeker profiles, which is resume data from individuals looking for jobs, and we have 3.3 million employers who are trying to sort through all of those people looking for jobs.
A lot of the time when I talk to people within the industry, but more often at social gatherings, and I tell them that I do responsible AI at Indeed, they say, "Oh, I didn't realize that Indeed used AI." The reality is that AI is the only way that you can sift through that volume. Indeed is not a job board. Indeed is a search engine, so it starts all the way at the top with all the jobs in one place.
You need more data about those jobs. You need to understand what it is that people actively need in order to do that work. You need to advertise those jobs to get more traffic to the website. Once you have more job seekers, you need to understand what exactly they are looking for, what skills they have, what skills they do not have, what skills they list. Does anybody put Microsoft Word on their resume anymore? That is a question you need to figure out.
As you identify the jobs that people are applying to and the skills required, you are able to identify what leads to success, which gets you more employers and the cycle continues. I go through all of this to tell you that Indeed is an AI company and AI is deeply embedded in every single thing that we do.
Let me give you a specific example about how to do this responsibly. This is the Indeed Career Scout, and Career Scout is a fairly straightforward chatbot experience. You come in with a list of recommended options that you can go through, or you can just have an open-ended conversation about your career. Some of the highlighted tools are to build up your resume, to do a job search, or to search for jobs outside your current field.
If you go down that first option, you can have an open-ended conversation about, for example, "I have been working as a receptionist for quite some time. I have been interested in the creative field and would like to try that out." Career Scout will guide you through what skills you have, what skills you need, and what sort of options might be available for you to make a lateral career move.
Afterwards, it will give you a series of jobs that you can say, "Yes, is this what you are looking for?" and eventually you can apply to those jobs. Conceptually, it is very simple. You are just having a conversation, you are just trying to find a job, and with any luck, we will help you get one.
There are challenges with this. Every simple system has a million ways to be used incorrectly. If you are at this talk, you probably have heard of things going wrong before, so I will speed up through it. For example, less than two years ago, a Google AI overview, if you asked it how many rocks a person should eat, it recommended about one to get your daily nutritional value. That is not great. It is not super harmful, but you get the point.
Later than that, if you asked it about smoking while pregnant, it would say two to three cigarettes a day is great. That is not correct, but again, you get the idea.
That's Target. It's not for the record, but again, not good. I've purposefully given you some fairly mundane ones. You might be thinking, what's the big deal? Let me tell you, I've seen some truly terrible things, awful things that I won't share with you here. But if the safety of other people or concerns of people who should be informed not to follow the advice to eat rocks don't concern you, then perhaps it might concern your bottom line when in this very real circumstance—an early, to be fair, very early version of ChatGPT—when embedded as a chatbot into a car sales dealership, not only sold someone a Chevy Tahoe for one dollar, it said that was a legally binding statement. I guarantee you that harm for the company is real and is a thing that can come out of agents.
The AI Alignment Flywheel: From Human Values to AI Constitution
So the question here is how do you keep AI on the rails? Mike walked you through a high-level framework and some specific tools within Quiro to do spec-driven development. I'd like to go into specifically how Indeed has done this. Going back to the AI dimensions that Mike talked about, there's a whole landscape of things to worry about out there. For Indeed specifically, we have a very robust security team, privacy teams, and governance teams. So for responsible AI, I'm going to concentrate on these four things—not to say I don't care about the other things, but we have to get out of this room eventually. We're going to focus on safety, fairness, transparency, and veracity. So how do we make sure that things are truthful, fair, they make sense, and they are safe for use?
Our general strategy is to follow this flywheel: anticipate what could possibly go wrong beforehand and use this to inform things at the design stage. This is what Mike was talking about when it comes to spec-driven development. You need to center having something safe as a design requirement, and you need to have tangible metrics in order to get there. Once something's out in the world, you cannot just trust that things are going to go the way your developers thought they would. You need to put in guardrails in place, and I'll talk through a little bit about what some real-time guardrails are here, but generally any sort of moderation system can be viewed as a guardrail.
And then lastly, you need to observe. I am probably speaking to the choir, but if you don't log it, it didn't exist. You need to know exactly what happened. You need to be able to learn from that experience so that you can anticipate future problems and just in general make your product better. This entire thing, as an umbrella term, is known as AI alignment. The idea of AI alignment is that you want your open-ended AI system to be aligned with your values. That kind of makes sense up front. When you dig into that concept a little bit more, the question then becomes: what are my values?
I know that my job at Indeed is to help people get jobs. I want whatever I do to help with that, and I know in general I don't want people to have a bad experience. But then there's all the stuff in the middle, right? I obviously don't want my chatbot to be making threats of violence. I don't want it to be using hate speech. What kind of tone do I want my chatbot to have? Do I need it to be very rigid and stuffy? Do I want it to make jokes? Do I want it to stay specifically on track with helping people to find a job through the job search? Do I want it to be able to do other things, like if somebody asks me to do their math homework, am I going to let the chatbot do that? These are all things that need to be settled way, way, way before you hand this problem to your responsible AI team. This is a problem of human alignment.
The way that we handled this at Indeed is we got all my lovely peers from a whole bunch of different departments in a room for like three hours, and we hashed out exactly what it was from each of our perspectives that a good product looked like. I can speak of course for responsible AI, but trust and safety, security—they have a very specific posture that they want their agent to also uphold. We put all these things down into an AI constitution , and it is exactly like it sounds. It is a big old document.
Where people established principles and guidelines. We want to help people get jobs. We do not want to pose any sort of privacy event. We don't want to have any sort of security threats. We do want the experience to be relatively seamless. We do want to keep people on the website to do their searches.
From this, you get your spec. This is where every time a new product is developed, people will reference the AI Constitution and they are able to align it by design. They are able to say confidently what sort of tone their chatbot should have, what sort of data it should have access to, and what sort of data should be deleted after a session is over.
We then further go into this by saying, great, now that you've built it to this, let's build guardrails that check to make sure that it works. Guardrails are as simple as having a parallel prompt or maybe even something in the system prompt that just says make sure that whatever is said follows X, Y, and Z standards.
At the very end you observe. You make sure that X, Y, and Z standards are followed. You make sure that you label anything that got flagged by your guardrails so you can monitor it and see what kind of harm occurred. You look into it and identify things that you missed.
This is how you go from big human alignment to AI alignment. I guarantee you it's harder than one slide would state.
Career Scout in Practice: Red Teaming, Guardrails, and Platformization at Scale
Let's look specifically at how we did alignment for Career Scout. Career Scout is an opening conversation about what you want your career to be. To anticipate the issues that might happen, we sat down and created a series of AI red teaming events with specific rubrics that would calibrate what was and was not acceptable.
Before anything got out the door, we created an adversarial AI agent, an attacker LLM, and we loaded it with a prompt. That prompt could be to try to discriminate, perhaps try to get somebody's Social Security number, or perhaps try to engage someone with a scam.
The adversarial LLM then goes and talks to our chatbot over and over again. Every single iteration we have a series of rubrics to try to identify whether a scam successfully got past our guardrails or not. We rinse and repeat this. We do this several thousand times with several different agent personas to see how robust our guardrails actually are.
Now when things go into production, we have anticipated a great range of potential harms that we can guardrail against. The most basic of this is standard content moderation. Even that's very complicated, but in general, one of our values is we don't want our chatbots to swear at people, so we have filters to make sure that there's no swearing on our platform.
Things get more complicated and more contextual, so we also have contextual guardrails. These are secondary system prompts that evaluate whether something is going along our terms and conditions. These are the sort of things that allow you to detect whether something is a harm or not.
If you ask the agent to tell somebody to kick rocks, that's not a very nice thing to say. We have a guardrail prompt that says do not allow the following types of speech and also give it this flag. This one is flagged as a harm. It says do not facilitate hateful or harmful topics of conversation. An LLM as a judge has rated this above our threshold limit and has flagged this as harm.
Instead of generating a response, the AI now generates the response, "I'm sorry, I can't tell people to kick rocks." That is functionally what a guardrail looks like.
After you have your product and you've tried to anticipate how it could go wrong, you build guardrails to keep it from going wrong. You then observe to see what actually happened. A major part of this is just logging events. You don't necessarily have to log them forever, but you do need to log them to know what happens. Anomaly detection is a big thing.
Is there any event that's causing your moderation system to get tripped more regularly than not? It could be an adversarial attack. It could be that something's trending on TikTok. It could be that something's broken in your system. Either way, you need to be able to detect it. There's also something that we lovingly refer to as our unknown unknown analysis, which is that you take a very large sample of events that were not flagged and you press it through a bunch of experimental things to see if there's anything you might have missed. You perform cluster analysis to see if there's any conversations that are on the edge. It could be that 99 point something degrees are people trying to search for a job, and there's always a fraction of people out there trying to figure out how to build a car from scratch, and you want to figure out what's going on there and how it got past your guardrails.
I say all of this because I really want to drive home a point that Mike made upfront. Every single AI system has a responsible AI posture. If you are driving responsible AI from a policy perspective, you are too late. Responsible AI begins before a single line of code is written, before a single wireframe is drawn up. Responsible AI is an infrastructure investment. Responsible AI at Indeed is part of our AI infrastructure organization. We are a necessary part of R&D.
To give you some demonstrated things, we currently have 17 active guardrails. We have them propped up across every single AI interface at this company. We are currently processing 10.6 million AI responses a month. If we were a policy team, we would not be able to do all that by hand. We can't just go to every single team and ask them to follow the rules I wrote down. What we've done is we've created a series of tools, we have created a series of checks, and people cannot release their model into production until they have passed those checks.
The takeaways that I want to give to you right now are as follows. First and foremost, the principal part of responsible AI is knowing what your values are and getting alignment behind them. You need to know what your company does and does not want. You need to know what is and is not possible, and you need to tell your developers that because they're going to assume if you don't. You need to embed responsibility at every single stage of development. Every single system has a responsible AI posture, whether you like it or not. There's a whole mess of ways that something can be used. There's a whole mess of regulations that you might not even know about. You need to anticipate those. You need to build up guardrails for them, and then you need to monitor what actually happens.
At the very end, find a path towards platformization. Mike gave you several wonderful options earlier. You can build this stuff into your development. You can build checks in. You can do things responsibly so your developers don't have to. That's everything. I'll have Mike come back up to close this off.
Thank you. So just some links if you want to scan the QR codes there. If you want to learn more about responsible AI at AWS or Indeed, those are the first top two boxes that'll take you to the websites. The responsible AI lens that I showed, which is available in the Well-Architected Tool, that's the QR code that'll get you there. And then of course Qiro if you're interested in using that, that's the link there. If you're interested in learning more about AI at AWS, here are some courses to consider for your learning journey.
; This article is entirely auto-generated using Amazon Bedrock.





















































Top comments (0)