Kazuya

AWS re:Invent 2025 - Innovation in Identity Security: how we protect the cloud & help you do it too

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Innovation in Identity Security: how we protect the cloud & help you do it too

In this video, AWS Identity Security leaders Ilya Epshteyn and Kristen Haught, along with Capital One's Chris Schultz, discuss identity security evolution and best practices. Kristen explains how AWS achieves authorization correctness through mathematically proven verification, processing 2 billion authorization decisions per second with zero specification mismatches across quadrillions of evaluations. Ilya introduces data perimeter controls using Service Control Policies, Resource Control Policies, and VPC endpoint policies, with reference implementations available in AWS repositories. Chris shares Capital One's journey implementing data perimeters since 2019, emphasizing role path segregation using application identifiers and proper PassRole/AssumeRole controls. The session concludes with new launches including AWS IAM Outbound Identity Federation for eliminating long-term credentials, AWS login for simplified CLI access, IAM temporary delegation for marketplace integrations, and IAM Policy Autopilot for automated policy generation from code analysis.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: The Evolution of AWS Identity and Access Management

Thank you for coming here today. My name is Ilya Epshteyn, and I lead the Identity Security and Solutions team within the Identity and Governance Organization within AWS. I'm joined here by Kristen Haught, who leads the Identity Security team, and Chris Schultz, who leads the Cloud IAM Team at Capital One. Before we get started, I just want to talk a little bit about how we got here.

Thumbnail 30

So when AWS started, there really wasn't anything like IAM. You had your username and password. The same username and password you used to buy books was the same login that you used to go ahead and use SQS, S3, or whatever was available. Obviously, as the cloud thing became real and startups became enterprises, we realized we needed something more. So in 2011, we launched IAM, and that of course gave us all the primitives like IAM users, IAM groups, and policies.

Very soon after that, we created roles and federation. In 2011, we actually launched our first machine identity, which was EC2 instance profiles, so customers didn't have to have long-term credentials on their EC2 instances. That was all good, but very quickly our customers outgrew a single account environment, and customers started building more and more AWS accounts. So we needed something more.

In 2017, we launched AWS Organizations and Service Control Policies. How many folks here use Service Control Policies? Okay, awesome. SCPs have allowed you to start governing your multi-account environment. Customers wanted more and more of those controls, so over the last bunch of years, we started adding more capabilities like enterprise guardrails and data perimeter capabilities.

Customers also wanted to have more visibility. Tell us what's happening across my multi-account environment. So we invested a lot in Access Analyzer to be able to do things like who has access to my resources outside of my trust zone, outside of my account, outside of my organization, things like unused access findings and things like that.

Last year we launched RCPs. How many folks here are using RCPs? Okay, hopefully it will be a little bit more after this. We also launched declarative policies, which allows you to enforce resource configuration. RCPs, of course, allow you to define the maximum permissions allowed for your resources.

I think this year it's really interesting. We're kind of in a new era where a lot of our customers are asking to take their strong identity foundation and be able to use it outside of AWS. For example, being able to use your STS tokens with outbound IAM federation and be able to access resources that are outside AWS cloud. Customers are also asking us to go ahead and make it easier, for example, to onboard marketplace products, go ahead and allow temporary delegated administration. We'll talk a lot about some of the innovation a little bit later.

Thumbnail 180

As the cloud evolved, so have your speakers. I'm not going to go and tell different stories here, but we'll hang around and you could definitely talk to us. There are some interesting stories here. My main point here is we've been doing cloud security and IAM for a long, long time. Our goal for this session is to take our expertise and our experiences and pass it on to you, so that whatever we've done in cloud security, both in AWS as well as our customers, you could do the same thing in your organizations.

Thumbnail 210

AWS Identity Vision: Building Trust Through Consistency and Transparency

So how is our session going to go? First, Kristen is going to talk about security of the cloud. So what are all of the tools, the mechanisms, the capabilities that we use to secure the underlying identity across AWS services. Then I'll come back and talk about security in the cloud. We'll talk about what are the things that we're doing to make it easy for you to uphold your side of the shared responsibility model.

Then Chris is going to talk about how they've implemented some of the IAM best practices, including data perimeters and other IAM controls at Capital One. He'll talk about their journey and lessons learned. And then we'll close off with some of the awesome launches that happened over the last week as well as during re:Invent.

Thumbnail 260

Now before I pass it over to Kristen, one of the things that I wanted to share is what is our vision in AWS Identity. We view identity security really as a foundation of trust, because every request, every API call that we authorize is really somebody's business, somebody's customer's data. It's really the source of trust in everything you do in AWS, so we take that very seriously.

For us, it's not just about making sure that things work. We also want to make sure that things are straightforward to use and behave as expected. So you'll hear a lot about things like consistency and the expectations that customers have when they use AWS Identity. We are the protectors of trust, and we make sure that we uphold that responsibility.

Ultimately, what we're trying to do really is empower builders. You always hear from our security leaders, security should never be a barrier to innovation. Security should be something that enables innovation.

Thumbnail 320

And our goal is to make sure that the default path is the secure path. That's our vision. Now that sounds good, but we actually have specific principles that we use to anchor on this vision, and there are four key principles that I want to call out.

The first one is consistency. Customers want to see consistency in how AWS services, over 300 services, integrate with AWS IAM and other identity services. Because if it's consistent, then you know what to expect and then you can secure it, so that's number one. The second one is that we want to make it straightforward for you to secure or block a service or any part of a service. Some customers want to enable certain capabilities or features, or they want to disable them for regulatory reasons or some other requirement. We need to make sure that you have control and the right toggles to turn the different capabilities of a service on or off.

We also want to make sure that the existing controls always hold true. If you've implemented a policy that says this should be denied, an AWS service should not go ahead and implement a new feature that creates a new path that changes that. You have to be able to change that explicitly yourself. And the last one is that everything should be transparent to you within your AWS accounts. Identity has a lot of different primitives, and there are different ways to impersonate users and gain access. It doesn't matter: you should be able to see in the documentation how a service behaves, you should be able to see it in CloudTrail, and all access should be transparent to you. So with this vision in mind, and also the principles, I'm going to pass it over to Kristen, who will talk about the identity security of the cloud.

Thumbnail 430

Security of the Cloud: The Shared Responsibility Model and Identity Security Team

Awesome, thank you Ilya. So as you mentioned, I'm going to talk about identity security of the cloud. I'm going to start with the shared responsibility model. Who has heard of the shared responsibility model? Ah, good, good answer. So one of my first jobs at AWS was to support our audits. I traveled around the world and facilitated auditors through data centers, and it was fascinating to learn about our physical security controls. And then I also facilitated how we collect evidence across various services for our PCI audit and then our SOC audits.

How many of you have used AWS Artifact to download an audit report from the console? Awesome. Yeah, depending on your needs as a customer, you might use different reports, like the PCI report if you're processing payments. The SOC 2 report or the SOC 1 report if you're a publicly traded company. Either way, it's a great resource for you to understand what are the controls that AWS puts in place to uphold their side of the shared responsibility model.

Thumbnail 490

So in the context of identity and what I'm going to talk about today, customers are responsible for selecting who has access to their resources. And AWS is responsible for only authorizing access that has been specified by you. So in our SOC report, as I mentioned, there's a control, Control 3.5, and it specifically says what I have up here, that AWS prevents customers from accessing AWS resources that are not assigned to them. And it walks you through how the testing procedures were performed and whatnot.

But what I really want to talk about here is when I was working on those audits, it taught me a lot about how AWS upholds our side of the shared responsibility model. But it also triggered my curiosity for how do we implement these controls across all of our services, especially when AWS is growing at such a rapid pace. And I've had the privilege over the past almost 12 years at AWS to work on many different initiatives with a lot of different services to raise the bar on security. And today I'm excited to tell you a little bit about what we're doing in identity to raise the bar in that space.

Thumbnail 550

Thumbnail 570

So the Identity Security team, we work across AWS identity services. And we exist to reduce risk in those services and ensure that they're built throughout their lifecycle with security in mind, right? So our team of security engineers, we have three core functions. One, prevent issues at design time with defense-in-depth architectures. Two, proactively look for unknown threats. We do offensive security, red teaming practices. And when issues are identified, right, we really work to identify what do we do to prevent that issue from happening again. Sometimes that requires driving systemic changes.

Thumbnail 590

And three, we own strategic security efforts that raise the security bar across our services. So one of these efforts is something I'm going to tell you about today, which is how we are driving authorization correctness and consistency across all of the AWS services. This effort's core to my team's mission

Thumbnail 610

to make AWS services inherently secure and resilient against identity-related threats. But before I go into the details of that effort, I want to talk to you about something else that's truly foundational to my team, as well as every other engineering team at the company. Any guesses on what that is? You can just think of it in your head since we're in a session like this. If anyone thought culture, you were right.

Thumbnail 630

Fostering a Security-First Culture Through the Weekly Security Meeting

While technology empowers us, we still work with humans and still rely on humans. The flywheel that I have up here works for all different types of organizational structures. It builds momentum by encouraging engineers to identify issues, leveraging leaders and a culture that celebrates those issues, and ensuring that as things get addressed, systemic changes are put in place. Those changes reduce the number of issues we identify over time and make the cycle self-sustaining. At AWS we have what we call two-pizza teams. You might have heard about this: a team needs to be about the right size for two pizzas to feed it.

And at AWS that also means that every two-pizza team owns its own security end-to-end, hard stop, no exceptions. As each team identifies and reports issues, they start to develop this culture and expectation, with leadership enforcement, to own their own security. So there are a couple of mechanisms that help power this flywheel, and I'm going to tell you about one: our weekly security meeting.

Thumbnail 710

You may have heard our CISO talk about the weekly security meeting in a keynote at re:Invent before. It's something that's been around since the very early days of AWS. Andy Jassy implemented this mechanism, and he and his direct reports meet every single week to review and discuss emergent security issues. The purpose of the meeting is to ensure that their teams are aligned on what needs to be done, on the level of the risks and the trade-offs, and to discuss investments that need to be made in order to address the issue properly.

Thumbnail 770

And the benefit to the engineering teams is that they get access to leadership guidance and security engineering expertise in a timely manner, to ensure the right things are happening to effectively address the issue. And the benefit to leadership is an opportunity to consistently inspect how engineering teams are addressing security risks. So whether you walk away from this session thinking that you want to implement your own weekly security meeting within your team or organization, or maybe that's too big of a leap and you just want to incorporate some of the concepts into your own mechanisms, I wanted to share three tenets that I think are really core to implementing this.

And the first one is obsessing over prevention. Detections are good but preventions are better. So obsess over them in every single security discussion. Escalate without fear. This is a big part of Amazon's culture, not just specific to security. We tell engineers consistently, if in doubt, escalate. And most importantly there is no penalty for escalation.

And the third one is owning the hunt. We talk about this a lot. Every two-pizza team owns their own security, right? So this means taking the initiative to identify risks, being proactive to address them, and being fully committed and determined to achieve the security outcomes that our customers expect. And then the last ingredient here, which I think is unfortunately the most forgotten but maybe the most important, is celebration.

Recognize the people on your team, the people in your organization who identify security issues, or work on security issues, or are constantly thinking about security issues. Provide that recognition to them and to their peers to demonstrate how important it is. Our VP does this in our weekly security meeting. The first thing he says when we review an issue is, who identified it? And then he says, I'm going to send an email to them to personally thank them for identifying and reporting it. So it's a really important thing to do, and it also takes very little time.

Thumbnail 880

Proving Authorization Correctness: Formal Verification of the IAM Engine

All right, so with that, let's pivot to one of those strategic initiatives I mentioned earlier that's a top priority for our team. And this effort stems from customers asking for two key things. One, provable correctness of the underlying IAM authorization engine. You trust us with performing authorization, so you deserve the right to verify that it's done correctly for every single request. The second one is consistent, established guardrails and appropriate access for applications at scale across your organization.

Thumbnail 910

So all services at AWS are required to authorize actions, for example, S3 GetObject, and resources, an object in this case, in the same way using IAM. So let's start here and I'll talk you through the journey of us driving this authorization consistency and correctness effort.

Thumbnail 940

Thumbnail 950

Thumbnail 960

So IAM handles authorization requests at a massive scale across every service in every region, right? It's just a series of string matching. I'm going to walk you through it really quick. Once someone explained it to me this way, it made a lot more sense. Your request has a bunch of different properties. The action you're taking, S3 GetObject, the property of the principal, the person or the application that's making the request. The properties of the resource, the ARN. The properties of the role session, like aws:SourceIdentity. And properties of the network, a good example is VpcOrgId. Which by the way, that condition key is a great condition key to use for implementing network security controls across your organization. So it's specific to making sure calls are made from a VPC within your organization, an account within your organization.

Thumbnail 990

Thumbnail 1030

So anyways, now that S3 has assembled this request context, or this official scroll that fully describes every aspect of your request, the authorization engine then does this series of string matching. So your policy as JSON has statements associated with it. Some statements say allow, some say deny, some will match, some won't. For the ones that match, they need an allow. We go through all the criteria to identify all of the matches. If there's a deny at any point in time, what happens? We deny. If there is an allow match at any point in time and there is no deny, then we will allow. And if there is no match, then we deny. That's called the implicit deny. And we do this 2 billion times every single second, which is just mind blowing to me to think about. So that means every single second there are 2 billion API calls that IAM makes an authorization decision for.
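
The decision order just described (an explicit deny always wins, an explicit allow wins otherwise, and everything else is implicitly denied) is easy to see in a few lines of code. Below is a minimal Python sketch of that order only; it is an illustration, with statement matching reduced to simple wildcard string comparison, and not the actual AWS engine.

```python
# Illustrative sketch of the IAM decision order described above
# (explicit deny wins, then explicit allow, otherwise implicit deny).
# Matching is simplified to wildcard string comparison; this is NOT
# the real AWS authorization engine.
from fnmatch import fnmatch

def statement_matches(statement, request):
    """A statement matches if its Action and Resource patterns cover the request."""
    actions = statement.get("Action", [])
    resources = statement.get("Resource", [])
    return (any(fnmatch(request["action"], a) for a in actions) and
            any(fnmatch(request["resource"], r) for r in resources))

def evaluate(policies, request):
    decision = "ImplicitDeny"            # denied by default
    for policy in policies:
        for statement in policy["Statement"]:
            if not statement_matches(statement, request):
                continue
            if statement["Effect"] == "Deny":
                return "ExplicitDeny"    # deny trumps allow, stop immediately
            decision = "Allow"           # remember the allow, keep scanning for denies
    return decision

request = {"action": "s3:GetObject", "resource": "arn:aws:s3:::my-bucket/report.csv"}
allow_policy = {"Statement": [{"Effect": "Allow",
                               "Action": ["s3:GetObject"],
                               "Resource": ["arn:aws:s3:::my-bucket/*"]}]}
deny_policy = {"Statement": [{"Effect": "Deny",
                              "Action": ["s3:*"],
                              "Resource": ["arn:aws:s3:::my-bucket/*"]}]}

print(evaluate([allow_policy], request))               # -> "Allow"
print(evaluate([allow_policy, deny_policy], request))  # -> "ExplicitDeny" (deny trumps allow)
print(evaluate([], request))                           # -> "ImplicitDeny" (denied by default)
```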

Thumbnail 1050

Thumbnail 1060

Thumbnail 1070

Thumbnail 1080

Thumbnail 1100

So let's walk through this a little bit. Start with the principal here on the left. This could be a human, it could be an application, role, user, it doesn't matter. They'll start by sending a SigV4 signed request. S3 will receive that request and will use the Auth Runtime Client to check with the AuthRuntime Service to see if the signature is valid. This is the authentication portion of the request. The AuthRuntime Service will then return the identity policies associated with that principal back to S3. And then authorization happens. It happens within the service API, built into the runtime client library. S3 then uses the Auth Runtime Client locally to determine if the principal is authorized to perform the action on the resource they requested. This is the line where all the policy evaluation logic, the series of string matching, happens. Once we have validated that you're authorized, S3 does what you actually asked it to do and returns a response back to you. And we do this same flow 2 billion times every second.

Thumbnail 1110

Thumbnail 1140

Thumbnail 1150

So we're going to build on this as we discuss what we've done to drive towards correctness and consistency for customers. So the first question that we had to ask ourselves is, how can we be sure that the policies are being evaluated correctly? How can we prove the core properties of IAM authorization are operating correctly? So some examples of these core properties that I'm talking about are denied by default. I kind of mentioned this earlier. Are all requests denied without a matching allow? So for access to be granted, right, someone has to go and explicitly add access. Another example of one of these core properties is deny trumps allow. So are all requests denied when a matching allow and a matching deny exists? This is powerful because when you use a deny, you know access is denied. It's not dependent on some allow statement that came earlier or some allow statement that will come later. It will always be denied.

Thumbnail 1170

Thumbnail 1190

Thumbnail 1200

So we use algorithms to construct and audit proofs of core security properties like these in the IAM engine to verify correctness. So how did we do this? Well, the team built and deployed a mathematically proven correct version of the authorization engine. Here's the approach we took at a high level; I would love to go into more detail with you sometime. One, creating a formal understanding of what AWS authorization should do. We did that using a verifiable programming language. We then proved the formal specification is correct for all possible inputs using automated reasoning.

Thumbnail 1210

Thumbnail 1220

Thumbnail 1230

Thumbnail 1240

Thumbnail 1250

We then transformed those formal specifications into optimized code. And then we verified, using real-world requests, that all outputs are identical to production. As a result of this, we can now validate the IAM engine through formal specification matching. We can now perform real-world testing at 2 billion requests per second. And we're now able to demonstrate zero specification mismatches across quadrillions of evaluations. In addition to being able to verify correctness, it also helped us optimize the engine, with 65% performance gains.
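
That last step, checking a formally derived implementation against production on real traffic, is essentially differential testing. Below is a minimal sketch of the idea; the record fields and function names are hypothetical, for illustration only, and do not represent AWS's internal tooling.

```python
# Differential-testing sketch of the verification step described above: replay
# sampled real-world authorization requests through an evaluator derived from
# the formal specification and compare against the decision the production
# engine actually returned. Record fields and function names are hypothetical.
def find_mismatches(samples, reference_evaluate):
    """Return every sample where the reference evaluator disagrees with production."""
    mismatches = []
    for sample in samples:
        expected = sample["production_decision"]      # decision recorded in production
        actual = reference_evaluate(sample["request_context"], sample["policies"])
        if actual != expected:                        # any disagreement is a spec mismatch
            mismatches.append(sample)
    return mismatches

# Run continuously over sampled traffic; the target described in the talk is
# zero mismatches across quadrillions of evaluations.
```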

Thumbnail 1260

Thumbnail 1290

Achieving Authorization Consistency Through Auth Context and Automated Translation

So once that was done, the next step was to deliver a scalable way to analyze the authorization context. The auth context is everything needed to explain the authorization decision. Our teams built a new feature that we're going to call auth context. It sends a sampling of auth context for offline analysis, and we collect hundreds of millions of samples per day. It uses a dynamic sampling methodology to ensure that we've captured the samples needed for a complete picture of all of the requests.

Thumbnail 1300

Thumbnail 1340

We then use this internal service, auth context, to run rules, do deep dives, do investigations to help us identify inconsistencies. For example, we detected an inconsistency in EC2 copy snapshot in the resource level permissions. We learned that the resource level permission specified for this action applied to the new snapshot, not to the source snapshot. This has now been corrected, of course, as I'm telling you about this. And the team's actually written a blog about it to tell you a little bit about that story. But these are the types of things that we weren't able to detect before that we are now able to detect. And we're able to do this pre-production, before services are launching. But we needed to figure out a way to enforce this, right?

Thumbnail 1370

The challenging part is the service inputs that are required for authorization to be performed. One is their API data model, everything we need to know about that service and its API. Another is the properties of the request, which we talked about earlier. And then the service also provides the authorization logic for some scenarios, like the EC2 copy snapshot scenario. Our thought process was just that: we needed to reduce the number of things that we rely on the service for. We needed to do that to avoid some of these inconsistencies.

Thumbnail 1390

Thumbnail 1430

So we built an internal service that translates the service API data model into authorization inputs. The service formally verifies the translation of that API data model to the authorization inputs at build time. And then it abstracts authorization away from the service team, so they can remain focused on what they do really well, on their data models and on their business logic. As a result, this creates a simpler interface to IAM and reduces AWS's service team's efforts and responsibilities when they're launching new functionality, reducing the amount of work they have to do to integrate with IAM properly. And the benefit that matters the most is you as a customer can then rely on more consistent authorization experiences.

Thumbnail 1450

Thumbnail 1460

I wish I could spend more time telling you about the underlying authorization environment and what we've done. Anything I just said does not do justice to all of the hard work the teams did. But here are some of the key takeaways before I hand it back to Ilya. One, don't forget about fostering that security-first engineering culture. When you do it, it enables your builders to be better owners. But it also enables your team, if you are in a security organization, to focus on building the really awesome stuff like I just talked about.

Thumbnail 1480

The second one is you can learn and assess how we do service authorization through our audit reports. But we also have a service authorization reference API that you can use to be able to get that data programmatically. And then use tooling to enforce correctness. Think about that in your practice similar to how we did it. Lean on analyzers like IAM Access Analyzer for policy validation and implement those guardrails. And with that I'm going to pass it back over to Ilya. Thank you for your time to talk about security of the cloud. Here you go.

Security in the Cloud: Active Empathy and Customer-Centric Solutions

Thank you. Thank you Kristen. Now it's interesting, right? Because I'm going to pivot to security in the cloud, but based on what Kristen presented, really the session should be over. Because security in the cloud is really your responsibility. So what am I going to talk about?

Thumbnail 1510

Thumbnail 1520

Thumbnail 1540

This is actually really important because even though there is a shared responsibility model, it doesn't mean that we step away from the security in the cloud. AWS really is focused on making sure that it's easy and straightforward for you to implement your side of the shared responsibility model. Once, somebody in the security team told me, "The best way to describe that is that we own the success of our customers. And if you cannot implement your controls in a straightforward and easy manner, then we have also failed." So from that perspective, we have a different kind of set of goals that are more focused on the customer perspective. We want to make sure that identity security should be straightforward to implement correctly by our customers.

Thumbnail 1560

Thumbnail 1570

We want to make sure that secure-by-default and intuitive controls are available whenever possible, and that's kind of our design principles. And then from an outcome perspective, we want to make sure that security is an enabler, like I mentioned before. So besides the identity security team, which is a set of security engineers that are focused on the security of the platform, we also have solution architects and other folks within the identity service that are focused on making sure that we achieve that goal. And we have an additional set of tenets for that. From that perspective, our tenets are much more customer-centric. Active empathy is one of them. In AWS, we always pride ourselves that we deliver features based on customer input, but we want to go a step further.

We actually want to be in the shoes of the customer. We want to experience firsthand the pain that customers may have. We want to make sure that things are easy to use, and consistency is key. Sometimes ease of use and consistency go together, and sometimes we have to make trade-off decisions. Sometimes easiness could lead to inconsistencies, but we always want to make sure that we have consistent and expected results for our customers. And then, because we are solution architects, we focus a lot on actionable and prescriptive guidance, and we make sure that that guidance is actually achievable. We make sure that we can implement, test, and validate it before recommending it to our customers.

Thumbnail 1640

Thumbnail 1660

So how do we do that? What are some of the tools that we use? Some of the tools that I'm going to cover, just like Kristen talked about on the security of the platform and some of the internal capabilities, I'll share with you some of the internal things that we use in the Identity Solutions team. One of the things we said in order to achieve active empathy is we want to actually operate an AWS environment at scale. We want to be our own customer. So we actually operate a large AWS multi-account environment using only AWS services, no Amazon tooling, no special stuff, and we want to see what that looks like. It has enterprise networking, it's connected to external IDPs, it has all the things you need to do like patching and security services and all that.

We also implement reference implementations of enterprise controls, like data perimeter controls, which I'll talk about what those are. How many folks are familiar with data perimeters? Okay, some. But we also implement other controls as well. And then what we want to do is we want to identify sharp edges from a customer perspective early in the process. So we're actually embedded in the service team because we actually are part of the launch plan. So we actually get to validate and test different features that come out before they go to our customers.

Thumbnail 1710

Data Perimeter Controls: Protecting Resources, Principals, and Networks

Now, because I mentioned data perimeters, there are amazing deep dive sessions on data perimeters, so I'm not going to go too deep. But I do want to give you some context because it's going to be important for later on as well. What data perimeter controls are is basically saying things that are mine, I should be able to access: my resources, my principals, my networks. But things that are outside of my environment, somebody else's AWS account outside of my trust zone, somebody else's resources, somebody else's principals, those things I do not want to access, of course outside of trusted partnerships and third-party vendors.

Thumbnail 1750

Thumbnail 1760

Thumbnail 1770

And the way data perimeter controls are implemented is really a combination of various different IAM primitives. So first, we use Service Control Policies. So if I want to say only trusted resources could be accessed by my principals, I could put an SCP with ResourceOrgID condition key that says only resources that belong to my organization can be accessed. The other thing that we implement is Resource Control Policies, because I need to do the same thing on the resources. I want to make sure that my resources could only be accessible by principals that belong to my organization, or my trusted vendors, or my trusted partners.
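
To make those two controls concrete, here is a hedged sketch of what they can look like, expressed as Python dictionaries. The organization ID is a placeholder and the statements are simplified; the tested policies in the AWS data perimeter reference repository are the better starting point and include additional exceptions.

```python
# Simplified sketches of the two organization-wide controls described above.
# ORG_ID is a placeholder; the tested policies in the AWS data perimeter
# reference repository are the source of truth and include more exceptions.
import json

ORG_ID = "o-exampleorgid"

# SCP sketch: my principals may only access resources that belong to my organization.
scp_only_my_resources = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "EnforceResourcePerimeter",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
            # Deny any request whose target resource is outside my organization.
            "StringNotEqualsIfExists": {"aws:ResourceOrgID": ORG_ID}
        }
    }]
}

# RCP sketch: my resources may only be accessed by principals in my organization.
# RCPs currently apply to a subset of services, so actions are listed explicitly.
rcp_only_my_principals = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "EnforceIdentityPerimeter",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:*", "sqs:*", "kms:*", "secretsmanager:*", "sts:*"],
        "Resource": "*",
        "Condition": {
            "StringNotEqualsIfExists": {"aws:PrincipalOrgID": ORG_ID},
            # Don't block calls made by AWS service principals on your behalf.
            "BoolIfExists": {"aws:PrincipalIsAWSService": "false"}
        }
    }]
}

print(json.dumps(scp_only_my_resources, indent=2))
print(json.dumps(rcp_only_my_principals, indent=2))
```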

Thumbnail 1790

The third thing that we do is implement VPC endpoint policies, because VPC endpoint policies help you prevent certain exfiltration paths. For example, making sure that my developers don't accidentally bring their personal credentials into the environment and then access data or copy data to an environment that doesn't belong to my organization.

Now, you don't need to memorize all of these things. I do have a QR code. Our team actually maintains those reference policies in a repository, and we test and validate those policies. We think those are a great place for customers to start, so that QR code is available.

Thumbnail 1840

Now that sounds all good, but we want to validate and make sure that those things are actually effective. So we have another set of tooling. Internally we call it Looking Glass, but basically what it is is an API testing platform. What we actually want to do is test various AWS APIs with different parameters, different permutations of access, and we want to make sure that the controls that we have implemented are actually effective. Are there any nuances, any considerations for an AWS service where maybe the default controls are not sufficient? We validate the controls that are deployed in Mirror World, and we make sure that they're consistent across AWS services. The nice thing is we also do this all the time, so we do this consistently, and we can actually detect if there's any sort of drift in those controls.

Thumbnail 1880

Thumbnail 1900

Thumbnail 1910

Now just to give you a little bit of an example of what kind of testing we do, if I have these two environments, let's say my AWS organization on the left side and then an external AWS organization on the right side, the only happy path is my credentials. My roles should be accessing my own resources. That should be the expectation there, and it's allowed. But any other path, my role accessing an external resource that I didn't explicitly allow because it's an untrusted resource, should be denied and vice versa. And many other different permutations and variations, so an external principal trying to access my resources and so forth. All of these different permutations that you could see quickly add up. We do this in an automated way, and we do this across different AWS services, APIs, and different resources.

Thumbnail 1930

And what we have done is actually take those lessons and externalize them to our customers as well. So at re:Inforce, one of the byproducts of this work we shared is that we're calling out in our public repository additional considerations that you may want to know about when you want to secure a specific service. For example, maybe a service behaves in a certain way, and I'll show you some examples, where you just can't use the standard controls that you may expect. So we give you some additional options for how you could implement those controls.

Thumbnail 1980

Service-Specific Considerations: Providing Prescriptive Guidance for Data Perimeters

Now, we use three different types of recommendations. They could be preventive controls, which could be additional SCPs or SCP statements. They could be proactive controls, maybe something like a CloudFormation hook. Or they could be detective controls, possibly with remediation. Those are also available as part of our repository, and we're going to continue to expand those considerations for our customers. Now I'll just walk you through a simple example of what we're talking about here.

So let's assume you're trying to implement a control that says I should only be able to access my resources with my principal from my VPC. So you can see it's a VPC endpoint policy here, and I have on the left side standard controls that I want to implement. I want to say only my principal organization ID is allowed, and only resources that belong to my organization. That's the standard data perimeter, and that's all good. But depending on the different service you use or different feature in the service you use, you may need to access a resource that doesn't belong to you.

For example, in this case there are patch baseline snapshots. Those are actually living in a service-owned bucket, so you need to know about that. So one of the things that we do is we consistently make sure that services document those resources. We then provide a sample policy of how you could implement a control in your environment to allow this if you need to use this capability. And what we're trying to do is decrease the amount of time it takes for customers to discover this information, and we want to give you those controls out of the box.
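
For reference, a VPC endpoint policy implementing the control just described might look roughly like the sketch below. The organization ID, Region, and bucket naming are placeholders to verify against the Systems Manager documentation and the reference repository.

```python
# Sketch of an S3 VPC endpoint policy combining the standard data perimeter
# conditions with an exception for the service-owned patch baseline snapshot
# bucket that Patch Manager reads from. ORG_ID, the Region, and the exact
# bucket naming should be verified against the service documentation.
import json

ORG_ID = "o-exampleorgid"
REGION = "us-east-1"

s3_endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowMyPrincipalsToMyResources",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": ORG_ID,   # only identities from my org
                    "aws:ResourceOrgID": ORG_ID     # only resources owned by my org
                }
            }
        },
        {
            "Sid": "AllowPatchBaselineSnapshotBucket",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            # Service-owned bucket used by Patch Manager; it is not in my org,
            # so it needs an explicit exception.
            "Resource": f"arn:aws:s3:::patch-baseline-snapshot-{REGION}/*",
            "Condition": {
                "StringEquals": {"aws:PrincipalOrgID": ORG_ID}
            }
        }
    ]
}

print(json.dumps(s3_endpoint_policy, indent=2))
```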

Thumbnail 2050

Another simple example is SNS Subscribe. It allows you to create subscriptions to endpoints that are not AWS resources, like an email address or an SMS number. There's no concept of putting a resource organization ID condition against those types of endpoints, and that's rightly so; that's just the behavior of the service. But if you want to use this capability because it's important to you, yet you want to restrict it, to say only my domain is allowed or only this specific endpoint is allowed, you would need an additional SCP statement. So again, we make that available in our repository so that you can use it and don't have to spend time discovering it.
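
As a hedged sketch, such an SCP statement could take roughly the shape below; the domain is a placeholder, and the tested statement in the reference repository may differ in its details.

```python
# Sketch of an additional SCP statement that restricts sns:Subscribe to email
# endpoints in your own domain. sns:Protocol and sns:Endpoint are documented
# condition keys for Subscribe; "example.com" is a placeholder.
deny_sns_subscribe_untrusted_endpoints = {
    "Sid": "DenySnsSubscribeToUntrustedEmailEndpoints",
    "Effect": "Deny",
    "Action": "sns:Subscribe",
    "Resource": "*",
    "Condition": {
        "StringEquals": {"sns:Protocol": "email"},          # scope to email subscriptions
        "StringNotLike": {"sns:Endpoint": "*@example.com"}  # allow only my domain
    }
}
```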

Thumbnail 2100

So with that I'm going to pass it over to Chris, and he's going to talk about how they've implemented some of these controls at Capital One. Thanks Ilya. Thank you. Great, a quick sound check. Good, great.

Thumbnail 2110

Capital One's Data Perimeter Journey: From Trial and Error to Accelerated Service Adoption

First of all, my name is Chris Schultz. I've been at Capital One for a few years. I actually started, as Ilya pointed out in the last slide or intro slide, working in public cloud about 10 years ago. I actually started on a cloud engineering team, and I like to joke that when they asked who wants to do this IAM thing, everybody else stepped back more quickly than I could, and I got left doing this for the last 10 years.

Thumbnail 2150

But you can see Capital One at a glance. We're a bank. A lot of people think we're a credit card company, but we are a bank. We're one of the largest auto loan originators in the world, and we're also really a technology company at heart. We've really transformed over the years, and I'll let you read through this timeline, but we are not a company where business drags IT along. We're a company where technology and business work together to produce outcomes for our shareholders and customers.

Thumbnail 2160

So how do we think about security in the cloud? There are quite a few things that you need to do to ensure your own security in the cloud, which Ilya touched on earlier. When most people think of IAM roles, top of mind is what APIs you can access. But what or who can use a role is also important. Ilya talked about data perimeters, but I want to touch briefly on our own journey in implementing them. I think it'll show you how much Kristen and Ilya's work has really made those types of controls much easier. I also want to highlight how we lock down role sharing in our accounts so we can be assured that whoever is using a role is who we intended. So these are two topics that are often glossed over when we talk about IAM in the cloud.

Thumbnail 2220

So a little bit about data perimeters at Capital One. We started on this late 2019, early 2020, and they're really important to us. The reason we want to bring this up and the reason I'm repeating this is if you have a mental model of cloud security that is entirely focused around VPCs, this concept may not be immediately apparent to you. At some point a few years ago we started just calling it the cloud instead of the public cloud, and we probably shouldn't have done that. These APIs that are accessed, they're not in your networks, they're not your APIs. They're accessible to anyone on the internet, and the only thing really protecting you are these data perimeters.

Without those good guardrails, any of the STS tokens that are generated from your roles can be used from pretty much anywhere on the internet, and data perimeters provide those guardrails. So this is not a problem that's unique to Amazon. This is any cloud service provider that provides any kind of API tokens, but Amazon just gives us the tools to help control that.

Like I said, we started on this journey roughly late 2019, early 2020. I had a lot less gray hair back then. I don't completely blame data perimeters, but I'm sure it was a part of it. There was a lot of trial and error. We knew there were going to be hiccups along the way, but a lot of the service inconsistencies that Ilya talked about and those resources that are owned by Amazon and need to be accessed from our accounts, we had to discover those the hard way. We had to break people's applications, work with Amazon support to figure out exactly what was going on, and continually iterate on those data perimeter policies. It took us months.

So we moved through those early service issues, things settled down, and today we've grown from hundreds of accounts back then to thousands of accounts now. This has really enabled us to accelerate our service adoption. Most of our concern around deploying a new Amazon service is how well we can protect it with those data perimeters. And now that we have much stronger data perimeters in place, we can allow more things, particularly in our non-production environments, for developers to play with early. So we don't have to threat model an entire service to make sure it has the kinds of behaviors that we want.

Controlling Role Usage: Preventing Role Shopping Through Path-Based Constraints

So we've really accelerated our service adoption without changing our risk tolerance levels, which is very important to us. It's still not perfect. There are some things we have to do every time we create or delete a VPC; we have to update a Service Control Policy with those VPC IDs. Whenever a new service comes out, we have to consult the service guidance and make sure the policies are set up correctly for that service. But where we want to go tomorrow is basically to make these things set and forget.

So we're really pushing Amazon to give us some IAM condition keys that make all those manual updates go away. Ideally, we'd really love to have just a set of data perimeter policies really high up in our OU tree that we just never have to look at again.

Thumbnail 2430

So you've now heard us repeat this twice: data perimeters are a thing. The call to action is, these things are not optional. You really need to go out and look at this, look at the sample policies that Ilya provided earlier, and get these implemented. They're fundamental to protecting where your tokens can be used from. If you focus on nothing but trying to keep those tokens from getting out, tomorrow there's going to be some new method that somebody has to extract those tokens. These policies, these data perimeters, help nullify that as a risk. So if somebody grabs your token and tries to use it from somewhere else, it just won't work—access denied.

Thumbnail 2470

The next thing I want to talk about is limiting how roles are used inside your AWS account, which is also really important. You may have heard the term "role shopping." Someone logs in, gets into an account, they need to build something quick, they grab somebody else's role and run with it. It used to be a much more common problem at Capital One, but we've wrestled it to the ground.

The first thing I'll talk about is that your AWS account is the best way to segregate access. You'll see this in the white papers that Amazon has published. Smaller accounts are better, but really the key thing is they're the most consistently deployed construct in Amazon, period. There are almost no services that disobey that account boundary, mainly because that's how they make their money—they bill against the account. Most service teams want to do that. I'm not saying deploying multiple applications in a single account is necessarily absolutely evil or an anti-pattern. But when you do that, you do need to focus on some other things to prevent your roles from being shared and used in ways that you don't intend. You need to establish assurances so that you don't have a mess.

Thumbnail 2550

When provisioning a role with permissions, you need to focus on things that have AssumeRole APIs or something that leverages the PassRole permission. The best way to do this is to use explicit resource ARNs to limit what those permissions can reach. With AssumeRole in particular, you can use resource tag condition keys so that you can ensure that the primary role, when it goes to assume a different role, can only assume those roles that are tagged a certain way. But you've got to make sure you have good control over those tags as well. You don't want somebody to be able to change that. On the trust policy on the target role that's assumable, you can control it there as well because you have to list the role that can assume it. But we've seen cases where people just tend to wildcard both, and that opens the door to bad access patterns.

PassRole is an interesting case because it's a permission only—it's not actually an API. So it does not consistently support those resource tag condition keys. You can't rely on something being tagged and being able to limit PassRole that way. It may work sometimes, but there's still a big blue box in the documentation that says, "Don't do it, it's not consistent." It also doesn't show up in CloudTrail, so it's not necessarily easy to track who's using it. You can go back and see who's using AssumeRole because you can go look for those events in CloudTrail. But with PassRole, you need to figure out what is the source function that's actually using that—maybe it's EC2 RunInstance or Lambda CreateFunction.

What we did at Capital One is we included an application identifier as a topmost element of every role path, and it's also in the identity policy's ARN path as well. This allows us to ensure that every PassRole is constrained to only the roles that belong to the same application. The other thing you can do is also just limit the number of items in a role's trust policy. So if you know a role is primarily going to be used for Lambda, don't put EC2 in the trust policy as well, because that way that role could be used in either place. Even if it's in the same app, we prevent that because the odds of the code base being the same on the EC2 instance as the Lambda is generally pretty low, which means that role is usually over-provisioned in the first place.
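
As a sketch of the two ideas just described (scoping sts:AssumeRole with resource tags, and keeping a trust policy to the single service that needs it), the statements below show the general shape. The tag key, application identifier, and account ID are placeholders, not Capital One's actual configuration.

```python
# Hedged sketches of the two role-usage controls described above.
# Tag keys, the app identifier, and the account ID are placeholders.

# 1) Identity-policy statement that lets a workload assume only roles tagged for
#    the same application (iam:ResourceTag evaluates tags on the target role).
assume_only_same_app_roles = {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::111122223333:role/*",
    "Condition": {
        "StringEquals": {"iam:ResourceTag/app-id": "app-1234"}
    }
}

# 2) Trust policy that lists only the one service that actually needs the role,
#    rather than adding EC2 "just in case".
lambda_only_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
```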

Thumbnail 2700

So let me show you a couple of examples.

Thumbnail 2710

Managing PassRole and AssumeRole: Leveraging the Service Authorization Reference

Here we have a snippet from an identity policy. We're using PassRole, and you can see it there at the top where we have an example of the role path. Role ARN paths are actually a very old feature in Amazon. They've been around forever. You see them all the time in S3 with object prefixes and things like that.

The IAM service allows you to set your own. There's an overall length limit on the ARN, but I don't think there's actually a limit to the number of sub-paths you can have. We claim the top one. Every role that gets provisioned gets an application ID that's registered in our CMDB, and that really helps us organize everything.

We have controls in place. For example, if you have an identity policy that has one app ID in its path and a role with another app ID in its path, we don't let you attach it if there's a mismatch. Some fun stories in the past before we implemented this: we would have one team want to make a change to an identity policy. Maybe they're doing the right thing and removing some permissions because they knew they maybe stopped using a service or stopped accessing a bucket. And suddenly another app in the same account would blow up because they were sharing those policies.

So this is not just a security risk, it's an operational risk as well. The other example I have here is a condition key like PassedToService. We are a bank. We like belts and suspenders. We like having multiple controls. But also, you can't really put comments in identity policies, which is not exactly ideal. Sometimes you can use things like this to say, "Hey, just remind yourself, this role is really meant to be used for Lambda," in addition to having it as a restriction. So that's an example of how to help control PassRole.
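
A hedged sketch of what such an identity-policy statement could look like is below. The path convention (application ID as the topmost path element) and the specific values are illustrative, not Capital One's actual policy.

```python
# Hedged sketch of the identity-policy statement described above: PassRole is
# constrained to roles under the same application's path, and iam:PassedToService
# both documents and enforces where the role is meant to be used. The path
# convention, app ID, and account ID are illustrative placeholders.
passrole_same_app_only = {
    "Sid": "PassOnlyThisAppsRolesToLambda",
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "arn:aws:iam::111122223333:role/app-1234/*",
    "Condition": {
        "StringEquals": {"iam:PassedToService": "lambda.amazonaws.com"}
    }
}
```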

Thumbnail 2820

But how do you know when PassRole is necessary? I've talked to developers that think it's always required. In a lot of teams, if you've federated out your ability to implement roles, you might have given your developers access to do this, maybe because they're in an isolated sandbox account. You've put permission boundaries in place. You've got other guardrails in place. It's amazing if you don't use IAM all the time, the assumptions that people make. PassRole everywhere is one.

I used to find services listed in trust policies that had no business being there because they thought, "Well, I'm invoking this. I should put SQS in the trust policy on my Lambda execution role." Kind of benign, not necessary. But the service authorization reference that's listed here is a good roadmap to exactly what is required for any permission. This snippet, you can find the link at the QR code there. It shows the Lambda create function for the Lambda service.

You can see it takes a function as a resource type. I truncated some of the condition keys, but those are the condition keys that are available. But a couple of years ago they started including dependent actions. These are other permissions that this might need. PassRole here is listed as a dependent action. So if you come to the service authorization reference for a particular thing and you don't see PassRole, you don't need it in your identity policy, and you can pull that out.

Thumbnail 2930

It's really important because if you do have PassRole in there and then something else leaks in, like EC2 create instance, suddenly you've got the ability to create a function or create an instance, a pivot point in your account that an attacker can use. The other cool thing, which I think Kristen alluded to and which has only been out for a few months, is that the service authorization reference, which previously was just a human-viewable webpage, is now available as a set of machine-readable JSON files.

This is really cool. They've really begun to expand on this, and all the service consistency stuff that they were talking about gets funneled into these files. It gives you a lot more information about how to use the services. In this example, you can see the Lambda create function. You can see where it has PassRole as well, and you can see it uses the PassedToService context key.

I also threw in tag resource as well. What they recently added is the Boto function call. So now, for the first time, you can easily map from a Boto function call back to the IAM permissions it uses and then determine everything that's relevant to those IAM policies. It takes a lot of the guesswork out of creating identity policies.
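
If you want to script against those files, here is a minimal sketch. The index endpoint URL and the JSON field names ("service", "url", "Actions", "Name") reflect my reading of the published service reference format and should be treated as assumptions to verify against the AWS documentation.

```python
# Sketch: download the machine-readable service reference for Lambda and print
# the CreateFunction entry so you can inspect its condition keys, dependent
# actions (such as iam:PassRole), and SDK method mapping. The index URL and the
# field names used here are assumptions to verify against the AWS documentation.
import json
import urllib.request

INDEX_URL = "https://servicereference.us-east-1.amazonaws.com/"  # assumed public index

with urllib.request.urlopen(INDEX_URL) as response:
    service_index = json.load(response)

# Find the per-service JSON file for Lambda and fetch it.
lambda_entry = next(item for item in service_index if item.get("service") == "lambda")
with urllib.request.urlopen(lambda_entry["url"]) as response:
    lambda_reference = json.load(response)

# Print the CreateFunction action record.
for action in lambda_reference.get("Actions", []):
    if action.get("Name") == "CreateFunction":
        print(json.dumps(action, indent=2))
```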

Thumbnail 3000

So the call to action here is really pay close attention to how you use AssumeRole and PassRole.

The whole point of this is building up your trust and assurance that the roles you deploy in your account, whether they're in a small account or an account with a lot of shared apps, are used where you expect them to be and attached to the things you expect them to be attached to, without the possibility of them being reused elsewhere in a way that you did not intend. And so with that, I'm going to hand it back to Ilya. Thank you.

Thumbnail 3040

Thumbnail 3060

New Identity Innovations: Outbound Federation, AWS Login, Temporary Delegation, and Policy Autopilot

Awesome. So with that, what we're going to do is I'm going to close out with a few interesting launches. We had actually over 15 different identity and governance launches. We're not going to cover all of them, but I want to cover the ones that are specific to this topic. The first one is actually really close to heart for Chris, because once customers like Capital One have gotten really good at deploying and managing their identities in AWS, what they have asked us is how can we go ahead and extend that and be able to access services outside of AWS. We all have different hybrid workloads, we may have some resources that are on-premise, we may need to call SaaS applications. And historically what customers would do is most likely use API keys, maybe in some vault, maybe in Secrets Manager, or somewhere else.

There's still a lot of proliferation of those long-term API keys and access keys sitting in a vault somewhere. What we want to do is make it easier for you to have a native way to take your strong foundation of identity in AWS and access external services. So how do we do that? We launched AWS IAM Outbound Identity Federation. What it allows you to do is obtain short-lived credentials that are publicly verifiable and use them to authenticate to external services.

So what this provides is enhanced security. It eliminates the need to store those long-term credentials because basically there's a new API, it's the STS GetWebIdentityToken. You make that call, you get a cryptographically verifiable JWT token, it has all the context information about your identity. You also get a per-account issuer URL, and then what you do is you pass that on to your external service. That service could then go ahead and verify your identity.

What's also really cool is we put a lot of information about who you are, including AWS context. So things like org_id, account information, tags are all part of that JWT token. And then once it's verified, then the service or your custom application will give you a token to be able to access that application. So it's really great, it reduces the complexity. We really want to try to drive down long-term credentials, period, in AWS, and it's really standards-aligned. It's using common standards and it will work with third-party applications as well as your own custom workloads.

Thumbnail 3200

This is the example I mentioned. Take a look at some of the things that we're putting in as claims in the identity context. You can see the org_id, you can see the OU path, you can see the principal tags. We're really proud of the fact that we have a unique token issuer URL per AWS account. It's not one issuer URL for the entire AWS cloud; it's an issuer URL specific to your AWS account, so we're really excited to see how customers will use that.

There's actually also an interesting use case here. You can call this API even if you're not authenticating to a third party, just to get this context information. Maybe you're doing some troubleshooting, or you're trying to figure out why you're getting an access denied, or you're doing some research. You can call this API and get all of this rich context information right there.
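
As a rough sketch of that inspection use case, the snippet below requests a token and decodes its claims locally. The boto3 method name is assumed to mirror the GetWebIdentityToken API and the response field name is a guess, so check the STS API reference for the real signature and any required parameters before using it.

```python
# Hedged sketch: request an outbound federation token from STS and inspect its
# claims (org ID, account, principal tags, and so on). The boto3 method name and
# response field below are assumptions based on the API name mentioned in the
# talk; consult the STS API reference for the real signature and parameters.
import base64
import json
import boto3

sts = boto3.client("sts")
response = sts.get_web_identity_token()              # assumed method name
token = response["WebIdentityToken"]                 # assumed response field

# Decode the JWT payload locally (no signature verification) just to look at the
# identity context claims.
payload = token.split(".")[1]
payload += "=" * (-len(payload) % 4)                 # restore base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print(json.dumps(claims, indent=2))
```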

Thumbnail 3250

The next one, and it's related to the theme of removing long-term credentials, is that we wanted to give you a simple way to have credentials for your CLI and your SDK. We also launched this right before re:Invent. It's the new simplified developer access with AWS login, a new authentication method. It brings you a console-style login through a web-based flow. You open up your CLI, type in aws login, and it opens up your console. It doesn't matter how you're logged into the console, whether it's federated or, hopefully not, IAM users; it works with either.

And through the web flow, you automatically get, and this is the key, temporary short-lived credentials for your CLI or SDK. Because one of the things that we've seen, unfortunately, is the proliferation of long-term credentials for CLI and SDK use cases.

This new approach will ideally drive that down. No long-term credentials are needed. They're temporary access tokens that expire automatically.

Thumbnail 3320

This enables faster onboarding, especially if you're just experimenting. Maybe you have your own personal AWS account and you quickly want to test something out. You create an AWS account, and literally within a couple of steps, you automatically have secure CLI access without having to create access keys or secret keys. It offers universal compatibility, so it works with your CLI, your SDK tools, and various IDEs.

Thumbnail 3350

The other one that's really interesting is IAM temporary delegation. How many of you use marketplace products here? Okay, quite a few. Historically, if you use a marketplace product or a lot of third-party products, they need to deploy some infrastructure and resources into your account. That often requires a lot of dancing, right? The vendor says we need this role to be created, or we need this resource to be created. Then you as an application team try to find which internal owner is going to approve this. Then you go to your IAM team to create a role, and that could take weeks and sometimes even longer.

What we wanted to do is simplify that process, both for our customers and for our partners. IAM temporary delegation allows our ISV partners to create a guided experience. You go to a marketplace product, you say, "I want to deploy this product," and it automatically redirects you back to your AWS account. You get to see what kind of access the partner needs to have, and you can set the duration of that access. Then you can approve it yourself if you have the right permissions.

What's really cool for enterprises is you can actually have a workflow for approval. If you yourself cannot approve that, it could actually automatically flow to maybe a central security team. What's cool is that team already has the full context of what's needed. They have the permissions, they have the resources, all that context is available to them. Then once they approve it, the partner has the right to go ahead and deploy resources in your environment. The second that expiration happens, the access is gone.

What's also cool is some partners need access that's a little bit more persistent. Maybe they need an IAM role created to be able to do describe calls. Partners could use this flow as well, and what they could do is create a permission boundary that you get to deploy in your account. You get to see what that permission boundary looks like, you could even modify it if needed, and you control it. You have full control of that. Then that role could be created in your account with a permission boundary. This way a partner could have persistent access, but you have the governance and control in place. The cool thing is also, obviously, everything is logged, and it's in your account, and you have everything in CloudTrail.

Thumbnail 3490

Thumbnail 3500

This just describes the flow. The end user starts with a partner product, you set up the AWS integration, and either you self-approve or you go to an admin. Then the partner uses the temporary credentials to do the deployment, or the persistent access with a permission boundary. What's really cool here, and it's the same theme, is that you no longer have to have those long-term persistent access credentials for your partners, especially if they don't need them. A lot of times we see that a partner deployed something, the role is still around, and it's not necessary anymore. This eliminates a lot of that.

Thumbnail 3540

The last thing I'll mention is a launch that happened right before re:Invent, the day before. Has anyone seen this, IAM Policy Autopilot? This is really cool. It ties back to what Chris mentioned about the service authorization reference. As part of this launch, we actually had to update the service authorization reference so that it maps API methods and SDK calls to the permissions they require.

Thumbnail 3560

What's cool is that IAM Policy Autopilot allows you to point it at your code. The autopilot, which is an open source solution available on GitHub, first does static code analysis. It tries to understand what permissions are needed based on your code. It also has a local MCP server. It uses the API authorization information to determine which permissions are needed for the things you're trying to do in your code. Then it allows you to create a starting policy that you can use. Of course, you can always refine that, but this is just a great way for developers not to have to spend hours discovering what permissions are needed.
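
To make the idea concrete, here is a small illustration of the kind of translation the tool automates, going from SDK calls found in code to a starting policy. This is an illustration of the concept with placeholder resources, not actual Policy Autopilot output.

```python
# Illustration of the mapping IAM Policy Autopilot automates: SDK calls found in
# application code imply IAM actions, which become a starting policy you then
# refine. Bucket, queue, and account values are placeholders; this is not the
# tool's actual output.
import json

# Actions discovered from static analysis of the application's SDK calls
# (placeholder mapping for illustration).
discovered = {
    "s3.get_object": ("s3:GetObject", "arn:aws:s3:::example-app-bucket/*"),
    "sqs.send_message": ("sqs:SendMessage",
                         "arn:aws:sqs:us-east-1:111122223333:example-app-queue"),
}

starting_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": action, "Resource": resource}
        for action, resource in discovered.values()
    ],
}

print(json.dumps(starting_policy, indent=2))
```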

It really accelerates how developers get to a policy that's functional and ready to use. The solution uses deterministic code analysis, and it works with all the common IDEs. It's open source, so we actually would love to get your feedback. Please try to use it and give us feedback on that.

Thumbnail 3640

Key Takeaways: Implementing Security-First Patterns and Eliminating Long-Lived Credentials

Okay, now to close off the session. What we said is the whole purpose of the session is that we should be able to share our lessons learned so that you could go ahead and implement this in your environments. Kristen already covered some of these lessons about fostering a security-first engineering culture and using tools to enforce correctness. I want to call that out because we talked about things like automated reasoning as an example. We actually have that in the products that you could use. IAM Access Analyzer uses the same type of capabilities that we actually use internally to prove the correctness of the IAM engine. You could use that for your use cases to be able to understand unintended access, to be able to understand unused permissions, or to be able to create policies that are right for you.

Use secure-by-default patterns. This is similar to what Chris was talking about. Don't make one-off identity decisions. Create patterns, templates, and prescriptive guidance for your teams so that they find the secure path more easily instead of trying to figure something out on their own. Establish data perimeters early. A lot of the path has been paved for you over the last several years. Things have definitely gotten a lot better, and we encourage you to implement data perimeter controls.

Eliminate long-lived credentials wherever possible. We really are trying to shrink the number of use cases that would ever require long-lived credentials. So please take advantage of all the capabilities for short-lived credentials and delegate safely. We do encourage delegation because that federated model is what enables you to scale your capabilities. But you want to put guardrails in place to make sure that that is done in a safe way.

I know we're out of time. Thank you. Please complete your evaluations, and we'll make sure to be available here to take questions. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
