🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Overview
📖 AWS re:Invent 2025 - From enterprise data mesh to AI with Amazon SageMaker Unified Studio (IND3322)
In this video, Rizwan Wangde and Sam Gordon from AWS Professional Services present a framework for building modern data platforms in financial services to support agentic AI. They identify four critical barriers: data silos, trust in data, cross-organizational governance, and data consumption patterns. The speakers introduce SageMaker Unified Studio as the central solution, demonstrating how data contracts, unified catalogs, and automated lineage tracking address these challenges. They emphasize that 40% of agentic AI projects fail before 2027 due to inadequate data foundations, and showcase practical implementations including Customer 360 use cases, achieving 80% data discoverability and 90% reduction in manual governance processes through their flywheel approach.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Foundation of AI Lies in Data
Hello everyone. Good morning and welcome to the last day of re:Invent. It's very good to see you here. My name is Rizwan Wangde, and I'm a Senior Cloud Architect with AWS Professional Services. I'm based out of Sydney with a specialization in data. For the last few years, I've been working closely with large financial services institutions on this very subject, and what we bring to you today is exactly what we've seen in the field, what we've learned, and what we've delivered while seeing a significant shift in the business and technology mindset.
Hi everyone, my name is Sam Gordon. I'm a Senior AI and ML Consultant working with AWS Professional Services. I've been working alongside Rizwan with a number of financial institutions on our side of the world, so I'm going to be talking about the second half of our story.
So I would like to start by asking all of you to take a moment and just think and reflect on the question that I have for all of you. If you knew the state of everything that you operate and if you could reason on top of the data that you have, what are the opportunities that you can unlock, and what are the problems that you can solve? I really want you to think about this from the context of your business and your organization. Today's session will help you to understand how you could build a future-proof modern data platform that will help you realize endless possibilities based on the insights you can create.
The Reality Behind Agentic AI: Why Data Foundations Matter
I know you've seen this iceberg representation of data and AI many, many times before, but it is the best that I could come up with, especially with what's happening today. Everyone is just talking about building AI agents. Hands up if you've not already discussed building agents or you're not already doing some kind of a proof of concept on AI agents right now. Exactly, everyone is involved in it. But the reality is that these agents need memory, context, real-time access to good quality data, and sophisticated reasoning capabilities, and all of that is enabled by robust data foundations.
The reality is that data is the fuel for good quality AI applications. This is why I can comfortably say that there is no future for production-grade agentic AI systems without a solid data foundation. Based on my personal experience in building scalable data platforms for large financial services institutions over the last few years, I would say that there are four key elements that build that data foundation: breaking data silos, creating trust in data, establishing cross-organization governance, and creating flexible data consumption patterns for both humans and your AI agents.
There is a lot of experimentation happening with agentic AI at this point in time, but scaling is hard, and Gartner predicts that 40% of these agentic AI projects will fail before 2027. If you need more convincing, I would only mention that there is other research indicating that 52% of Chief Data Officers view their data foundations as not ready for AI consumption at this point in time. So how do we get to a state of being ready?
Four Critical Barriers to Modern Data Platforms
To make this happen, we'll have to take a closer look at the four foundations. The first one being breaking data silos. This is where the transformation and the paradigm shift needs to happen. This isn't just about technology, it's about building an organizational muscle to thrive in an AI-driven future. Additionally, how do you democratize data and make it discoverable and accessible by humans and AI agents, but in a well-secured and governed way?
Secondly, trust in data is everything. In today's regulatory environment, it is not enough for AI agents to make good decisions. You need to be able to trace each decision back to its source, to know where the AI agent made that decision from and using what data. When your agents are making autonomous decisions, the quality of those decisions directly depends on the quality of your data.
Thirdly, how do you enable cross-organization governance for human and AI agents? AI, particularly agentic AI, needs very comprehensive access to your business data, and this is what is required to make good business decisions. Now, imagine this. Your AI agent has access to just a limited portion of your data. How could you imagine what the outcome of that would be when the AI agent is giving you these responses? It doesn't have the complete context. It doesn't have the complete access to all your business data to make those right decisions and to give you the right outcomes.
And finally, data consumption patterns. We are living in a time where there are multiple business units within an organization, and data is completely distributed. It sits in silos, so how do you access that data? How does your AI agent access that data? How does it travel across the boundaries of your business units? But these foundations, and I'm very sure you can relate to this right now, are essentially today's barriers as well. They are currently the blockers. Most of the customers I speak to tell me they face a lot of problems when they talk about breaking data silos. It's essentially a vicious cycle at this point in time, and each barrier reinforces the next one.
So silos create trust issues, and because you don't trust your data, cross-organization governance is difficult. Now, because there is limited governance, what happens is your data consumption is limited. Because you can't consume all the data from all of your organization, again, this kind of enforces and justifies maintaining those silos. These aren't isolated problems. This is essentially a vicious cycle that we are talking about. And there's another thing that I'd also like to talk about, which is the flywheel.
Now, if you've noticed, there was no agenda slide with our presentation. This is essentially our agenda. We're going to talk about these four barriers, and we are going to unravel them. We're going to talk about the challenges each of these barriers pose to us, and then we'll talk about how do we go about solving them. And this is all based on real experience with customers. Before we get into that, what do you expect from your modern data platform? You would expect something around 70 to 80% faster time from question to answer. You would expect that a mature data mesh implementation will achieve 80% data discoverability across organizations.
You would expect automated data quality at scale and end-to-end traceability resulting in a 90% reduction in manual data governance processes. And you would expect maybe a 5 to 10 times increase in AI and machine learning experimentation. Imagine going from six months to just two weeks to deploy a new machine learning model. These aren't theoretical. They are proven outcomes from FSI leaders. At AWS, specifically Professional Services, we obsess over working backwards from use cases that deliver real business value to our customers.
We're going to use an example across our presentation for a use case of Customer 360 Insights, where you may find it difficult to gain a unified view of customer behavior across different business units in your organization. So beginning with the first barrier, there is a lot to cover. There are four barriers. The first one is data silos. Each barrier, we are going to talk about what challenges it poses, and then we'll talk about what the solutions are, and we will show you many more flywheels. So this is just the main flywheel, and then we get into each barrier and we'll show you the solution flywheel for each of them. Just keeping you prepared for what's coming next.
Barrier One: The Data Silos Challenge in Financial Services
I was in a customer meeting while at re:Invent just three days ago with one of the largest banks in ASEAN region, and we were discussing their path forward for their modern data platform on AWS.
Guess what their number one barrier was? Data silos, and I'm sure it's the number one barrier for most of the customers that I've already spoken to and for you as well.
So, what is the first thing that you visualize when you think about data silos? A fragmented domain landscape. You may have multiple domains, and in this situation I'm just using an example of four. They're using completely different technology stacks with no integration, or integration becomes very hard when you have a completely different technology stack in each of these domains. And this makes Customer 360 almost impossible.
Now in this case, let's look at retail banking. Retail banking uses an on-premises data warehouse and an Oracle database, a legacy way of doing data. The remaining three domains, you can see, are using somewhat more modern, cloud-based tools to interact with their data, but even so, the technology landscape is completely different across all three of them. So to build a complete, unified view of the customer, you would need account data from retail banking, credit scores from, let's say, risk, behavioral analysis from customer analytics, and probably transaction behavior from fraud detection.
But the problem is there's no way to connect these. So building a complete Customer 360 view requires manual discovery, which is weeks and weeks of effort. It requires custom integration between these domains because of the difference in technology stacks. And what you essentially end up with is more and more copies of the same data in different domains, which creates more silos. And of course you'll see a lot of quality issues along the way.
And one of the banks that I recently spoke to for this specific use case was expecting maybe nine to twelve months of work just before they could have an integration across all domains just for this use case, and that's even before we think about building an ML model or some kind of data consumption. What we see from these four domains is that the current state is essentially very chaotic, right? There is a mess of peering between different business units. So can we introduce a centralized set of capabilities and make this workflow simpler to manage at scale?
Introducing SageMaker Unified Studio as the Central Governance Solution
We do need it. Here we're looking at just four domains. This may not look very complicated, but imagine having ten domains, imagine having twenty domains, imagine the kind of mess that would be and what it takes to keep track of it. So the centralized service that we essentially recommend is SageMaker Unified Studio, and particularly the SageMaker catalog capability of SageMaker Unified Studio.
The madness we just saw can be solved by having the SageMaker catalog at the center of it all. Each business unit you can see now has a one-to-one relationship with your Unified Studio, and that's within your central governance account. And we can clearly see how clean this looks and easy to track. The SageMaker catalog is at the heart of what we want for our specific solution. It provides a full suite of governance capabilities across the organization.
It provides capabilities like creation of domain structures, which we'll talk about in just a bit, a mechanism for publishing data assets and data products, enabling data discovery across the organization, and automated access workflows, which is one of the biggest savings in time when we talk about giving access for one data set from one business unit to the other. And you start hearing more and more about how all these barriers essentially find their way to SageMaker Unified Studio, which is essentially why we had to introduce it at this point.
Now let's get back to the barriers that we were talking about. The first barrier is what? Breaking data silos, exactly.
Based on the numerous conversations that I've had with customers, they tell me that there are three key challenges to overcome, and I bet you'll be able to relate if you're already on this journey. The number one challenge is that there is no clear source of truth for data. The same entity is defined differently across domains, and conflicts arise. For example, in our Customer 360 case, think about credit scores: retail banking may show maybe 2.5 million customers in their data set, while customer analytics shows a completely different number. So why does this happen? Because there's no clear ownership of data. There's no mechanism for tagging data with its actual owner, so you can't really go back to the source of truth.
The second core challenge that we see with data silos is fragmented data discovery. Now, let's say you want to access the credit scores for the Customer 360 project. You go in and you find that the documentation is maybe two or three years old, and it's not valid anymore. So you ask your friend Brian, hey, do you know where I can get this data from and who should I reach out to for access? You're going to spend weeks and weeks just finding out that there are 15 tables across all the different domains for credit scores. And then there's zero metadata. There is no good data quality measure. You don't know how fresh this data is, and you don't know who the owner is. So you spend weeks before you identify that this data actually belongs to the risk department. Then it takes another few weeks to get the access permissions sorted out, and another week of working with the data to realize, oh, this is the wrong data set, and you start all over again. So you should have seen this coming.
The next one is duplicate data everywhere. Because of those last two challenges, there's no integration, every business unit has its own technology stack, and there's no easy way to query the data directly from the source, so you're handed a copy of that data instead. The more copies you create, the more silos you create for that data. And let's say when I'm working with the data, I realize that the quality is not that good, so I do some transformations on it. I've already taken the data from another business unit and transformed it, and the source of truth is already lost. I've created many copies of it, and it just gets messy. So how do we solve the first barrier, data silos?
Breaking Data Silos: Operating Models and Data Ownership
As I mentioned before, we'll have solution flywheels for each of these barriers. And yes, we will start from the absolute foundation, and the first thing that we want to look at is creating an operating model and a RACI matrix. What you see on the screen is a sample operating model that you can start with. What we want to do here is define the roles and responsibilities for all the different actions that we need to take, both from a central governance point of view and from a domain perspective. At the domain level, you can see the domain data product owner and the data engineer belong to the domains, while everything from security to governance is central to our governance function. So it's not that every domain has its own separate standard for security and so on.
The next one is, of course, that an operating model is incomplete without a comprehensive RACI matrix, so everyone on the team understands who is responsible, who is accountable, and who is consulted or informed. An example here would be the domain teams. Domain teams are now responsible for data product creation, so they own the data; they need to own the data product that they are producing and putting out there for the world, as in the organization, to consume. Now, I understand this is a bit of a cultural shift, because before, the data team used to own everything; now it's retail banking who owns customer account data and risk who owns the credit score data.
So each domain in our situation now is accountable for the data that they produce. What we want to make sure is the first iteration of this flywheel is non-intrusive, and we can always iterate and improve upon this based on requirements.
So the second solution is the multi-account strategy. Now remember, we're talking about a large organization here with multiple business units. So what we want to do is use the existing organization structure and map it onto what we want to create from a technology perspective, in this case to build that central governance. We're going to take the organization structure, distribute it, and give each business unit its own account. Why? Because it maintains very clean isolation from a security perspective, from a cost allocation perspective, and in terms of technology freedom. Your domains are still allowed to choose whatever technology stack they want. And regulatory compliance: you've got clear audit boundaries.
So this is how it looks in SageMaker Unified Studio. When you go under governance, you can see the domain units: the same structure that we saw is now distributed as domain units and subdomain units, and each domain unit has its own authorization policies, so you know which users or AD groups can create domain units, and who or what can create things like projects, products, or artifacts. It gives you that level of control. So that's your first entry point into security.
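If you prefer to script this structure rather than click through the console, the governance features of SageMaker Unified Studio are backed by the Amazon DataZone APIs. Here's a minimal sketch, assuming a placeholder domain ID and root domain unit, of creating one domain unit per business unit:

```python
# Hypothetical sketch: scripting the same domain-unit structure shown in the
# console, using the Amazon DataZone APIs that back SageMaker Unified Studio.
# The domain ID, root unit ID, and unit names are placeholders for illustration.
import boto3

datazone = boto3.client("datazone")

DOMAIN_ID = "dzd_example123"          # assumed SageMaker Unified Studio domain ID
ROOT_UNIT_ID = "root-domain-unit-id"  # assumed parent (root) domain unit

# One domain unit per business unit, mirroring the organizational structure.
for name in ["retail-banking", "risk", "customer-analytics", "fraud-detection"]:
    datazone.create_domain_unit(
        domainIdentifier=DOMAIN_ID,
        parentDomainUnitIdentifier=ROOT_UNIT_ID,
        name=name,
        description=f"Domain unit owned by the {name} business unit",
    )
```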
Now next, the third one, which is very important, is data ownership. Remember, we were talking about the source of truth problem. The way you solve a source of truth problem is by tagging each data set with the owner it belongs to. Have you ever heard of data contracts? Raise your hand if you have. Perfect. Amazing. Data contracts are essentially formal agreements or specifications for data exchange between the data producer and the data consumer. They contain attributes like ownership and schema structure, so they give you technical metadata about, let's say, the structure of the table that you have, the columns, their data types, and so on. They also provide validation rules for data quality, which we'll look at in just a bit, and any governance requirements. You can make them as exhaustive as you like. Essentially, you want to build a document that tells the world how this data asset is going to be used and how it's going to be governed. We'll come back to data contracts quite a bit in the next few slides as well.
Now in this situation, I've got two data contracts out here. So which one do you think is more valuable? The one on the left or the one on the right? The one on the right. But let's not get too dragged into the details. We don't want to make it perfect the first time around. What we can do is make iterative improvements. We can start off by simply saying, hey, this is the data team that owns my accounts contract data, and later, when you iterate into a newer version of the data contract, that's when you start adding more critical information, like where the source is. So you've already established the ownership, and now you've established the source of this data. You keep adding more and more information into the contracts. That's how it generally goes: you never get it right the first time, you keep adding information as and when you need it. Cool.
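The talk doesn't prescribe a format for data contracts, so here's a purely illustrative sketch of that iterative approach, expressed as Python dictionaries (YAML or JSON would work just as well); the field names are assumptions, not a standard:

```python
# Illustrative only: a minimal data contract for the "accounts" data set,
# expressed as a Python dict. Field names are assumptions, not a prescribed schema.
accounts_contract_v1 = {
    "dataset": "retail_banking.accounts",
    "owner": "retail-banking-data-team",   # first iteration: ownership only
    "version": "1.0",
}

# A later iteration adds the source, a schema, and quality expectations.
accounts_contract_v2 = {
    **accounts_contract_v1,
    "version": "2.0",
    "source": "core-banking-system",
    "schema": [
        {"name": "account_id", "type": "string", "pii": False},
        {"name": "customer_id", "type": "string", "pii": False},
        {"name": "opened_date", "type": "date", "pii": False},
    ],
    "quality": {"account_id": "not null and unique"},
}
```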
Unified Catalog and Cross-Account Access: Solving Discovery at Scale
And let's introduce the unified catalog. This is essentially there to solve our data discovery problem, which is a huge deal when it comes to an organization with many business units. Before SageMaker Unified Studio, the way this was done was with a central federated data governance account, the heart of which was Lake Formation and the AWS Glue Data Catalog. This was a valid architecture, and we've actually deployed it with multiple customers. But the problem is that it solves discovery only for the technical metadata and not for the business metadata. It does not support the business context.
So for that, what would happen is you would bring in a different, separately licensed service to give you that business data context. With Amazon SageMaker Unified Studio, you now get that business data catalog capability built in. It gives you the ability to add all your business-related information, like business terms, and attach it to your data assets. This is also a single catalog that spans all domains. Each domain will have its own Glue catalog, which is a technical catalog, but this is the one that spans across all of them, and you can have your business metadata in there as well.
And last but not least, what about cross-account access? We were talking about access and how hard it is across different technology stacks, with custom integrations being created over and over. All of this is now abstracted away by the service; it manages Lake Formation under the hood. You'll see how simple it is when we talk about barrier number three, which is cross-organizational governance. Now, at this stage, we've laid the foundations and we have somewhat broken the data silos. Of course, fully breaking them requires touching all the other barriers as well, but we'll take a look at that very shortly.
Now, with the first three solutions, we've solved the source of truth problem, which is the number one problem for most customers, and probably for you. What has happened is that data owners have become accountable for the data they are producing. The fourth one, the unified catalog, solves the problem of discovery at scale. The more domains that onboard onto your platform, the more data gets shared, and the more data that's shared into Amazon SageMaker Unified Studio, the more discoverable it becomes for everybody else, who can use it to build more business value. And the fifth one is cross-account access. Amazon SageMaker Unified Studio simplifies that by managing Lake Formation under the hood. So there's no more reaching out to IT folks asking, hey, can you add access to this asset for this team, with the request getting lost somewhere, undocumented, and nobody knowing who gave permissions to whom.
Barrier Two: Building Trust Through Data Quality
So now that we've broken the first barrier, let's start talking about the second barrier. You can discover data, but it isn't useful if you can't trust it. Remember, your AI agents will only be as effective as the quality of the data they look at, and so are humans, of course. There are four core challenges when it comes to trust in data. The number one is data quality issues. Forty percent of analytics projects fail due to data quality issues. Discovering those issues takes weeks of work: you find out there are problems only after weeks of working with that data set, and there are no proactive quality checks happening at this point.
The second one is lineage, or traceability: answers to questions like, where does this data come from? If your AI agent is going to look at data and give you some outcome, and you would like to trace that decision back to its data source, you want some traceability on your data. So what happens when trust breaks down? When you don't trust your data, it perpetuates the same problems that we saw in barrier one; you continue in your silos, and they become harder to break.
So from a barrier two perspective, the solution flywheel looks like this. We're looking at solving data quality, we're looking at solving data lineage, and we're looking at solving PII, compliance, and access governance. So, the impact of missing data quality. Imagine you work months on a project, only to realize that there are issues like the ones you see on the screen. Would you be able to build good quality data products and let end users use them? The answer is no. So in a data pipeline, where do you address this? Your data goes through multiple iterations and multiple transformations, and the quality is taken care of after some cleansing of the data. What we can do is come back to the data contracts and use them to create these data quality rule sets. We can see a set of four rules here, but we can have many more. We can send this information through the data contracts and then have it processed at a later stage using AWS Glue Data Quality.
With AWS Glue Data Quality, you have the ability to run these quality checks and get a score out of them. You can also perform machine learning powered quality checks: Glue analyzes your data, suggests quality rules, learns normal patterns, and detects anomalies. So how does this relate to SageMaker Unified Studio? Once you've collected your Glue data quality results and you publish the asset to SageMaker Unified Studio, you can see them when you look at the asset: the data quality scores and the detailed information. If you're a consumer, you'll look at this and say, oh hey, 80% is good with me, I'm happy to use this.
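As a rough sketch of how the contract's validation rules could land in AWS Glue Data Quality, the snippet below defines a DQDL ruleset and starts an evaluation run via boto3. The database, table, IAM role, and the specific rules and thresholds are all placeholder assumptions for the credit scores example:

```python
# Hedged sketch: turning the contract's validation rules into an AWS Glue
# Data Quality ruleset written in DQDL, then starting an evaluation run.
# Database, table, role ARN, and rule thresholds are placeholder assumptions.
import boto3

glue = boto3.client("glue")

dqdl_rules = """
Rules = [
    IsComplete "customer_id",
    IsUnique "customer_id",
    ColumnValues "credit_score" between 300 and 850,
    RowCount > 1000
]
"""

glue.create_data_quality_ruleset(
    Name="credit-scores-contract-rules",
    Description="Quality rules derived from the credit scores data contract",
    Ruleset=dqdl_rules,
    TargetTable={"DatabaseName": "risk", "TableName": "credit_scores"},
)

glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "risk", "TableName": "credit_scores"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # assumed role
    RulesetNames=["credit-scores-contract-rules"],
)
```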
Data Lineage, PII Detection, and Access Governance Solutions
The lineage impact addresses questions like: where does this data come from? How did you reach this decision? To solve data lineage, we can simply enable lineage capture from a Glue job. All you need to do is enable the generate lineage events option and provide the SageMaker Unified Studio domain ID, and that's it. All your Glue jobs automatically generate these lineage events and send them to SageMaker Unified Studio. One thing to note is that lineage capture is only available from version 5 of Glue, so if you're on version 4, think about upgrading. You can do this through code as well; it's just a configuration parameter that you need to set, and it will do the same thing: enable capture and stream the events to the DataZone domain ID.
But Glue is not the only tool in your data pipeline. You might also use orchestration tools like Apache Airflow, which is very common nowadays, including Amazon's managed Airflow, MWAA. There you've got support for the OpenLineage Airflow plugin: you simply enable the plugin within Airflow and set the transport type to the Amazon DataZone API. This is a new addition to OpenLineage, released back in June 2025, and all you need is IAM role access to the PostLineageEvent API. The same goes for dbt, again a very popular transformation tool, which supports the exact same transport type to DataZone. Essentially, you set the transport type to the DataZone API, provide the domain ID, and dbt does the rest. All you need to do is use the dbt-ol command instead of just dbt to run your transformations, with the same permission requirement for the PostLineageEvent API.
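All three of these integrations ultimately post OpenLineage events to the DataZone domain behind SageMaker Unified Studio through the PostLineageEvent API. As a hedged sketch of what that looks like if you call it directly, say from a custom pipeline step, with placeholder domain, job, and table names:

```python
# Hedged sketch: posting a hand-built OpenLineage run event to the DataZone
# domain behind SageMaker Unified Studio via the PostLineageEvent API -- the
# same API the Glue, MWAA, and dbt integrations use. All IDs are placeholders.
import json
import uuid
from datetime import datetime, timezone

import boto3

datazone = boto3.client("datazone")
DOMAIN_ID = "dzd_example123"  # assumed Unified Studio / DataZone domain ID

run_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "customer-360", "name": "build_credit_features"},
    "inputs": [{"namespace": "glue", "name": "risk.credit_scores"}],
    "outputs": [{"namespace": "glue", "name": "analytics.credit_features"}],
    "producer": "https://example.com/custom-pipeline",  # illustrative producer URI
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

datazone.post_lineage_event(domainIdentifier=DOMAIN_ID, event=json.dumps(run_event))
```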
How does this look in SageMaker Unified Studio? You can see a comprehensive lineage graph built from the events captured from Glue, MWAA, and dbt. So weeks of investigation are now sorted out within a couple of minutes, because it's already available.
So, unknown PII exposure. Imagine a security team does an audit looking for Social Security numbers using some kind of keyword search. But how often is that not the real picture? You'll find customer SSNs stored in columns named something like customer identifier, so the keyword search isn't the right way to do it, and compliance violations are just waiting to happen. So in your data pipeline, where do you solve for PII detection? As soon as the data lands on your data platform, that's when you do your PII detection. And how do you do it?
Again, let's get back to data contracts. This is the best way to classify, because now, as a data owner, I would tag each of these columns as PII or not. And there can be other classifications depending on the organization: you can have criticality and so on. All of that is something you can pass through data contracts, and it will be used as metadata when you're publishing your assets.
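Continuing the earlier contract sketch, column-level classification might look something like this; the labels and fields are illustrative assumptions rather than a fixed standard:

```python
# Illustrative extension of the earlier contract sketch: column-level
# classifications that travel with the data contract. The classification
# labels ("pii", "kind", "criticality") are assumptions, not a fixed standard.
customer_contract_columns = [
    {"name": "customer_identifier", "type": "string",
     "classification": {"pii": True, "kind": "ssn", "criticality": "high"}},
    {"name": "customer_email", "type": "string",
     "classification": {"pii": True, "kind": "email", "criticality": "medium"}},
    {"name": "segment", "type": "string",
     "classification": {"pii": False, "criticality": "low"}},
]

# At publish time these flags become metadata on the catalog asset (for example
# glossary terms or custom metadata), so consumers can see them up front.
```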
You can also use AWS Glue visual ETL. If you're not using data contracts, you can use visual ETL to detect sensitive data: AWS Glue uses machine learning to identify the sensitive data. You can scan the QR code on the slide, which will link you to the AWS blog showing how this works. And the last one for this barrier is, of course, solving for access governance, which answers the question: who gets access to which part of the data in the pipeline?
So when the data is raw, that is just ingested from the source, you would use Lake Formation RBAC and have limited access only for data engineers, operations, and compliance and audit teams. So this is where they'll go and look for PII data, right, early in the pipeline, not at a later stage. The next one is the silver layer where you'll have a little bit more of a broader access. This is more for cleansed data now. You have some data quality as part of this layer, so you can have access given to data engineers, analysts, and scientists.
And the gold layer, which is the more business-aligned layer, is where you can apply even finer-grained access control like row and column filtering. This is where the BI analysts, developers, and again the scientists and ML engineers get access, and also the business owners, because they need to create these dashboards. Cool. So now that we have established trust in data, we are ready to start looking at cross-organizational governance. Let me hand it over to my friend Sam here, who will take you through the remaining two barriers.
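Before Sam picks up the story, here's a rough sketch of what sits underneath that gold-layer model. In practice SageMaker Unified Studio manages these grants for you when subscriptions are approved, but the underlying Lake Formation call for a column-filtered grant looks roughly like this, with placeholder account, role, database, and column names:

```python
# Hedged sketch of the kind of Lake Formation grant that sits underneath the
# gold-layer access model: SELECT on a table, excluding sensitive columns.
# Account ID, role, database, table, and column names are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/BIAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "gold_customer_360",
            "Name": "customer_profile",
            # Everything except the columns analysts should never see.
            "ColumnWildcard": {"ExcludedColumnNames": ["customer_ssn", "customer_email"]},
        }
    },
    Permissions=["SELECT"],
)
```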
Barrier Three: Cross-Organizational Governance Challenges
Okay, nice to see you all again. It's been about what, 30 minutes? For my section we're going to be talking about discoverability, metadata enrichment, and of course access to the data that we're producing, and then finally, at the end, a couple of consumption methods. If you're more interested in the governance side of things, that's barrier three; barrier four is more about how we can use this data afterwards. Okay, so up until this point, we've defined our business units and we've modeled them in a catalog. We've onboarded our data into the cloud, and we've started to improve the quality of what we have through some of the capabilities that Rizwan just talked about.
But the issue we still have is, as a consumer of that data, I don't know what's there. It's sitting in a catalog, wonderful, it's all shared, or maybe I can find it, but I don't know what it is. So one of the challenges I have in my role as a data scientist is I come along and I discover that I don't have the data sources on day one. It's impossible to work, yeah, so I have to request access, coming back to the challenges that Rizwan mentioned.
Then I actually get a copy of the data, and it could be weeks, months, or years old, or it may not be reflective of the actual data being stored in the systems. This takes a lot of time. And then, unfortunately, it's a CSV so there's no descriptors, there's no information, I have no idea what it is I've actually been provided. So there's so much back and forth with SMEs, it's really time consuming.
Okay, so once again, scenarios that will hopefully resonate with all of you, I'm going to judge this by the nodding heads, but right now it's just everyone taking photos. But discovery is really difficult in a larger organization. With smaller teams, it's relatively simple, because you're all neighbors, yeah, probably go out after work together, you're really close, you know what each other does. But as your teams get larger and larger and larger, no one has any idea what each other is doing, which means it's really limiting when you want to produce something new. When you have a great idea, you don't actually know what data is available for you. This makes it hard.
Also, as we said, requesting access to data in a larger organization means raising ServiceNow requests. I've raised them, and they just don't get attended to. Sometimes they sit in the ether, and it's awful. You end up having to chase people over instant messaging and everything else, and it goes through the escalation chain. And even then, there's no clear visibility across teams, or for higher-ups, that this is happening.
The third one here, as we mentioned, is no business context. So once again, in my job I'm looking at CSVs, looking at all sorts of data, and I can't tell you what it means. The column names could be awful, there might be additional columns I don't need, and, as Rizwan mentioned, I could be getting access to things like PII. I definitely don't want that. And the last one here is that there's no way to give feedback either; there's no way for me to go back to them and complain about all these things I've just raised.
Cross-domain projects are still incredibly painful. Everyone nod your head. Do we all agree? Yes, at least a couple of nods, wonderful. So my four key challenges for Barrier 3 are unified discovery. Riz touched on it before. The second one is around managing access requests in an automated fashion. The third one is trying to solve for business context, and the last one is around standards. Because we are federated teams, we don't generally have standards for how we do things. It's just how we work, and it's really hard. So what we want to try and do is at least in the data world, we want to start solving for this.
We have shown a lot of flywheels, and we believe they're great because they force you to go back to the start every time. Just because you think you've solved it, no, you've got to go back and you've got to iterate. And with data you're continually onboarding it, right, so it means you're starting the cycle each time. So once again, we need to make sure any data we produce is discoverable. We need to make sure people can access it. We need to ensure that we describe it to the downstream consumers, and of course we need to do that with standards that they can follow.
Coming back to this diagram, I know we've shown it a few times, but now we have a few more tools down the bottom that we didn't show previously. We're going to be using IAM Identity Center for authentication, and this is great: it means that when you want to access any of this data as a human, you have a consistent way to do so. If you're doing it as a system or as a process, we're obviously relying on IAM, Lake Formation, and other permissions internally. We have resource provisioning, which is more about how we want to blueprint services. We're not going to touch on that one as much today, but it's really important in this space.
Domain units, which as Riz mentioned, we're going to be identifying at the start, we have talked about that. Membership management is going to be the key one, which we'll show some examples of. And the last one here is around pub sub workflows, and the pub sub element is more around how I can request access to something. It raises notifications to downstream teams, and they can approve it either conditionally or without condition. On the left you'll see we have some producers. On the right we have some consumers, so once again showing that all parties are playing together in the same story.
Solving Discovery, Access Requests, and Business Context
So let's talk about how we solve some of these things. Now, Riz mentioned the federated catalog is pretty much the heart of everything we're talking about. It's really powerful to be able to search with something like natural language and find everything across an organization. And most organizations at the moment are trying to do things like create internal wikis, and it's great and it's really powerful, but it's only as powerful as what you put into it. So it's that same kind of concept. Now the way that works, as Riz mentioned, is we've got Amazon DataZone at the heart, which is actually what's allowing us to do the federation between all the onboarded AWS accounts, and that's connecting to the AWS Glue Data Catalog in each of those accounts.
So coming back to ownership, each tenant, be them a producer or a consumer, is responsible for their own data in this sense. There's still enough governance and overlay to protect our business, but it's still our model. The second item in here, well sorry, the third one because I haven't indented it, is the business glossary. The glossary is really important. If I just put a CSV into my catalog, it's really not consumable. So the glossary is all around identifying taxonomy and terms that are important to my business that I can then tag to assets or even tag to column level within those assets. Really important.
We want to be able to search with those business terms. There's no point searching by customer_id_fintech_a; it's not helpful. So we want to be able to actually search this taxonomy that we're creating. And the last bit here is a bit of a distinction around data assets versus data products. At a really high level, because we all talk about it, a data asset is something that we've produced from maybe a system, we've done a little bit of refinement, and we've onboarded it to the catalog. Great. It's something I can use, but intrinsically there may not be any value in it yet, because someone has to do something with it before the business says, yes, I've now realized that value.
So it could be as simple as here's a transaction report, wonderful. But until I take that transaction report, maybe dashboard it, maybe refine it, maybe build an ML model over it, I don't have that data product yet. It's still just an asset that we're working with. So this is what the experience could look like. I'm logged on. I'm going to be browsing to my catalog. And from here, when I'm browsing, you'll see there's a number of items that are coming down. Sorry, the items are on the right, the filters or the facets are on the left, so I can specifically search by certain product types that are in here, be them dashboards and so on. But also on the left I can search for those data asset items as well.
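The same search the Unified Studio UI exposes is available programmatically through the DataZone SearchListings API. A minimal sketch, assuming a placeholder domain ID and abbreviating the response shape:

```python
# Hedged sketch: searching the unified catalog programmatically with the
# DataZone SearchListings API, the same search the Unified Studio UI exposes.
# The domain ID and search text are placeholders; the response handling is
# abbreviated and assumes the listing item carries a name and description.
import boto3

datazone = boto3.client("datazone")

response = datazone.search_listings(
    domainIdentifier="dzd_example123",
    searchText="credit score",   # a business term, not a physical table name
    maxResults=10,
)

for item in response.get("items", []):
    listing = item.get("assetListing", {})
    print(listing.get("name"), "-", listing.get("description"))
```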
So now that we've sort of talked a bit about those challenges, I want to come back to them for the second one, and this is around access request handling. So in this model here, you'll see at the heart this is all around that concept of pub sub workflows. Think ServiceNow, if you've used that system before. It's the same idea. Someone raises it, someone responds to it, everything goes well, assuming of course people acknowledge these requests. So what we're doing now is rather than relying on emails, instant messaging, and really tense phone calls, we're trying to have this in an automated way that we can actually have an overview for as well.
This approach is all embedded in the cloud rather than external in another system. So what could this look like for an end user? From here I've found my asset and I've subscribed to it. I can provide commentary. That's it, it's as simple as it is. Now we're going to switch personas, and now I'll be the owner of that original item. I can come in, I can see the incoming requests, and see information about it. I can see exactly what they're after, and I can say great, you've got access.
So that whole process, if your teams are diligent and they're paying attention, it's five to ten seconds versus days, weeks, or months as Riz mentioned earlier. It's great that I can approve access to an entire dataset, but sometimes we don't want to grant access to everything. There could be fields or properties or columns that we don't want to expose. So when these requests come through, I can approve it, but I can approve it conditionally. Maybe certain columns are available and certain ones are not available.
This is really powerful. Think back to that personally identifiable information example that Riz gave before. I could have this data, maybe it's sitting in my catalog, maybe I need that personally identifiable information related data for something of value to me, but I absolutely don't want to expose it to someone else. I'm creating these extra points where things can go wrong. I'm happy storing it, but maybe another team doesn't have compliance certification or some appropriate level that grants them access to this as well. That doesn't mean they can't have the data, it just means they can only have the parts that are relevant.
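For teams that want to automate or audit this flow, the same pub/sub pattern is exposed through the DataZone subscription APIs. A hedged sketch of the consumer and producer sides, with placeholder domain, listing, and project identifiers:

```python
# Hedged sketch of the pub/sub access flow via the DataZone subscription APIs:
# the consumer raises a request, the producer reviews and approves it.
# All identifiers and the request reason are placeholders.
import boto3

datazone = boto3.client("datazone")
DOMAIN_ID = "dzd_example123"

# Consumer side: subscribe a project to a published listing.
request = datazone.create_subscription_request(
    domainIdentifier=DOMAIN_ID,
    subscribedListings=[{"identifier": "listing-credit-scores"}],
    subscribedPrincipals=[{"project": {"identifier": "customer-360-project"}}],
    requestReason="Credit scores needed for the Customer 360 churn model",
)

# Producer side: review and approve the incoming request.
datazone.accept_subscription_request(
    domainIdentifier=DOMAIN_ID,
    identifier=request["id"],
    decisionComment="Approved for non-PII columns only",
)
```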
Solving for business context. In this case here, this is where we want to create a taxonomy that's important to you. It's no good that I can search for CSVs, it's no good that I can search for the name of that CSV or a database table name or a property within the database. This is not helpful, especially if you're using systems like, and I'm not putting these systems down, but systems like SAP because of the way they do the column naming. You either have to ETL it into something that's discernible by a human or leave it as is, but if you onboard data like that, it might be really hard to understand what it is. So if I own that and I'm onboarding it to the catalog, I'm then responsible for making sure it's usable by downstream consumers.
In this case here, I'm going to be adding some glossary terms, and you'll see that these glossary terms are very relatable to the financial services industry customer here. These are all things like what business unit you're a part of, it might be particular projects within that business unit, or it might even be the names of applications or things that people are familiar with within that particular organization. Once we've created all this taxonomy, we can then assign it to the particular data assets that we were talking about or the data products as well.
Going even further again, we also want to be able to do it at a column level. Now, Riz mentioned a really good example about social security number versus customer ID versus whatever else. Different business units represent their data in different ways. I call it a customer ID; someone else calls it customer ID but stores an email instead of the ID. This happens because teams aren't working together on the same projects. They're federated in nature.
So what we can do here is we can actually create identifiers or taxonomy within our business, and then we can link that taxonomy to specific columns. Now I can reliably say what is a customer ID versus a customer email versus a customer social security number, but it relies on the fact that we are all responsible and obviously we follow processes like this. Of course, this wouldn't be a re:Invent presentation if we didn't talk about AI, but if these sorts of tasks are too laborious for you as well, which generally they are, it's hygiene. We don't want to do hygiene, it takes too much time.
We can actually use things like Amazon Q which is built in, and it will do all that work for us. It can create descriptions for assets or products, it can do a lot of this enrichment around these columns, it can automatically assign, it can look at your data and infer what your taxonomy could be. So it's really powerful if you're trying to understand what you should do, you've got this assistant there to help you as well.
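And if you'd rather script that hygiene than do it by hand, the glossary itself can be built through the DataZone APIs. A minimal sketch, assuming placeholder domain and project identifiers and purely illustrative terms:

```python
# Hedged sketch: building the business glossary programmatically with the
# DataZone glossary APIs. Domain, project, and term names are illustrative.
import boto3

datazone = boto3.client("datazone")
DOMAIN_ID = "dzd_example123"

glossary = datazone.create_glossary(
    domainIdentifier=DOMAIN_ID,
    owningProjectIdentifier="central-governance-project",
    name="Customer 360 Glossary",
    description="Business terms shared across retail banking, risk, analytics, and fraud",
)

for term in ["Customer ID", "Customer Email", "Credit Score", "Churn Risk"]:
    datazone.create_glossary_term(
        domainIdentifier=DOMAIN_ID,
        glossaryIdentifier=glossary["id"],
        name=term,
        shortDescription=f"Agreed definition of '{term}' across business units",
    )
```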
We mentioned the idea of data assets versus data products earlier; this slide spells out what it means in more detail. I gave you the TLDR before, but this on a page is trying to give you that understanding. Lots of businesses create data assets all the time, one-off, two-off; maybe they're not useful for someone else, maybe they are, but they tend to have a shorter lifecycle. Versus on the right, your data products last for a long time. You want these things to last, you want to update them, you want your business to come back to them over and over and derive value from them.
In here as well, you'll find that on the asset side of things it's usually just a schema representation; it's not enriched. As we move through and it becomes a product, all of a sudden this becomes something we want to reuse, we want people to use it, we want to be proud of it. And the last two statements at the bottom: a data asset, typically, think of something like a CSV. It might be emailed to you, and it's up to you if you use it.
It's probably a copy of data from somewhere, maybe it's a section of the records, we're not really sure. But as we move to the right, of course, it becomes a data product. This is a team giving attestation to something to say, this is what we're going to keep up to date. Everything's been nicely curated, this is what you need to use for that problem.
A different representation of that is, of course, if we go from bottom to top. Your raw data is what comes out of the system, surprise surprise. As we promote it into a data asset, we start adding a little bit of technical information about it. We might start cleaning it using some of the services Rizwan mentioned. And then, of course, as it graduates to be a product, this is where we start adding all these different features that Rizwan mentioned as well. This is where it becomes something you can trust, you can rely on.
Our flywheel for this section, surprise surprise, make sure we have our unified discovery so as we onboard, people can find it. Make sure we have access to it, provide your business context so people can understand it, and of course, we want to adhere to these standards so everyone follows the same way, everyone follows the same route.
Barrier Four: Data Consumption Patterns and Customer 360 Use Case
Today I'm going to talk about the consumption barrier. All these things are great, and they take lots of engineering and lots of effort, but if we're not using this to produce something of value, we're missing the point. So using Customer 360, we're going to talk a little bit about churn prediction. It's something that a lot of financial services institutions want; in fact, anyone offering a product to the market wants this kind of concept and story.
There are three main things I'm going to talk about. One will be a quick experiment, the second one will be managed services that might assist, and the third one will be how can we present the results of these experiments to the business. So if you're conducting any sort of experiment with data, machine learning, even analytics, right, you need to start with a hypothesis. I can't stress this enough. This is something you come back to. This is what validates that what you're doing has actually hit the mark. This is what you tell the business you're doing, and this is what you come back to and say I've achieved that. Otherwise, it's just playing around, it's not of value yet.
And so for us, we've seen that there are some subject matter experts who might anecdotally know some details about their space. Surprise surprise, they work in it every day. So they might be able to help you and tell you what's in the data that's of value. If we walk through this story, starting with our seven-stage experiment: step one, we want to start with data sources. We've created our hypothesis. Now I want to find out, does my business have anything of value for me?
So coming back to that catalog that we talked about, data's been onboarded, it's been enriched, it's become a product, it's something I can rely on, I can search for it now. So all of these assets or products are going to be registered in the catalog. Now from here I can actually have a look at these particular items. I can use natural language to explore them. I can even write my own SQL queries as you can see within the experience, but it's a really good way for me to get involved. Additionally, if SQL is too difficult, let's face it, it's a foreign art nowadays, right, with all this generative AI stuff, so from here I can actually use Amazon Q to create the SQL to interrogate the data as well.
Then we want to load and we want to prepare the data, so now we're kind of moving more into the Python world. So we could be using JupyterLab within Amazon SageMaker Unified Studio. And so from here I can take that idea of what I want or what I know, I can either use an Athena embedded widget inside those notebooks to query it, or I can use boto3 to query it. I can even use SageMaker Data Wrangler to query it. There are multiple ways that I can get it available for me to work with.
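One possible shape for that load-and-prepare step from a notebook is to query the subscribed tables through Athena with the AWS SDK for pandas (awswrangler), a convenience library not named in the talk; the databases, tables, and columns below are placeholders for the Customer 360 example:

```python
# One way to do the "load and prepare" step from a notebook: query the
# subscribed catalog tables through Athena with the AWS SDK for pandas
# (awswrangler). Database, table, and column names are placeholders.
import awswrangler as wr

sql = """
SELECT c.customer_id,
       c.tenure_months,
       r.credit_score,
       f.suspicious_txn_count
FROM   customer_analytics.customer_profile c
JOIN   risk.credit_scores r   ON c.customer_id = r.customer_id
JOIN   fraud.txn_behaviour f  ON c.customer_id = f.customer_id
"""

# Runs the query in Athena and returns a pandas DataFrame.
df = wr.athena.read_sql_query(sql=sql, database="customer_analytics")
print(df.head())
```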
We want to split our data. Generally when we run machine learning experiments, we don't run it across the entire data set, we're using segments. So now because I've interfaced with SQL, I now have a large data set. I could split it with SQL, alternatively I can split it with code. Either way, it doesn't matter, that's the screenshot on the left. Then as we move to the right, in our case, we're going to be training a really simple classifier model for this, and so you'll see there's a few lines of code. We usually can't avoid this until we get to a little bit later in the presentation.
Now evaluation, I can't stress this enough, this is when we're going back to that initial statement. I don't know if anyone read the footnote, but there was a percentage. We want to have a percentage that we can come back to in terms of accuracy. So what I'm showing you here is actually a confusion matrix, which is a way that we can represent a classification problem in terms of evaluation. This is really useful. This is also a data product, right? This is something I've produced with that data.
So if I produce a result or a report with the model that I've trained, why wouldn't I include this as well to tell the business this is the accuracy percentage, and this is how good my experiment was. Really important we keep this, we surface it back to the business. Now here what we want to do is we want to validate those anecdotal thoughts we had at the start. Our subject matter experts told us what they thought, they gave us the idea of what data could contribute towards an outcome. We can then actually have a look at feature correlation and go, yeah, wonderful, they were correct. We can show them how efficient they were in presenting that business problem as well.
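A minimal, self-contained sketch of these experiment steps, using synthetic data and scikit-learn in place of the real Customer 360 tables: split, train a simple classifier, evaluate with a confusion matrix, and check which features correlate with churn:

```python
# Minimal sketch of the experiment steps described above, with synthetic data
# standing in for the joined Customer 360 tables. Feature names are illustrative.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined customer data set.
X, y = make_classification(n_samples=5000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
features = ["tenure_months", "credit_score", "suspicious_txn_count", "product_count"]
df = pd.DataFrame(X, columns=features)
df["churned"] = y

# Split: never evaluate on the data you trained on.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42)

# Train a simple classifier.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate: the accuracy and confusion matrix are what go back to the business.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Validate the SMEs' intuition: which features correlate with churn?
print(df[features + ["churned"]].corr()["churned"].sort_values(ascending=False))
```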
We then want to execute across our dataset, so we export the entire dataset via SQL using the methods mentioned before, and all of a sudden I can start overlaying things. In here, I've got a 70% marker around the center point, because this might be something of interest for the business: if a customer falls above that percentage, they're likely to leave; if they fall below it, I don't have to worry about them. This in itself is a really useful report for someone who's not technical. They could be a product owner, they could be part of the business.
Then what we want to do is we want to take those results, we want to put them in Amazon S3, and we want to start the process of onboarding to generate that new data product. Once again, coming back to the flywheel, and that's what we're talking about here. So we've put it in S3, we then want to put it back in the catalog, and now anyone else within the business can see that data and use it for something. Thus completing that flywheel. Once again, that's just one rotation of it, you're going to do it many times as you come up with new ideas.
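A hedged sketch of that last step: writing the scored results to Amazon S3 as a Glue-catalogued table (here via the AWS SDK for pandas) so they can be published back into the unified catalog as a new data product. Bucket, database, and table names are placeholders, and the threshold column mirrors the 70% marker from the example above:

```python
# Hedged sketch: land the scored results in S3 as a Glue-catalogued table so
# they can be published back into the unified catalog as a new data product.
# Bucket, database, and table names are placeholders.
import awswrangler as wr
import pandas as pd

scored = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003"],
    "churn_probability": [0.82, 0.35, 0.71],
})
scored["high_risk"] = scored["churn_probability"] >= 0.70  # the 70% marker

wr.s3.to_parquet(
    df=scored,
    path="s3://example-analytics-bucket/products/churn_predictions/",
    dataset=True,
    database="customer_analytics",   # Glue database the product lands in
    table="churn_predictions",
    mode="overwrite",
)
```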
Can we make it easier? We absolutely can. At re:Invent there have also been features announced for Amazon SageMaker Unified Studio that show even easier ways, so please keep a lookout for those if you're interested. But what we can do in Data Wrangler is reference that data catalog item that we've created, or export and import it, either way, depending on the flavor you want. We can bring it into Data Wrangler and see a high-level representation of the data.
We can create workflows of what we want to do with that data, and these could be things like machine learning tasks around one-hot encoding, splitting data, joining data, all sorts of tasks related to getting the data into the state I want to use it in. We can perform transforms, those things I mentioned before. On the right there's actually a list; it's natural language search, so we can find a whole heap of them, and then it gives us an overview of how the data was affected by each one.
Analysis at the end: this is the final result of what your data looks like, just before we train the model. In here it's telling us the types of data we have, it might be telling us how many values sit in categorical columns, and it gives us an overview of what we've produced or what we're about to work with. From here, it's one click to build a model. Obviously not all models are supported, but for a classification problem, it's one click to build a classification model. You'll see in this confusion matrix that the results are pretty much the same as the data science experiment I ran myself, but I didn't need Python this time around. So there are some really easy ways we can use these tools and be really smart and creative in getting outcomes.
Consumption, this screenshot's really boring, I do apologize, but now that we've got our dataset, we've got our model, I can execute across the whole thing, now that I'm comfortable with the accuracy of it. But we want to present all this to the business. I've done some good work, I've learned something, either done it the hard way, the semi-easy way, and there are easier ways coming, but I've produced something of value and I now need to tell the business, look at me, it's time for my promotion, right?
So, in Amazon SageMaker Unified Studio, when we find these data products or data assets, there's a very quick way that we can promote them into Amazon Quick Suite, very quick, it's just one click. Yeah, and Amazon Quick Suite is actually part of the resource provisioning stage that we talked about earlier. This is all blueprinted and it comes with the product. So it's one click onboarding.
From here, we can then create an analysis of our data, so this is a representation of the customers who we just predicted are going to churn. It's nice and simple, I haven't done anything with it, and it's just presented there as a widget. Now on the right, if I don't know enough about how to do this, I can use natural language to create this as well. I can then populate that analysis to a dashboard, all of a sudden this is looking pretty businessy. This is what I might pass up the chain. In here you see a combination of markdown widgets, tables, et cetera, et cetera.
Now with those dashboards, once again, they are data products, and they can be shared with other teams downstream. I can create things like executive summaries, which are a really high-level view, the TLDR of all the data I'm looking at. I can create data stories, where I can use generative AI to create a sequential walkthrough of what I'd like to present to someone. And then scenarios are where I can use natural language to pull in data, move data around, and reflect what I'm asking for.
And so the wrap-up here is that with all of these different barriers we're talking about, you have to keep the flywheel spinning, you have to keep going; it's a repetitive cycle. I hope you've enjoyed the flywheel and the way we've put this together; we've tried to make it as simple as possible and easy to consume. What we've seen now are all the data mesh barriers, and once you apply these flywheels and the solutions that we've shown, these barriers essentially become enablers. This enables your data mesh and your cross-organization data discoverability, data usability, and data consumption. Your AI agents are going to do far better when they have access to data across your organization, and high quality data at that.
Let me share some of the key learnings from my experience. Number one, we spoke about the first barrier, where the source of truth and data ownership take precedence over everything. Unless and until we have that shift in the way we work, where data owners take responsibility for the data they are producing, nothing is going to work. That's number one.
Domains essentially are the owners of data, but the platform teams own the standards. We have to segregate those responsibilities. Start small, prove value, and scale. Start with a use case that is relatively simple but with high business value so you can quickly keep showing the business the value, and you can just keep on repeating this in a flywheel fashion.
The last is, of course, your consumption patterns. The better and simpler your consumption patterns are, the more you increase the value of AI consumption and experimentation. You need to keep experimenting on the data that you are producing. If you're just creating data products that nobody is going to use, then there's no point, which is why you should start small, look at what business value you're going to add, and then work backwards from there.
With this, we'd like to thank you for attending the session. That's it. Thanks for sitting with us.
This article is entirely auto-generated using Amazon Bedrock.