🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Enabling AI innovation with Amazon SageMaker Unified Studio (ANT352)
In this video, AWS and Commonwealth Bank of Australia present their data transformation journey using Amazon SageMaker Unified Studio. Raghu from AWS discusses the importance of data foundations, covering seven data literacy actions including building data inventories, data annotation, PII detection, and establishing data culture. Terri Sutherland explains how CommBank migrated 61,000 on-premise data pipelines (10 petabytes) to AWS in nine months, implementing a federated data mesh architecture serving 40 lines of business through their CommBank.data marketplace. Praveen Kumar demonstrates the technical architecture, showing how Amazon SageMaker Unified Studio enables multi-account data governance, self-service data discovery, and integrated workflows for data engineers and scientists. The demo illustrates end-to-end processes including metadata enrichment using agentic AI, data quality checks, and building fraud detection models across distributed AWS accounts while maintaining strict governance controls.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Critical Need for Data Foundations in AI Innovation
Thank you for coming to this session. It's so early on a Monday, and it's almost like all of us here are kicking off re:Invent 2025 together, so kudos to all of us. My name is Raghu. I'm a Principal Technical Product Marketing Manager at AWS, and in a few weeks I will have been with the company for ten years. Welcome to today's session, which is Enabling AI Innovation with Amazon SageMaker Unified Studio. I'll kick things off, and I have two other speakers here today. Terri Sutherland is the General Manager for Data Platforms at Commonwealth Bank of Australia, or CommBank for short. She's going to come on, and then we have our star Principal Solutions Architect, Praveen, who is also from Australia. I'm very thankful to Terri and Praveen for coming here all the way from Australia to do this session with us.
I should have done this first, but this is today's agenda. Like I said, I'm going to kick things off and set the stage, and then I'm going to hand this off to Terri. Terri is going to walk us through CommBank's journey using SageMaker Unified Studio and some of the other capabilities within SageMaker, including the problem statement, outcomes, and impact that they have had at the bank. Then Praveen is going to come on, and he is going to walk us through the architecture that they built within CommBank and also do a demo. This is going to get very technical once Praveen comes on, and I'm sure those bits are going to be some of the more exciting parts of today's presentation.
We're going to start off with some data and research. This is a research analysis done by Harvard Business Review this year, and what it tells us is that 89% of CDOs have an initiative to build applications using AI and generative AI, and the majority of them are already building those applications. Yet 52% of those CDOs don't think they are ready to embark on the AI journey. The reason for that is they don't have enough trust in their own data. They don't have enough faith in their own data because they don't know their data.
As much as we all want to do exciting things with AI and generative AI, agentic this and agentic that, I hate to break it to you, but we kind of have to go back to the basics and get our house in order with our data. So what does that mean? It means building a sound data foundation within the organization. At AWS, we look at this as two very broad things. One is improving data literacy within the organization, which is, do you know about your data? Do you know what good looks like? These are not easy questions to answer. So improve the data literacy within the organization, and the second is build a data culture. We will look at these two parts, a brief overview of what this means, and then we'll go to the CommBank story.
Seven Data Actions for Improving Data Literacy Within Organizations
I may have spent the majority of my time here at AWS doing this with our customers, and I believe that most folks who are here can relate to many projects that you may be currently working on that do this exact thing. There are seven data actions as part of improving data literacy within the organization. This is not an exhaustive list; these are some of the main items that most customers do to improve the data literacy within their organization. Let's just go through them one by one, twenty to thirty seconds each. Build a data inventory. What does that mean? It can mean different things, but broadly this is what it is: a directory, a list of all your data.
This includes tables, S3 buckets with structured data, S3 buckets with images, and also applications like dashboards and models. You should build a list of all that data. We call them data assets. This in itself can be a pretty elaborate topic, but I'm going to just do this really quickly.
You can build this a few different ways. One is you can start at the bottom with the data and build an inventory of tables, S3 buckets, and so on. Some customers like to do it from the top, which is from the application side. They like to build an inventory of their BI dashboards and models and then work their way down. Some customers just start with a line of business, figure it out, and get a process going in terms of building that inventory, and then rinse and repeat that from line of business to line of business. Commonwealth Bank is going to talk a little bit more about that. So that's what building a data inventory means at a high level.
Second is data annotation. This is probably the most boring part of the whole journey, but there is generative AI to help all of us. What does that mean? When you build a data inventory, what does it contain? It just contains metadata, cryptic table names and column names, and no one can really make sense out of it. Neither humans nor agents can make sense out of technical data, and in most cases, we will all get it wrong. So data annotation means describing technical data and giving it business context. This is a customer table for product X, this is a regional customer table, and so on. Someone or something will have to go and annotate this data. Like I said, not very exciting, but very important.
From there, detection of PII data. If you're going to be doing modeling or training, or you're retraining a foundational model, you do not want PII data as part of that project. So detection of PII data is super important. Once you detect PII data, what do you do next? You categorize it: most sensitive, least sensitive, and so on, in whatever way fits your company's security SLAs.
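The talk doesn't prescribe tooling for this step, but as a rough illustration of what automated PII detection can look like, here is a minimal Python sketch using Amazon Comprehend's detect_pii_entities API. The sample text and the idea of scanning column samples this way are assumptions for illustration, not something shown in the session.

```python
import boto3

# Minimal sketch: scan a free-text sample for PII with Amazon Comprehend.
# The sample text below is illustrative, not taken from the session.
comprehend = boto3.client("comprehend", region_name="us-east-1")

sample = "Jane Citizen, jane@example.com, card ending 4242"
resp = comprehend.detect_pii_entities(Text=sample, LanguageCode="en")

for entity in resp["Entities"]:
    # Each entity has a Type (e.g. NAME, EMAIL, CREDIT_DEBIT_NUMBER) and a confidence Score.
    print(entity["Type"], round(entity["Score"], 3))
```

In practice you would run a check like this over sampled values from each column (or use a managed sensitive-data detection feature) and feed the results into the categorization step described above.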
From there, data quality. How good is this data? We all talk about bad data. How bad is it? So data quality is part of building the data literacy within the organization. Then lineage. Most people here probably know what it is: data origin, data transformation, where did this data come from. That's being tracked.
And then finally, publishing new data assets. This is one of the areas where I have seen customers struggle. You do all of it, but your data inventory is not static; it is ever changing. So you have to build the practice where, having made this significant investment, the new assets that are created every day get put back into the inventory. It's a practice, it's a habit. Many customers do it really well. A lot of customers are still trying to figure it out.
Like I said, I've been doing this for a long time, and I just want to leave you with this and a couple more slides. Choose your own journey for improving data literacy within the organization. What I have on the screen is just a recommendation, a suggestion, but choose your own adventure in improving the data literacy within your companies. Most companies start with PII because they want to de-risk, then they go to data quality and then inventory, and so on. So pick your journey and then stick to it. If you take a step back and look at it, even if you did four or five of these things, you would be significantly improving the data literacy footprint within the organization.
Building a Data Culture: The Producer-Consumer Model
So that's data literacy. Second, building a data culture. The best way to explain this is that we all have learned to be very social digitally with Facebooks and LinkedIns of the world. We all have built a pretty good social culture. The same thing applies to data culture. I'm just going to draw some parallels here. When you want to make a social platform, whether it is for videos or whether it's for data, you really have three things.
You have producers, content creators. You have consumers who consume what the producers create, and they like, comment, and use. And then there is a third party, which is the central team. The central team provides you with the platform. They provide you with guardrails, policies, best practices, do's and don'ts, so the whole activity of being social is sustainable.
Commonwealth Bank built this, and they will walk you through their story: how they started, how they built it out once they got it right, and how they expanded their expertise and built a data culture within the organization. So this is extremely important. And again, when you build a data culture, it really needs to be in the organization's DNA. It needs to survive executive churn. These things take time. You have to build it over time, and you've got to make it a habit within the entire company.
Okay, so we talked about data foundations, we talked about data literacy, we talked about how you build a data culture, and none of these things had anything to do with AWS services. These are just good data practices. So how can we help you? If this is important to you, if PII detection, building a data inventory, building the data culture is important to you, well, how do you get started?
Amazon SageMaker: A Three-Layer Data and AI Platform
Well, last re:Invent, re:Invent 2024, we released, we relaunched, I should say, Amazon SageMaker. It is our data and AI platform. It's got lots of features and functions, and we broadly look at Amazon SageMaker like a three-layer cake. We'll go from bottom up. At the bottom, we have the lakehouse architecture, all Iceberg-based data architectures, Iceberg REST Catalog functionality, a lot of managed capabilities built into our lakehouse. So we've got sessions on lakehouse, so definitely check those out.
And on top of it is the data and AI governance. Most of what I just talked about in the last 10 or 12 minutes fits into the data and AI governance capabilities built into Amazon SageMaker. And at the top, we have the unified data experience. Well, what does that mean? So you've got a lakehouse, you have good data governance capabilities, which means you can search data, you can find data, and you can get access to data. Then what do you do with it? Well, that's where the Unified UI comes into the picture.
SQL capabilities, data processing, building data pipelines, generative AI app development are all part of our unified user experience. But obviously we are not stopping there. We are building more capabilities like streaming and BI and even search analytics, and some of these capabilities will come in 2026. So this is just zooming into the Unified Studio experience within SageMaker.
As you can imagine, this is meant for data workers, whether you are doing SQL, whether you're doing modeling, training, building pipelines. We have notebooks, as you can imagine. We have notebook experiences built into the studio. We have query editors built into the studio. We have, this is my favorite, visual low-code, no-code experience built into the studio as well, which means you can build data pipelines, you can build ETL jobs with absolutely no code. And then we also have generative AI chatbot building experiences as well.
And this is my last slide before I get Terri on the stage. This is the second layer of Amazon SageMaker, and, like I said, most of what I discussed, which is data literacy and building the data culture within the organization, is all part of the data and AI governance functionality within SageMaker. You can do PII detection, you can do data quality checks.
We have an awesome business data catalog that is awesomely priced, almost free. It's got great capabilities, so definitely check those out. Hopefully you start building your amazing data practice on Amazon SageMaker.
Before I get Terri on, I do want to mention one thing. You saw the three layers, and I should have said this a couple of slides ago. You don't have to use all of it all at the same time. You can start your journey anywhere. If data governance is the most burning problem for you, then you can start with data governance. If lakehouse is your priority, then start with lakehouse. You don't have to use everything. We truly believe if you use all of it, you will get the best experience, but you don't have to. You can start somewhere and then expand your overall usage of Amazon SageMaker as the needs come.
Commonwealth Bank of Australia's Data and AI Strategy: People, Safeguards, and Technology
Terri, thanks folks. Oh, thank you. Thanks very much. Hi everyone. Thank you, Raghu. Really pleased to be here. My name's Terri Sutherland. I'm the General Manager of the Cloud Data Platforms at the Commonwealth Bank of Australia, and I'm super excited to be here today to talk to you about our ambitious strategy and AI roadmap and transformation journey.
But first let me tell you a little bit about the bank itself. Though many of you have probably never heard of the Commonwealth Bank of Australia, or CBA as we call it, we are Australia's largest bank. Australia's population is 27 million people, and CBA services 17.5 million customers. That means one in three Australians and one in three businesses bank with us. Overall, 50% of transactions go through CBA, and on the global stage we are the 13th largest bank by market value. I'm excited to say that we've just been named top four bank globally for AI maturity. Oops, sorry, didn't realize you hadn't moved the slide.
Given the importance of CBA to the Australian economy, our data and AI strategy is one of the bank's most valuable assets. It underpins everything we do, from protecting our customers against fraud and scams to delivering seamless personalized experiences. Whether you're accessing cash at an ATM, paying for groceries, or applying for a home loan, data and AI powers the services that our customers rely on every day.
At the heart of this strategy are three core pillars. Firstly, and always most importantly, our people. Like many organizations, we realized central engineering and AI teams just couldn't scale. So the first step of our strategy was to decentralize and embed our data engineers and data scientists across the lines of business in the bank. By doing this, we brought the data and AI closer to the people who use it and closer to the people we serve, our customers.
Our second pillar is safeguards. We're a bank, we manage sensitive customer information. So we've designed governance and controls to be inherent in every single stage of the data and AI lifecycle, ensuring safety by design. Finally, our third pillar is technology. This is where our partnership with AWS comes in. We needed to take decades of rich data spread across hundreds of source systems and put it into the hands of our federated data and AI teams.
To achieve this, we established a data mesh ecosystem that empowers our federated teams to operate independently. It moves data seamlessly, ensures access for people and machines, all the while enforcing strict governance. We call this ecosystem CommBank.data. Today it provides our 40 lines of business across the bank the freedom to produce and use data all within a trusted, traceable, controlled framework. By decentralizing, we adopted a clear producer-consumer model.
Each business unit now owns and manages its data as a product with defined roles and responsibilities. We also introduced self-service data sharing through a unified data marketplace, a single pane of glass where users can discover, request, and consume data across the entire AWS ecosystem. But here's the reality: data has gravity. Where the data lives is where every single data engineer and data scientist in the bank will work.
Historically, Commonwealth Bank of Australia had decades of rich data held in on-premise platforms, platforms that lacked interoperability or could not scale for AI. And so we made a bold move. We migrated 61,000 on-premise data pipelines to our AWS mesh ecosystem. That's equivalent to 10 petabytes of data. The migration took us nine months with 100% of data pipelines tested at least three times. That's 229,000 tests. And in so doing, we moved our entire data engineering and AI workforce to AWS cloud.
CommBank's Dual Transformation: Migration and Marketplace Implementation
To make all of this real, we had to run two major programs in parallel: migration and marketplace. We migrated over 10 petabytes of data from on-premise to AWS cloud and built the data marketplace aligned to our federated, decentralized operating model. We started with migration. Earlier last year, we kicked off a series of workshops with AWS to test our most complex data flows and AI use cases to see if the migration to AWS native technologies was possible. We call this approach steel threads.
Like an MVP or a proof of concept, a steel thread proves the technology fit, but it also productionizes the outcome. We built AI and generative AI that transformed code, checked for errors, and tested output, reconciling every single table to our on-premise platform 100%. Every single row, column, and number had to be accounted for. But this was more than a tech build or migration. It was also a major change management effort.
We onboarded federated data teams through 200 tailored sessions and trained over 1,000 engineers, embedding change, building capability, and driving sustained adoption. And throughout, we also worked closely with local and international regulators to ensure compliance and continuity. Now let's talk about the second major stream of work: building the data marketplace. This was a fundamental shift in how we think about data ownership and access across the bank. Each line of business was becoming a producer, a consumer, and in many cases, both.
We started by building our technical foundations, a federated data mesh platform designed for scale, governance, and decentralized ownership. As we transitioned from a monolithic to a composable platform, we faced the challenge of how to seamlessly connect hundreds of line of business owned AWS accounts while keeping the user experience frictionless. So we implemented an abstraction layer, one that could unify access to data, offer flexibility in compute and UI choices, and uphold our rigorous governance standards. Praveen will bring that experience to life in a demo shortly.
Once that was all in place, we implemented Amazon DataZone to enable discovery, access, and sharing across the organization, bringing on early adopters. And with the launch of Amazon SageMaker Unified Studio earlier this year, we added a single-pane-of-glass experience so everyone from analysts to executives could see and use the data they needed. In completing this, we aligned our engineers and onboarded producers and consumers to a shared framework, one that enables self-service while maintaining governance and interoperability across the mesh.
Transformation Benefits and Key Lessons Learned at Commonwealth Bank
And now with the right data in the right place, we're ready to scale with governed AI. Let's look at the benefits of what this transformation is allowing us to do today. When we spoke to our engineers and data scientists, they told us CommBank.data marketplace has transformed the way they work. Now they connect directly from their local VS Code environment, no longer restricted to SageMaker or remote platforms. This access streamlines workflows and speeds up experimentation.
One of the standout features is the single-pane-of-glass experience. Instead of switching between tools and interfaces, teams have a unified dashboard for data discovery, analytics, and monitoring. The Spark UI is easily accessible, allowing real-time tracking of query performance and quick identification of bottlenecks. Dedicated compute resources also mean workflows run reliably, and troubleshooting is far simpler if problems arise.
The integration with SageMaker was a game changer. Data lab outputs are now available directly within the platform, removing manual steps and making it easier to run advanced analytics and machine learning workloads. These improvements have created a unified environment that empowers our teams to experiment, deploy, and scale AI with ease. It helps us build a culture of AI innovation where our people are closer to the data and scaling AI with confidence.
So we've achieved a lot in the past 18 months. We built a strategic data mesh ecosystem in the AWS cloud. We migrated decades of investment to that cloud ecosystem and onboarded data engineers and data scientists to AWS and innovative tooling. But what did we learn along the way?
We could not scale data and AI with a centralized operating model and monolithic platforms. We needed to train our people to understand the new federated operating model and allow time for engineers and data scientists to become certified in the new tooling. MVPs, steel threads, and iteration are our path to discovery and in fact the fastest way to true value outcomes for our customers. Importantly, we needed to recognize that transformation is always a learning curve, which means being comfortable with the unknown.
Looking ahead, we'll continue onboarding our lines of business and maturing with them, expanding generative AI, Amazon SageMaker Unified Studio including Amazon Q and agentic AI, and deepening lineage, explainability, and observability. Next, Praveen will show you how this unified experience works end-to-end in Amazon SageMaker Unified Studio for the CommBank data marketplace so you can see the architecture come to life.
Platform Architecture: From Monolithic Hadoop to Federated Cloud-Native Design
Thank you, Terri. It's been a pleasure to collaborate and work alongside the CommBank data team on this very exciting data transformation program. I strongly believe that the team has built a highly scalable and future-fit modern data platform that is going to help accelerate time to insights, increase organizational agility, and at the same time meet all of the strict governance and regulatory obligations.
My name is Praveen Kumar. I'm a Principal AI Solutions Architect at AWS. Over the next 20 minutes or so, I'm going to take you through the CommBank data platform, how we designed it, the platform architecture, and I'll also have a short demo to show you an end-to-end user experience in a multi-account environment, and that's how the CommBank data platform is set up. But before that, I would like to acknowledge the CommBank data team and particularly Olatunde Baruwa, who's a Chief Engineer at CommBank. They have been instrumental in the implementation and the execution of this data platform.
As Terri shared, the data transformation program included two major streams of work. The first was about migrating the on-premise Hadoop-based data lake platform, with over 10 petabytes of data, to AWS. The second involved setting up an internal data marketplace aligned with the target-state federated operating model, where each line of business (there are 40 of them) acts as a data producer, a data consumer, or both, and the central team provides lightweight governance while facilitating the marketplace ecosystem.
So let's look at the migration program first. What you see here is the high-level design and the key components of the on-premise platform before and after migration.
On the left is the setup before the migration. The on-premise platform was made up of two large-scale Hadoop clusters. One of the clusters was used to run 61,000 pipelines, and then data and metadata was copied to the other Hadoop cluster to serve interactive workloads. When this platform was moved to the cloud, it was mainly done through a lift and shift approach. Data was migrated to Amazon S3, and compute is now powered by Amazon EMR.
However, one of the key benefits when migrating Hadoop-based data lake platforms to the cloud is that there is separation of storage and compute. Because all of the data is in Amazon S3, the team is able to enable access of this data to a variety of cloud-native analytics and machine learning engines to support diverse sets of use cases. Next is federation. Previously, the team built, managed, and supported an on-premise data lake platform that was monolithic.
However, in the federated setup, each line of business (and there are 40 of them) has its own dedicated AWS accounts. They use Amazon S3 to store their data in Apache Iceberg format. They build pipelines using Amazon EMR Serverless, Amazon Redshift, or other third-party engines. They use AWS Glue to store technical metadata, AWS Lake Formation for policy storage, and so on. But more importantly, the lines of business are now responsible for building and managing their own data assets, their own data pipelines, and their orchestration. They're responsible for the data quality of these assets, service level agreements, observability, and more.
Amazon SageMaker Unified Studio: Creating an Abstraction Layer for Enterprise-Wide Integration
As the platform evolved from a monolithic unit to a distributed and composable construct, we needed to build an abstraction layer that would bring all of these lines of business together. There are two key requirements. The first one was to build an enterprise-wide business data catalog where data producers can publish data assets and data products, and data consumers can search, discover these assets, and request access to these assets. The second key requirement was to provide an integrated builder experience for all data users, users such as data engineers, data analysts, and data scientists, to use the user interface of their choice. This could be notebooks, query editors, or visual ETL. They would also be able to leverage a variety of compute engines that are optimized for their use cases.
To support these two key requirements, the team onboarded Amazon SageMaker Unified Studio. Amazon SageMaker Unified Studio provides an integrated developer experience for all data users to build data and AI-driven applications. It has governance built in with Amazon SageMaker Catalog. The teams were also able to leverage many of the constructs within Amazon SageMaker Unified Studio for broader governance. For example, they use domain units to replicate organizational hierarchy and implement governance policies. They used projects to isolate workloads and map to hundreds of AWS accounts, and they use the pluggable model within Amazon SageMaker Unified Studio to integrate with third-party engines.
Earlier this year, we migrated the on-premise Hadoop-based data lake platform to AWS. This holds almost all of Commonwealth Bank of Australia's data; we are talking about tens of thousands of tables. However, it is monolithic in nature. In parallel, we started onboarding various lines of business to the internal data marketplace, and that platform is federated and distributed in nature. The next challenge was how to bring them together. We did that in two steps.
The first was to identify the line of business owner for each table that has been migrated to AWS within the data marketplace. The idea was to provide the necessary governance. For example, a business owner is now responsible for approving the usage and access of that table or data asset.
The second step was to enable the technical integration, and this goes back to my earlier point. Because all of the migrated data is in Amazon S3, we built a lightweight ETL job that replicates metadata from RDS Hive Metastore to Glue Data Catalog. So you have all of the tables now appearing in Glue Data Catalog, and then we mapped each of those tables to the respective SageMaker project under their LOB. Now, since all of the data is available in the centralized SageMaker data catalog, data consumers can come to a central place, which is the studio. They can browse, discover, and request access, and once the access is approved, they can build their applications through a query-in-place architecture across hundreds of AWS accounts. So there is no data movement. Data can be living in many of these AWS accounts, but we're able to use compute from any of these accounts with the right governance.
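The replication job is only described at a high level in the session, so the following is a hedged Python sketch of what one sync step might look like, assuming the table definitions have already been read out of the RDS-backed Hive Metastore and just need to be registered in the Glue Data Catalog with boto3. The database name, S3 location, and columns are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="ap-southeast-2")

# Assume `hive_tables` was already fetched from the RDS-backed Hive Metastore
# (for example via a JDBC read); the shape below is purely illustrative.
hive_tables = [
    {
        "name": "transactions",
        "location": "s3://example-lob-bucket/transactions/",   # hypothetical bucket
        "columns": [{"Name": "txn_id", "Type": "string"},
                    {"Name": "amount", "Type": "double"}],
    },
]

for t in hive_tables:
    table_input = {
        "Name": t["name"],
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": t["columns"],
            "Location": t["location"],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary":
                          "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    }
    try:
        # Register the table in the Glue Data Catalog, or refresh it if it already exists.
        glue.create_table(DatabaseName="lob_database", TableInput=table_input)
    except glue.exceptions.AlreadyExistsException:
        glue.update_table(DatabaseName="lob_database", TableInput=table_input)
```

Once the tables exist in the Glue Data Catalog, they can be mapped to the owning line of business's SageMaker project as described above.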
Live Demo: Building a Fraud Detection Model in a Multi-Account Environment
All right, so now it's time to see all of this in action. The demo is partly inspired by how Commonwealth Bank's data platform is set up, so we'll be using a multi-account environment. The demo use case is building a fraud detection model for a financial services organization, and we'll have two user personas in this demo.
We have Samantha, or Sam. She's a data engineer and she's part of the retail banking team, so that's a line of business. She's going to create a new table called customer_profile by combining data from two existing tables that her team owns. She's then going to enrich the metadata: she's going to add a README, and she's going to add data quality and lineage information. Once the metadata is enriched, she's going to publish it to the catalog for others to discover and request access. She's also responsible for approving access to the subscription requests on the LOB's behalf.
The second user persona is Javier. He's a data scientist and part of the financial crime team. He's going to build the fraud detection model, and to do that, he needs access to various datasets. So he'll go to the catalog where he can centrally discover all of the assets and request access to this new table that Sam has created. Once the access is approved, Javier is going to do some interactive analysis and then build a machine learning model using SageMaker AI API.
Here is the setup. So this is the multi-account environment setup. At the top, in this demo, I have used three AWS accounts. At the top is the shared services account. So this account is owned by the central team, and this is where the SageMaker domain is set up. And then the SageMaker domain is hooked to two AWS accounts. One is the data producer account on the left, and the other is the data consumer account on the right.
So anything that Sam is using in terms of storage and compute is going to sit in the producer account. When she's building a pipeline using Spark or when storing data in S3, that's all going to be in that leftmost account, which is the producer account. Anything that Javier is going to do, essentially running Athena queries to do exploratory analysis or using SageMaker AI API to train the model, sits in the consumer account on the right. This pattern is scalable to hundreds of AWS accounts, and this provides you with best practices in terms of setting up a multi-account strategy. You get workload isolation, you can allocate costs to respective LOBs, and at the same time, you can implement distinct governance boundaries.
All right, I'll hit play. So I'm in SageMaker Unified Studio, and I'm logged in as Sam. So this is the homepage of SageMaker Unified Studio, and you can see at the top section you have things like discover catalog, and you can play with the foundation models through the generative AI playground. So let's, as one of the first actions, browse the catalog.
So this is the catalog view where you can discover all of the organizational assets. At the top section, you can type in a keyword, it does semantic search, and then it will return the results. So for example, I have searched for a table called transactions, and it showed up.
At the bottom half of the page, you can see the domain unit hierarchy, which maps to your line of business hierarchy. In this case, there are only two LOBs: RetailBanking and FinCrime. At the top is the project, which is a concept where you map a project to a use case. This allows you to group compute and data that a team has access to. In this case, I've got two projects: the RetailBanking data team project that is aligned to the retail banking team, and this is where Sam is going to work. I've also got another project called FinCrime, which Javier has access to, and this is where he's going to use that project to train his ML model.
Sam is going to click on that project. This is the project overview homepage. We'll go through the menu options on the left one by one, but at the top of this page, you can see the various project files. This could be your notebook files, query files, and so on. An important point to note is this project role ARN on the right side, which I have highlighted. I can see that there is an account number here, which is the data producer account. This project role helps map what compute and data access the team has as part of this project. In the case of Javier, this will be a different role and a different account.
That's the project overview homepage. Next, we will click on the data on the left-hand side. This shows all of the datasets that the team has access to. Any Glue tables will show up here. In this case, there are three Glue tables that the team has already created. If you have access to Redshift data, that will show up here as well, along with any S3 buckets.
Then there is the Compute tab, which is where all of the compute will show up. In this case, the team has access to Redshift compute as well as Glue Spark. If you have access to other compute like SageMaker HyperPod or an MLflow Tracking Server, that will appear here as well. The Members section shows all of the team members who are part of this project, essentially the members of the retail banking team.
On the bottom half of the left-hand side is the project catalog. This is where you build your catalog. You bring your table from Glue Data Catalog to SageMaker catalog inventory space, and this is where you enrich the metadata. You can add glossary terms, metadata forms, data quality information, and I'll show you this in a while. For example, this team has already got access to three tables that they have enriched, and this is showing up as customers, customer_accounts, and transactions.
The next step from here is, let's say Sam wants to build a new table. The team already has access to two tables: customers and customer_accounts. Now she wants to build a new table called customer_profile. She comes to the Jupyter notebook and sets some Glue Spark configuration, like the number of workers. She runs some SELECT statements on customer_accounts and customers. At the bottom, there's a CREATE TABLE AS cell, where she combines data from the customers and customer_accounts tables and creates a new customer_profile table.
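The notebook cell itself isn't shown on screen, so here is a hedged sketch of what that CREATE TABLE AS step might look like in the Glue Spark session. Only the three table names come from the demo; the database name, join key, and selected columns are assumptions.

```python
# Hedged sketch of the CREATE TABLE AS cell from the demo.
# `spark` is the session provided by the Glue Spark compute in the notebook;
# USING iceberg assumes the session is configured for Iceberg, as the LOB data format suggests.
spark.sql("""
    CREATE TABLE retail_banking.customer_profile
    USING iceberg
    AS
    SELECT c.customer_id,
           c.first_name,
           c.last_name,
           c.email,
           a.account_id,
           a.account_type,
           a.balance
    FROM retail_banking.customers c
    JOIN retail_banking.customer_accounts a
      ON c.customer_id = a.customer_id
""")
```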
Once that cell runs, you'll see on the left-hand side that a fourth table will appear called customer_profile. That cell has executed, and now I can see the fourth table. I've created a new table by combining information from other tables that my team owns.
The next step is that I want to enrich the metadata of this table. I come to the data source; I've already created a data source that points to the Glue Data Catalog, and I'll bring the metadata into the inventory space. I have now run that data source, and I can see the new table appearing here. This is where you can add things like a README. You can look at the schema details, so all of the columns will appear here. Data quality is blank, and we'll populate it in a while. It also shows you the lineage. This lineage is captured as we build the pipeline: Glue Spark is already integrated with the SageMaker catalog, so it pulls all of the lineage information when we ingest the metadata of that table. It shows you column-level lineage, and this is OpenLineage-compatible.
From here, let's say I want to add a README, which is essentially looking at the schema details of this table and looking at a few records of this table to generate a summary. Now you could do this in a couple of ways. You could have your steward manually doing this task, or you can take advantage of agentic AI. When you go to the Jupyter Lab, we have this agentic AI capability, which is Q CLI integrated.
I'm giving the agent a prompt saying that I have a table named customer_profile in my AWS Glue Data Catalog: can you generate a concise summary of this dataset using the schema details and by looking at ten records within the table? I then provide a helper function with details such as using the permissions scoped to the project, which is the project role, as well as the Athena workgroup assigned to that project. So I give that instruction. This is where the agent plans out the set of steps and then executes. It calls Glue APIs to find a matching table and is able to detect it. Then it uses Athena as the engine to run a SELECT * statement. This takes a few seconds, but once it completes, it will save the concise description into a local file within the space. If you want to look at the details, such as what query it has generated, you can just expand one of the dropdowns and look at the query. Within a few seconds, you'll see that it has created a directory, and now it's creating a file and saving it. It has created the file and saved the concise description based on the schema and the ten records.
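The helper function handed to the agent isn't shown in full, so here is a hedged approximation of what it might contain: a small boto3 wrapper that runs a query in the project's Athena workgroup and returns the rows. The workgroup name and the sample query are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

def run_athena_query(sql: str, workgroup: str = "retail-banking-project-wg"):
    """Run a query in the project's Athena workgroup and return the raw rows.

    The workgroup name is a placeholder; in the demo it is scoped to the project role,
    and the workgroup is assumed to have a query result location configured.
    """
    qid = athena.start_query_execution(
        QueryString=sql,
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

# For example, the ten sample records the agent inspects:
rows = run_athena_query("SELECT * FROM retail_banking.customer_profile LIMIT 10")
```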
Next I want the agent to use the information it just generated and update the data asset within the SageMaker Catalog with it, in other words, update the README. I'm saying: read the dataset description file generated above as part of the analysis of the specified table, format the content, and update the README section of the data asset. And I provide a helper function with an example API. It then compiles the necessary structure and self-corrects if a call fails, and again, within a few seconds, it is able to create an asset revision within the SageMaker Catalog for this asset. It has completed that activity now. If I go here and refresh, I will see that the README section is populated, and it is fairly accurate in terms of what the dataset is about.
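The exact call the agent makes isn't shown, but the narration mentions creating an asset revision in the SageMaker Catalog, which corresponds to Amazon DataZone's CreateAssetRevision operation. Here is a rough boto3 sketch under that assumption; the domain and asset identifiers are placeholders, and mapping the README content onto the asset description is itself an assumption.

```python
import boto3

datazone = boto3.client("datazone", region_name="ap-southeast-2")

# Placeholders: the real identifiers come from the project context in the demo.
DOMAIN_ID = "dzd_exampledomain"
ASSET_ID = "asset-customer-profile-id"

# File written by the agent in the earlier step (name assumed).
with open("customer_profile_summary.md") as f:
    summary = f.read()

# Create a new revision of the asset carrying the generated summary.
# Whether the README maps to `description` or to a metadata form is an assumption here.
datazone.create_asset_revision(
    domainIdentifier=DOMAIN_ID,
    identifier=ASSET_ID,
    name="customer_profile",
    description=summary,
)
```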
So far we have enriched the dataset with lineage and a README. Next, if you want to run data quality checks, that's another prompt you can run. I'm saying that I have a dataset and I want to run data quality checks for this table: I would like you to create two data quality rules, checking for null values in each column and validating emails, and use the Glue Spark engine to do that. Again, it goes and plans, and it uses the Glue interactive session in the context of the project. It executes the Spark SQL. Right now the data quality is null, but once this finishes running the data quality checks, you'll see that data quality gets populated. And here it has finished running the data quality checks and is saving the results into a file in the local directory.
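The two rules in the prompt are simple enough to sketch directly. Here is roughly what the generated checks might look like as plain Spark SQL in the same Glue session; the email regular expression, the columns checked, and the pass/fail structure are assumptions.

```python
# Hedged sketch of the two data quality rules from the prompt, expressed as Spark SQL
# against the demo table. `spark` is the Glue interactive session from the project.
null_counts = spark.sql("""
    SELECT
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
        SUM(CASE WHEN email       IS NULL THEN 1 ELSE 0 END) AS null_email,
        COUNT(*)                                              AS total_rows
    FROM retail_banking.customer_profile
""").collect()[0]

invalid_emails = spark.sql("""
    SELECT COUNT(*) AS bad_emails
    FROM retail_banking.customer_profile
    WHERE email IS NOT NULL
      AND NOT email RLIKE '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\\\.[A-Za-z]{2,}$'
""").collect()[0]["bad_emails"]

# Summarize the rule outcomes; in the demo the agent writes a similar summary to a local file.
results = {
    "null_check_passed": null_counts["null_customer_id"] == 0 and null_counts["null_email"] == 0,
    "email_check_passed": invalid_emails == 0,
}
print(results)
```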
And so one last prompt I'm entering is, now take this information that I just generated in terms of data quality results and populate the data quality section within the SageMaker Catalog. And this is all running in context of the project. So the scope boundary is that project role that I showed you. So this has completed updating the data quality rules. And you can see here again within a minute or two, you have all of the data quality results appearing in the SageMaker Catalog. So now Sam is happy with all of the enrichment. She has updated README, lineage, and data quality. So she's going to publish this asset, which makes it discoverable to anyone in the organization who has access to SageMaker Unified Studio. And this is that single pane that we talked about.
So this customer_profile table is appearing now. I'm logged in as Javier in a different browser. So you can see it's a different user. And this is the project overview homepage and the project selected here is FinCrime. So it's a different project. You'll see here the project role is a different role and the account number is that data consumer account that I talked about earlier. Javier and his team have access to a different set of tables. So in this case, he has access to a table called transactions. And he needs more datasets to be able to build his machine learning model. So he goes and types for any information related to customer_profile.
He can see that there is a table that has been created and that exists in the catalog. He can analyze all of the metadata, including column details, data quality, and lineage. He can also look at who created this and which team owns it. If he's happy with all of the information, he requests access to this table, which creates a subscription request.
This request goes to Sam, who is the owner of the table. Sam can see there is a subscription request, and she goes and reviews it where she can see details like who has created it and which team it is coming from. If she's happy with all of the details, she goes ahead and approves. At this point, Amazon SageMaker takes care of the fulfillment.
Even though the data is located in a different account and the compute is in a different account, it goes and executes the AWS Lake Formation API to do the necessary policy configuration. Now Javier, once he refreshes his technical data catalog, will see the new table appear. He has access to the customer_profile table, and he'll run a quick SELECT * query. This query, which he runs using Amazon Athena, executes in the data consumer account. He's happy with all of the results.
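SageMaker performs that cross-account fulfillment automatically, but for intuition, here is roughly what the underlying AWS Lake Formation grant looks like when expressed directly with boto3. The account IDs, role name, and database name are placeholders.

```python
import boto3

# Run in the data producer (grantor) account. All identifiers are placeholders;
# in the demo, SageMaker Unified Studio issues the equivalent calls automatically.
lakeformation = boto3.client("lakeformation", region_name="ap-southeast-2")

lakeformation.grant_permissions(
    Principal={
        # Project role in the data consumer account (Javier's FinCrime project).
        "DataLakePrincipal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::222222222222:role/fincrime-project-role"
        }
    },
    Resource={
        "Table": {
            "CatalogId": "111111111111",          # producer account that owns the catalog entry
            "DatabaseName": "retail_banking",
            "Name": "customer_profile",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```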
One final step is to build the machine learning model. Javier goes to JupyterLab, where I have pre-created a notebook with the help of agentic AI. He runs an Athena query for feature engineering, preprocesses and prepares the data, uploads it to Amazon S3, and then trains the model using the Amazon SageMaker AI API. This takes a few minutes to run, so we probably won't wait until it finishes, but this is the last step in the process. That's the end of the demo. Let's head back to the presentation now.
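The training notebook isn't shown in detail, so here is a hedged outline of that final step using the SageMaker Python SDK with the built-in XGBoost image. The S3 bucket, instance type, hyperparameters, and data layout are assumptions, not details from the demo.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()          # project execution role in the consumer account
region = session.boto_region_name

# Built-in XGBoost container for a tabular fraud-classification model.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-fincrime-bucket/fraud-model/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# CSV prepared by the feature-engineering step, label in the first column (assumed layout).
train_input = TrainingInput(
    "s3://example-fincrime-bucket/fraud-model/train/", content_type="text/csv"
)
estimator.fit({"train": train_input})
```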
Key Takeaways: Building Strong Data Foundations for AI Success
All right. Now before I wrap up, I'd like to emphasize a few key points that we covered in our session today. First, building data and AI culture takes time, so it's crucial that you have executive buy-in and that you stay invested. Second, as you embark on your data transformation journey, it's important to start with steel thread use cases because not only does it provide immediate business value, but it also helps with faster iteration and feedback loops.
Third, your data is your unique differentiator, and so it's very important that you have a strong data foundation first in order to get the maximum value from your analytics and AI initiatives. Finally, as you saw in the case of Commonwealth Bank, you can leverage Amazon SageMaker for your data transformation journey, and we at AWS are here to help.
With that, I'd like to wrap up our session today. I'd like to thank Terri and Raghu again for co-presenting, and each one of you for attending our session today. I would really appreciate it if you could fill out the feedback survey, which should be on the mobile app. Thank you, and have a great rest of AWS re:Invent.
; This article is entirely auto-generated using Amazon Bedrock.