🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Data Processing architectures for building AI solutions (ANT328)
In this video, Sakti Mishra and Radhika Ravirala discuss transforming data architectures for AI readiness. They address the challenge that 89% of CDOs prioritize generative AI but over half feel their data foundations aren't ready. The session covers unlocking enterprise data for AI agents through Retrieval Augmented Generation (RAG) and Model Context Protocol (MCP) servers, demonstrating how to expose data lakes and warehouses to AI assistants. Key demos showcase auto-generating Spark code and visual ETL pipelines using Amazon Q Developer in SageMaker Unified Studio. Radhika details AWS enhancements including Trusted Identity Propagation for single sign-on, Lake Formation fine-grained access control, S3 Access Grants, SageMaker Notebooks with Spark Connect, and the Spark Upgrade Agent that migrates applications from version 2.4 to 3.5. The presentation emphasizes identity-based access control, integrated AI/ML environments, and productivity improvements in data processing engines like Amazon EMR, AWS Glue, and Amazon Athena.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Data Foundation Gap for AI Initiatives
Hello, all. First of all, thank you for joining us. I hope all of you are enjoying re:Invent. My name is Sakti Mishra. I work as a Principal Data and AI Solutions Architect, and I'm joined by Radhika Ravirala, who is a Principal Product Manager for Amazon EMR. Today, we are going to talk about data processing architectures that can help you build AI solutions.
Before getting started, let's start with a small quiz, right? Please raise your hand if you feel your organization's data foundation is ready for AI. Okay, I see very few hands raised. I'm sure all of you can relate to this topic. Let's get started.
So this is a very high-level agenda that we will cover. First, we will try to highlight what is the role of data in AI. Then we will try to revisit what are the foundational pillars for a modern data foundation. Then we'll go deeper into how do you unlock your existing enterprise data for your AI agents. And then Radhika will go deeper into what are the enhancements we have done to our AWS data processing engines that are related to security and productivity.
According to Harvard Business Review, 89% of CDOs are prioritizing generative AI initiatives in their organizations, but more than half of those interviewed feel the data foundation they have is not ready for AI yet. Now, what does that mean? It means that, as data and analytics practitioners who want to transform the AI landscape, we need to take a step back and take a fresh look at the foundational decisions that underpin our AI strategy.
Agentic AI Use Case: HR Onboarding System and Data Requirements
Now, let's understand the role of data in AI with a real-world use case. Let's assume you have an agentic AI HR onboarding system where you are sending a message saying, we have just hired Amy, please start onboarding. And with that, a network of agents are activated through a supervisor agent or through an orchestration agent. The first agent is a task planning agent, which is building a tailored onboarding plan for Amy. The second one is an onboarding buddy agent, which recommends Amy to meet the coworkers based on her profile, role, and background. And the third one is the exception handling agent, which gets kicked in if something doesn't go as planned. For example, maybe Amy's joining date is delayed, or maybe Amy's device gets delayed. Then the exception handling agent can rerun approvals, update the agents to make sure everything is going as per plan. For the HR onboarding agent to work effectively, it needs to get access to the structured datasets, unstructured datasets, vector stores, and real-time streaming datasets with the right level of security and governance.
Now let's understand what is holding our customers back. We can think of three top priorities. The first one is related to people and the roles they play. We have different engineering roles in the organization, but with the AI landscape, the lines between these roles are starting to blur, and data engineers, machine learning engineers, data scientists, and AI engineers all need to work closely together to build AI solutions. The second is that customers have years of investment in the data platform they have today, and they need to be agile to make it AI ready, but it's not easy to replace what they have. So how do we enable them to extend what they have and expose it to AI agents in an efficient way? And the third is that we have a traditional way of data processing that takes your data from the raw layer to the lakehouse through a series of transformations. How do you bring efficiency into the data processing layer itself to improve productivity?
Foundational Pillars of Modern Data Architecture with Amazon SageMaker Unified Studio
So before we go deeper into the evolution of data architectures for AI, let's try to revisit what are the foundational pillars for the modern data foundation. So first of all, we know that we onboard data sources from multiple sources. That includes structured, semi-structured, and unstructured datasets. Once you have the datasets available, it goes through data cleansing, data enrichment, and a series of transformations. Now, what is changing in this AI era is basically in the data cleansing and transformation layer, how can you bring in AI that can fast track your data pipeline development and can accelerate efficiency?
Now once you have the data transformed, you need a metadata layer that includes both technical and business metadata. And once you have the metadata available, you can bring in multiple use cases for end users who can find, share, understand, access, and act on the data to build multiple analytics and AI use cases.
Now let's look at a high-level reference architecture that highlights the modern data foundation that is built on Amazon SageMaker Unified Studio. So as you see in the center of this architecture, we have the storage layer that includes a lakehouse, which is a data lake built on Amazon S3 or S3 tables, or a data warehouse or a data mart built on Amazon Redshift.
Now once you have the lakehouse layer, you need to onboard datasets through different mechanisms. The first one can be a batch ingestion mechanism where you have scheduled jobs to pull datasets from multiple sources, or you can have a real-time streaming pipeline that can integrate Amazon MSK or maybe Kinesis. And then you can have a Zero-ETL mechanism, which was a new feature we announced earlier, where you can onboard multiple AWS data sources as well as non-AWS data sources. But you also might have a scenario where you do not want to copy the data to the AWS landscape. Rather, you want to query the data or the subset of data through a live query mechanism where you can use query federation.
Now once you have the data in the lakehouse, the next thing that comes is a unified catalog where you can integrate Amazon SageMaker Catalog that provides you both technical and business catalog capabilities, and using that, you can also bring in multiple governance capabilities that include data quality, data lineage, data sharing, and more. Now you have the data platform ready, so you can bring in multiple analytics and AI use cases. When we talk about analytics, you can integrate maybe Amazon QuickSight for business intelligence use cases, or you can integrate AWS Glue or Amazon EMR to further transform the data for your downstream systems, or you can integrate Amazon OpenSearch for search analytics, or you can integrate generative AI application development using Amazon Bedrock.
Transforming Data Layers for AI: Two Key Approaches
So now we have a strong data foundation, right? How do we extend or improve on top of this to make it AI-ready? So when we talk about transforming your data layer for AI, we want to categorize that into two parts. The first part is, let's assume you have enterprise data now. You want to expose that to AI agents. There can be multiple mechanisms, right? One of them can be you're exposing a subset of data through API gateways. Second can be you are trying to convert a subset of your data as vectors and making those vector stores available as a knowledge base so that your retrieval augmented generation architectures will work.
Or the other way can be you have data lakes and data warehouses available, and you want to expose that as a tool to the AI agents so that they can invoke to run queries or maybe execute jobs. So we will talk about MCP and RAG in detail in future slides. I have a demo also. Now, let's talk about the second part. So we talked about you already have enterprise data and how are you exposing it to AI agents. The second part is, before the data arrives at your lakehouse, how do you bring in efficiency in designing your data pipeline itself?
Retrieval Augmented Generation (RAG): Real-Time Data Integration for AI Models
How can you auto-generate Spark codes? How can you auto-generate your visual ETL pipelines? How can you automate your data quality or data lineage jobs? So let's try to touch base on the first part, which is basically unlocking your existing enterprise data for AI agents. To understand that, let's go through a real-world use case. Let's assume you have a customer AI chat assistant that is getting built by a customer service team of a bank. They have fine-tuned their model based on their previous data, and the end user comes and the end user asks, what is today's thirty-year fixed mortgage rate?
Now, the AI model was trained some time back. It does not have the latest information, and it gives the outdated information confidently, saying today's mortgage rate is 6.7%. Now, what if you have a way you can augment the existing AI model with the real-time data so that it can use and answer the correct information? When you are trying to augment the data, maybe you can have a real-time pipeline also which can update the vector store or the knowledge base as and when the mortgage rate changes.
Now if we are able to build that, the AI model now will be able to answer as of today, December 4, 2025, our thirty-year fixed mortgage rate is 6.25% for qualified borrowers, and this rate was updated this morning and may change based on your credit score and down payment. Now augmenting the AI model with additional information is called retrieval augmented generation. Now, let's understand what are its benefits.
As I explained, RAG helps build or augment your prompts with the latest information, and it provides several benefits. One of them is improved accuracy, which is what we talked about. The second one is reducing hallucinations. Sometimes AI models, when they do not have the actual information, try to give wrong information. By providing the latest information, you are reducing that risk.
The next benefit is that you can bring in flexibility for domain adaptation. Let's assume you are trying to build a use case for the financial industry. You do not have an AI model which is specifically designed for the financial industry, so you create a knowledge base with financial industry data so that the model can use that financial data to give domain-specific answers.
Now that we have understood the benefits, let's look at a technical architecture that is built on the AWS ecosystem that implements RAG. This is a reference architecture. As you can see, we have integrated Amazon Bedrock here, which is a fully managed service that offers a choice of high-performing models using which you can build and deploy agents. It also provides additional capabilities such as Amazon Bedrock knowledge bases, guardrails, and more.
It offers native integration with several AWS vector stores, including Amazon OpenSearch and S3 vectors, and also several non-AWS vector stores such as Redis Enterprise Cloud, MongoDB, and more. For our specific use case we talked about, we are highlighting that we are integrating Amazon OpenSearch or S3 vector as a knowledge base behind the Amazon Bedrock knowledge base, but you do have the option to select either of them depending on the latency requirements you have.
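To make the query path concrete, here is a minimal sketch of calling a Bedrock knowledge base through the RetrieveAndGenerate API; the knowledge base ID and model ARN are placeholders, not values from the session.

```python
import boto3

# Minimal sketch of querying a Bedrock knowledge base (the RAG query path).
# The knowledge base ID and model ARN below are placeholders for illustration.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is today's thirty-year fixed mortgage rate?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        },
    },
)
print(response["output"]["text"])
```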
Now, as you can see, we have the knowledge base behind Amazon Bedrock, backed by Amazon OpenSearch or S3 Vectors. The question is, how do you keep it up to date when a mortgage rate changes? As you can see in the architecture, steps one to three show that you have an upstream application. The upstream application receives events as and when a mortgage rate changes, and it pushes them to Amazon MSK.
MSK receives that as an event, and then you might have a Spark streaming job running in Amazon EMR that acts as a stream consumer. It processes that incremental data and then converts that to vectors by invoking a vector embedding model from Bedrock, and then stores those vectors finally in the knowledge base. The next time when a user query comes through step four, the knowledge base will be able to augment with the latest information, and the model will be able to give correct answers.
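A rough sketch of that ingestion path (steps one to three) is shown below, assuming the rate-change events arrive as plain text on an MSK topic and the knowledge base is backed by an OpenSearch k-NN index; the topic, endpoint, index, and model names are placeholders rather than the exact setup used in the demo.

```python
import json
import boto3
from opensearchpy import OpenSearch
from pyspark.sql import SparkSession

# Sketch: consume rate-change events from MSK, embed them with a Bedrock
# embedding model, and index the vectors into OpenSearch. Names are placeholders.
spark = SparkSession.builder.appName("mortgage-rate-vector-ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<msk-bootstrap-servers>")
    .option("subscribe", "mortgage-rate-updates")
    .load()
    .selectExpr("CAST(value AS STRING) AS body")
)

def index_batch(batch_df, batch_id):
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    os_client = OpenSearch(hosts=["<opensearch-endpoint>"])
    for row in batch_df.collect():  # acceptable for low-volume rate updates
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({"inputText": row.body}),
        )
        embedding = json.loads(resp["body"].read())["embedding"]
        os_client.index(
            index="mortgage-rates",
            body={"text": row.body, "embedding": embedding},
        )

events.writeStream.foreachBatch(index_batch).start().awaitTermination()
```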
Model Context Protocol (MCP) Servers: Connecting AI Agents to Enterprise Data
So let's expand our use case now. The AI model has answered the end user, saying the current mortgage rate is 6.25%, and it may change depending on your credit history and the down payment you're paying. Now, the credit history might be in your enterprise data, stored in your data lake or data warehouse. How do you expose your AI model to get access to your data lake or data warehouse? That's where we are introducing a concept called Model Context Protocol servers, or MCP servers.
These MCP servers will be registered as tools for your AI agents, and the AI agents will invoke the execution of SQL queries or maybe jobs that will get executed in your data lake and data warehouse. It will get the response back and answer the end user query. Let's understand what Model Context Protocol is and what its benefits are.
Model Context Protocol is an open source standard developed to allow AI assistants like Amazon Q or Kiro to get access to real-time data. In recent times, we have released a lot of MCP servers for data analytics services, including data processing MCP servers. That includes AWS Glue, Amazon EMR, and Amazon Athena.
It provides several benefits. As you can see, three services are integrated into a single MCP server, which provides you a single API for integration. It reduces your integration complexity and accelerates your development. It also provides AI-driven insights to optimize your data pipeline performance. In addition, it provides a simplified way for observability and cost tracking for your data processing services.
Now let's look at a high-level architecture for MCP. As you can see here, the end user interacts with the MCP host. The MCP host can be the Amazon Q Developer CLI, the Kiro CLI, Claude Desktop, or a custom agent that you have built. The agent interacts with the MCP server through a client. There can be many MCP servers; I talked about data processing MCP servers.
Similarly, if you are trying to interact with Redshift, you have Redshift MCP servers. If you want to interact with S3 tables, you have S3 tables MCP servers. But beyond that, you do have flexibility to integrate custom MCP servers on other services. Now, this architecture highlights some of the AWS analytics services, as I said, and it's not limited to analytics services. You will also be able to invoke third-party models through APIs by defining MCP servers.
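As an illustration, registering such a server with an MCP host typically comes down to a small configuration entry like the one written below; the server package name and arguments are assumptions based on the publicly available awslabs MCP servers and may differ from what the session used.

```python
import json

# Hypothetical MCP configuration registering an AWS data processing MCP server
# for an MCP host such as the Amazon Q Developer CLI. The package name is an
# assumption; check the awslabs MCP server documentation for the exact value.
mcp_config = {
    "mcpServers": {
        "aws-dataprocessing": {
            "command": "uvx",
            "args": ["awslabs.aws-dataprocessing-mcp-server@latest"],
            "env": {"AWS_REGION": "us-east-1"},
        }
    }
}

with open("mcp.json", "w") as f:
    json.dump(mcp_config, f, indent=2)
```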
Improving Productivity with Amazon Q Developer and Visual ETL
So if you remember in the previous slide, we categorized the data transformation for AI into two categories. One was you already have enterprise data, you are exposing to AI agents, and we talked about how you can do that through MCPs and RAG. Now, the second part is while building the data pipeline itself, how do you improve productivity? So let's touch base on that. We have many use cases on how you can improve productivity. In the interest of time, I'll just touch base on two of them. One is auto-generating code or getting code suggestions through a single prompt, and the second one is building a data pipeline with visual ETL. Let's try to look at both of them.
So the first one I talked about is a low-code, no-code experience powered by Amazon Q Developer. Amazon Q Developer is the most capable generative AI assistant, integrated into the software development workflow and the studios that data practitioners use every day. You can ask Q questions in natural language and get answers related to AWS service features, best practices, and technical architectures, and it can also help you troubleshoot query failures and Spark job failures. This is a screenshot of the Q chat assistant integrated into the Jupyter Lab notebook in SageMaker Unified Studio. Radhika will go deeper into this.
This is the second use case I highlighted where you are trying to build a data pipeline. You do have the option to use Python-based DAG design, or you can use drag-and-drop design in an interface, but this AI chat assistant is helping you to give a quick start, where you are giving a prompt that says, this is my source, this is my target, and these are the transformations I want, and with a single prompt, you will be able to see that it is automatically generating an ETL pipeline for you. Let's look at a demo for both of them.
Live Demonstrations: Data Processing MCP Server and Auto-Generated Visual ETL Pipelines
So the first demo I'm going to show is basically the data processing MCP server. If you remember, I highlighted it can be added as a tool on the Amazon Q chat assistant. The left side of the screen is highlighting a form where we are specifying the data processing server parameters, which you will be adding. On the right side is a prompt we are actually going to give to the chat assistant to create a notebook for us. Let's look at the demo.
So for this demo, I am using a diabetes dataset. I have already downloaded it to my local system, and I have already created a SageMaker Unified Studio domain named corporate. Let's click the domain URL, which takes us to the SageMaker Unified Studio portal. In the portal, we select the project where we want to onboard the data, then we navigate to the Data section. We click Add, then Create table. I browse to the CSV file on my local system, it auto-detects the CSV format and the table name, and then we click Next.
It scans the data and automatically derives the schema. It gives you a preview of the schema, which you can verify, and then you click Create Table. Once the table is created, you can refresh your catalog: you expand the lakehouse, go to the AWS Data Catalog, and see a database called sales marketing with the diabetes data table created under it. When you click preview on the data, an Athena query is executed, and it gives you a sample of 10 records.
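For reference, the preview behind that click is just an Athena query. A rough equivalent using the Athena API is sketched below; the database name, table name, and results location are assumptions based on the demo and will differ in your environment.

```python
import time
import boto3

# Rough equivalent of the "preview" click: run a LIMIT 10 query through Athena.
# Database, table, and output location are assumptions based on the demo.
athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString="SELECT * FROM diabetes_data LIMIT 10",
    QueryExecutionContext={"Database": "sales_marketing"},
    ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/previews/"},
)

query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```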
Once you verify that your data is onboarded, you can navigate to the Jupyter Lab notebook through a menu under Build. In the Jupyter Lab notebook, we have the Q chat assistant, and at the top it gives you an option to configure MCP servers. We click that, then click the Add icon, which gives us the form I showed on the previous slide for adding the MCP server. We specify the name, the command, and the argument, which pulls the latest version of the server. When you click Save, it activates the MCP server. It also lists the tools it exposes, which you can verify, such as listing Glue databases, Glue tables, or connections.
Once you verify that, if you go back, you will see the data processing MCP server is already added as a tool.
Now we can close that and go to the chat assistant, and let's copy the prompt I have shown to execute. The prompt says, I have a sample notebook. Use the notebook, reference the diabetes data, and try to use the MCP tool to generate a new notebook for me.
Now the Q chat assistant scans the prompt you have given. It breaks it into multiple tasks, and at the bottom you will see it going step by step. First, it lists your local directory because it needs to create a notebook there. Then it tries to list Glue connections; it asks for human permission, and you click Run. After that, it asks for permission to list the AWS Glue databases, and when you approve that, it lists the databases.
Within those databases, it tries to find whether the diabetes data table is available. Once it finds it, it asks permission to create a directory in the local file system because it needs to create the notebook. It creates the notebook, then the README.md file and the requirements.txt file. After it has created everything, it summarizes what it has done: the project structure, the activities, and the code it has generated.
Once you verify everything, you can navigate to the local directory to confirm the notebook is available, and then execute each cell to see the output. Now, let's look at the second demo, where we want to highlight that you can auto-generate a visual ETL pipeline by giving a simple prompt. The top part is the prompt we will give, and the bottom is what we expect the Q chat assistant to create for us. Let's see the demo.
For this demo, we need a few datasets to be available. They are already created in Amazon S3: a customer behavior CSV and a customer dimension CSV. Once you verify those two CSVs are available, we navigate to Amazon SageMaker Unified Studio. We have the corporate domain already created, and we go to the SageMaker Unified Studio portal. As before, we select the project first, and then under Build, we navigate to Visual ETL jobs.
Once you are on that page, you click Create Visual ETL Job. And you do have an option here to create a pipeline with a drag and drop interface, but we will show the generative AI capability with this prompt. And in this prompt, we are highlighting, use one data set as the customer behavior, the second data set as customer dimension, do a little bit of transformations by typecasting, changing, or renaming columns, then join these two data sets by state. And then create two aggregated columns, that is total purchase amount and total page views, and then finally store that into a new target table.
When you click submit, it will take some time to analyze the prompt and then creates the nodes automatically for you, where you have flexibility to click and edit the nodes. So now the pipeline is created, as you can see, and I explained you can click each node to edit, verify, and update. So we are first verifying this is the customer behavior CSV. We specify the delimiter as comma, and also we specify this CSV has header.
Once you do that, you will see a preview of the dataset. In this preview, you will notice that the page views and purchase amount columns have string type, and we want them to be integer type. So we click the transformation node, which specifically changes the column type. As you can see, we select the source column name and the target column name, and we specify the type as integer.
Now, we will repeat the same step for the purchase amount. As you can see, the page view is now showing integer as the data type, and we will add the same step for the purchase amount field. Now, once we are able to do that, you will see the preview of the data set where the purchase amount is also now showing integer as the data type.
The next step will be joining these two datasets, but before that, we verify the second dataset, the customer dimension CSV. Again, we specify comma as the delimiter and indicate that it has a header, and once you do that, you will see proper column names. Then comes the join step, where we use customer ID as the common column. The next one is aggregation. When you are doing aggregation, you are aggregating by state,
and creating two columns: page views and purchase amount with a sum aggregation function. Once you are able to do that, you have an option to also add additional nodes. For example, if you want to rename these columns, maybe as total page views or total purchase amount, you can just do a plus node after aggregation and then look for the rename column function and apply that. Then finally, you save it and run it as a job.
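For readers who prefer code, here is a rough PySpark equivalent of the pipeline the assistant generated; the S3 paths, column names, and the customer ID join key are assumptions based on the demo narration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Rough PySpark equivalent of the auto-generated visual ETL pipeline.
# Paths, column names, and the customer_id join key are assumptions from the demo.
spark = SparkSession.builder.appName("customer-behavior-etl").getOrCreate()

behavior = (
    spark.read.option("header", True).csv("s3://<bucket>/customer_behavior.csv")
    .withColumn("page_views", F.col("page_views").cast("int"))
    .withColumn("purchase_amount", F.col("purchase_amount").cast("int"))
)
dimension = spark.read.option("header", True).csv("s3://<bucket>/customer_dimension.csv")

result = (
    behavior.join(dimension, on="customer_id", how="inner")
    .groupBy("state")
    .agg(
        F.sum("page_views").alias("total_page_views"),
        F.sum("purchase_amount").alias("total_purchase_amount"),
    )
)

result.write.mode("overwrite").saveAsTable("customer_behavior_by_state")
```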
So I'll summarize the demo here and hand over to Radhika to talk about the enhancements we have done in the AWS data processing engine. Over to you, Radhika.
Product Perspective: Enhancing Data Processing Engines for AI Readiness
Thank you, Sakti. So far you have seen how you can use RAG to strengthen your data architectures and a couple of cool demos. I want to provide a product perspective on how we as a product team are enhancing our data processing engines, which include Amazon EMR, AWS Glue, and Amazon Athena, to make them AI-ready for you to build your own applications.
Before we get there, let me give you a quick recap of how customers are building AI applications. Here you see that the end user interacts with the generative AI application, typically by posing a question. The application then loads the relevant context and the conversation history. Before this can happen, when the user poses a question, the application loads the prompt templates that can be applied to it. It could be one template or multiple templates, and those templates correspond to the task, whether it is Q&A, code completion, summarization, and so on.
Once it's done that, it loads all the conversation history if the user had already been engaged with the generative AI app on this topic. Now once the conversation history is loaded, the question posed by the user is then used to get more additional context based on the user profile. The user himself or herself can have preferences, specific permissions, and other settings that are situational in nature, such as the project that the user belongs to or maybe additional business rules that need to be applied. Now all this state information is stored in our data stores such as DynamoDB, and some of it comes from the data stores that Sakti was talking about earlier.
Once you have loaded your conversational history and you have the relevant context, the application then tokenizes this original question. To tokenize it, it sends it to, for example, Amazon Titan embeddings or OpenAI embeddings. Once the original question is tokenized, it derives the vector representation of that, and then it is sent back to vector stores such as S3 vectors or OpenSearch. Then, with those original question embeddings, the application performs a similarity search in the vector data store, and this is using some form of an approximate nearest neighbor search algorithm.
Once it applies that, it gets a set of results back, and it uses the top-k results, or document chunks, from that search. Once all that data is available, it is combined with the original textual content, and then the data is synthesized and a prompt is engineered. That is the final prompt that will be sent to the LLM. The LLM uses that prompt along with all the embeddings and the original text content, processes the request, and sends back a response.
The response is then sent back to the conversation history. It is sent back to other data stores as relevant, and then finally the response is sent to the end user.
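Putting those steps together, here is a condensed sketch of that query path, assuming Titan embeddings, an OpenSearch k-NN index, and the Bedrock Converse API; the index name, model IDs, and endpoint are placeholders, and conversation history and user context are omitted for brevity.

```python
import json
import boto3
from opensearchpy import OpenSearch

# Condensed sketch of the query path: embed the question, run a k-NN similarity
# search, build a prompt from the top chunks, and call the LLM. Names are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
os_client = OpenSearch(hosts=["<opensearch-endpoint>"])

question = "What is today's thirty-year fixed mortgage rate?"

# 1. Embed the original question.
emb_resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": question}),
)
query_vector = json.loads(emb_resp["body"].read())["embedding"]

# 2. Approximate nearest neighbor search for the top-k document chunks.
hits = os_client.search(
    index="mortgage-rates",
    body={"size": 3, "query": {"knn": {"embedding": {"vector": query_vector, "k": 3}}}},
)["hits"]["hits"]
context = "\n".join(hit["_source"]["text"] for hit in hits)

# 3. Synthesize the final prompt and send it to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
answer = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(answer["output"]["message"]["content"][0]["text"])
```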
So these are all the steps that occur in the end-user path as you run an AI query, right?
So let's take a look at what happens behind the scenes. All the processes that you see on the left-hand side are in the end-user critical path. To support the end-user query, there are a lot of processes happening behind the scenes, and these are the data processing architectures that you have built in order to enable these types of applications. When we ask customers what they feel is required to build such applications and to boost their existing data architectures, both the field and the customers have come up with many things, including the need for high-volume data ingestion and processing.
Now, as you all know, AI workloads typically process petabytes of training data. The AI models are sensitive to data quality issues, and you also require a variety of data that needs to be ingested at that velocity, right? So you have data quality demands, you have variety challenges, you have velocity requirements, and then you also want to apply identity and access controls based on those identities. And this identity can be a user ID, can be a user attribute, or a group association, right?
In addition to that, you want to think about real-time and batch processing. This is critical as events are generated. AI applications are pretty chatty: user chat interactions generate a lot of data, and so do the events coming from your devices, your streams, and so on. You also need an integrated AI/ML environment where you can develop these applications; a lot of data scientists want that. Then you have model training and inference pipeline requirements as well, which include the compute to train models on massive amounts of data and low-latency serving through your inference pipelines.
And then lastly, you also want to build generative AI experiences into the tools and applications that you're building, right? So when we talk to customers, the top three that came from this list are identity and access control, the integrated AI/ML environment, and the building of the generative AI experiences. Now let's understand what that means in the context of the existing architectures that you currently own.
Take the humble data lake. Your organization has been building data lakes for many years now. Data scientists and analysts access raw data in S3, but the environment faces some challenges, right? If you take this example, you have three different users: Al, the data analyst; Joe, the data scientist; and a group of users who belong to a BI group. All these users are essentially building different applications, and they're trying to access the data in your central data governance account. To do that, they're all using different roles. The challenge here is that there is an explosion of roles, because as the number of users increases, you have to add additional roles based on the permission sets those users require.
And as the user base increases, you also have to make sure users can assume multiple roles to access multiple applications. So Joe the data scientist ends up going through several AWS accounts and assuming different roles in order to get access to, let's say, the pink data or the green data. In addition to that, it is very difficult to prove who accessed what data and when, especially for sensitive data.
And so there is an AI readiness barrier here for data scientists, who spend 80% of their time trying to find, access, and prepare data for building models. Now, if you extend the example to transactional data lakes or lakehouses that you're building with your BI applications, more challenges emerge: you need to support open table formats and also manage different catalogs. Maybe it is the Glue Data Catalog, or maybe you want to federate to a third-party catalog like Unity or the Horizon catalog.
And if you go further on to a data mesh environment, again you will see that you have a decentralized ownership scenario where you need federated governance, peer-to-peer sharing, and a self-service architecture in which you can authorize access to users in this complex, domain-specific environment.
Right, so a few themes have emerged looking at the challenges with all the existing data architectures that we have seen so far, and they include identity and access control. Customers have been asking for a seamless single sign-on capability with fine-grained access control so that you can do identity-based access control. They want support for open table formats with comprehensive auditing capabilities where you can track end user actions, and then you want a very good integrated AI experience while performing all your tasks with your data processing architectures. So let's see how we are working with our data processing engines to enable some of these features in the coming slides.
Trusted Identity Propagation and Fine-Grained Access Control with Lake Formation
So starting with identity, a couple of years back we introduced a feature called Trusted Identity Propagation. This is a feature from AWS IAM Identity Center which enables administrators to grant permissions based on user attributes such as user ID and group association. It is built on the OAuth 2.0 protocol and it allows you to add context to your existing IAM roles, and the IAM role with the embedded identity context can be passed to the downstream applications and services so that they can either propagate that identity or, if it's the end service which is doing the authorization, it can authorize based on that identity.
Right, so here is how it works. So you have a user, Alice, who is authenticating herself into a user-facing application, and this can be your SageMaker Unified Studio or a custom portal that you have built, and you can integrate that custom portal with IDC and have your users authenticate to that custom portal. And now that custom portal can work with our analytic engines such as EMR, Athena, Redshift, and AWS Glue and be able to pass Alice's identity to these analytic services, and analytic services can access data using Alice's credentials.
Now if we look at some of the benefits this feature has, you will notice that it allows for enterprises to build SSO-like experiences where data engineers and scientists access Apache Spark sessions in Jupyter Lab Notebook in SageMaker Unified Studio using their organizational or corporate credentials and not the IAM roles, eliminating the need for separate credentials and streamlining your authentication workflows. You have end-to-end traceability for the user actions, which means that you have comprehensive AWS CloudTrail logging that captures all activities from interactive Jupyter Lab sessions to the background processing jobs that are running on EMR and Glue and Athena.
You have a centralized security management where administrators can implement fine-grained access controls from Lake Formation and apply permissions based on that, and these permissions can go granular.
This provides a simplified compliance model where you can implement it, especially if you're in a regulated industry. This is very useful in all these scenarios where you want to work with multiple data science teams or environments and you want to enable them with single sign-on access to all the AWS services.
Here is how it works in the context of SageMaker Unified Studio. You have Charlie and Elle, who are users of SageMaker Unified Studio. They log into the SageMaker Unified Studio portal, and their credentials are passed from their project role to, let's say, a SageMaker training job; it could be an EMR job as well. Charlie and Elle's credentials are then passed further downstream when they try to access an S3 bucket using their own identities. What happens is that when Charlie logs into SageMaker Unified Studio, SageMaker Unified Studio, through its integration with Identity Center, authenticates Charlie using the token they get from their corporate identity provider. That token is exchanged for an Identity Center token, which gets embedded into their project role and passed downstream to the analytic services.
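For readers curious about the mechanics, the exchange roughly follows the pattern sketched below, using the IAM Identity Center OIDC token-exchange and STS APIs; the application ARN, role ARN, and identity provider token are placeholders, and in practice SageMaker Unified Studio and the analytics engines perform this exchange for you.

```python
import boto3

# Rough sketch of the Trusted Identity Propagation pattern: swap an IdP-issued
# token for an Identity Center token, then embed the identity context into an
# assumed role. All ARNs and the IdP token below are placeholders.
sso_oidc = boto3.client("sso-oidc", region_name="us-east-1")
sts = boto3.client("sts", region_name="us-east-1")

idc_token = sso_oidc.create_token_with_iam(
    clientId="arn:aws:sso::123456789012:application/ssoins-example/apl-example",  # placeholder
    grantType="urn:ietf:params:oauth:grant-type:jwt-bearer",
    assertion="<jwt-from-corporate-identity-provider>",  # placeholder
)
identity_context = idc_token["awsAdditionalDetails"]["identityContext"]

session = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/AnalyticsProjectRole",  # placeholder
    RoleSessionName="alice-tip-session",
    ProvidedContexts=[
        {
            "ProviderArn": "arn:aws:iam::aws:contextProvider/IdentityCenter",
            "ContextAssertion": identity_context,
        }
    ],
)
# The resulting credentials carry the user's identity, so CloudTrail and
# Lake Formation can evaluate access based on who the user is, not just the role.
```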
So that was about authentication. Now let's look at how data access control works in this scenario. When we talk to customers, we hear several use cases where customers require different types of data access. The first type, which is the most common and prevalent, is coarse-grained access control for S3 locations. To enable it, we have S3 Access Grants, where you can define SQL-like grants on S3 prefixes and buckets and use those permissions to return S3 location credentials to your Spark jobs or your Athena queries.
The idea behind S3 Access Grants is that when your IAM policies and bucket policies are hitting their size limits, you can switch to S3 Access Grants to overcome those limits and simply use SQL-like grants for a much simpler user-to-dataset permission mapping for your applications.
Now, this is how S3 Access Grants works with Spark sessions. It works with all EMR deployments starting with version 6.15 and AWS Glue with version 5.0 and above. The way it works is the user submits a Spark job. The Spark job runs on EMR or Glue, and when the Spark job runs, it requests credentials which is intercepted by S3 Access Grants. S3 Access Grants evaluates the permissions for that role that is running the job, and it returns the scoped-down credentials to access only that specific prefix or location that the job is trying to access. The job, the EMR engine, actually accesses the data in S3 using those scoped-down credentials.
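To make that concrete, here is a minimal sketch of the two sides of S3 Access Grants: an admin defining a grant, and a job requesting scoped-down credentials. Account IDs, location IDs, role ARNs, and prefixes are placeholders.

```python
import boto3

# Minimal sketch of S3 Access Grants: an admin defines a grant on a prefix, and a
# job later requests scoped-down credentials for that prefix. IDs are placeholders.
s3control = boto3.client("s3control", region_name="us-east-1")
account_id = "123456789012"  # placeholder

# Admin side: grant READ on a prefix to a job/runtime role.
s3control.create_access_grant(
    AccountId=account_id,
    AccessGrantsLocationId="<location-id>",  # registered S3 location, placeholder
    AccessGrantsLocationConfiguration={"S3SubPrefix": "curated/mortgage/*"},
    Grantee={
        "GranteeType": "IAM",
        "GranteeIdentifier": "arn:aws:iam::123456789012:role/SparkJobRole",  # placeholder
    },
    Permission="READ",
)

# Job side (done automatically by EMR/Glue): exchange the role identity for
# credentials scoped down to just the granted prefix.
creds = s3control.get_data_access(
    AccountId=account_id,
    Target="s3://<data-bucket>/curated/mortgage/*",
    Permission="READ",
)["Credentials"]
```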
Now, there are situations where you want to go beyond S3 and you want to use structured data. For example, your tables that you are storing in the Glue Data Catalog. In those scenarios, you want to be able to access data which is in your Glue Data Catalogs from your Spark sessions, whether it is interactive sessions through SageMaker Unified Studio or maybe you're running batch jobs. You want to be able to access that data and have it governed by Lake Formation. In those scenarios, you can use the full table access feature which we launched with EMR 7.9. The idea behind this feature is there are a lot of use cases where, for example, the data scientists need permissions to the full table instead of partial access.
There are situations where you have automated ETL workflows that need to read a full table and write to a Lake Formation registered table, or applications where you want a simple permission model in which users get full permissions on a table to do selects, inserts, or updates. In those scenarios, you can use full table access with Lake Formation from your Spark jobs running on EMR and Glue. This is a quicker implementation in which you don't even have to enable Lake Formation on the cluster or on the EMR Serverless application you're running.
So here is how full table access works. An admin enables full table permissions in Lake Formation for the tables you're interested in. When the user submits a job, the EMR or AWS Glue job reaches out to Lake Formation to get credentials; it doesn't go directly to S3, because the credentials are vended by Lake Formation. Lake Formation looks at the runtime role for that job and then vends the required scoped-down credentials for that user or job role. Those credentials are used by the EMR Spark engine to read data from S3 and return the results back to the user.
So there is a much more niche case where organizations have sensitive data that they want to protect, and in such scenarios you want to go with fine-grained access control. Now again, fine-grained access control allows you to grant permissions on your tables at a column, row, or even at a cell level. The idea behind fine-grained access control is that you can protect the sensitive information in your tables by minimizing the data exposure through these data filters that you can provide, and it is very useful in multi-tenant environments or in situations where you have multiple tenants trying to access with different functional roles. Also, the regulated industries that require stringent guidelines or policies to follow, this feature is immensely helpful there.
So here is how it works. You have an admin granting permissions to users or groups. Remember, we talked about trusted identity propagation; Lake Formation is also integrated with it, which means an admin can grant permissions at the user level. Once the permissions are granted in Lake Formation, when that same user submits a Spark job, EMR takes the job role with the user's identity embedded in it and reaches out to Lake Formation for credentials. Lake Formation evaluates the policies for that user and returns scoped-down credentials on the table based on the user or their group association. The engine then reaches out to S3, reads the data, applies the filters, and returns the results back to the end user.
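As an illustration of the admin side of this flow, the sketch below defines a Lake Formation data cells filter (row- and column-level) and grants SELECT through it; the database, table, filter expression, and principal are placeholders, not the exact setup from the talk.

```python
import boto3

# Sketch of fine-grained access control setup in Lake Formation: define a data
# cells filter (row + column level) and grant SELECT through it. Names are placeholders.
lakeformation = boto3.client("lakeformation", region_name="us-east-1")
catalog_id = "123456789012"  # placeholder account/catalog ID

lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": catalog_id,
        "DatabaseName": "customer360",
        "TableName": "loan_applications",
        "Name": "us_rows_no_ssn",
        "RowFilter": {"FilterExpression": "country = 'US'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["ssn"]},
    }
)

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataScienceProjectRole"},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": catalog_id,
            "DatabaseName": "customer360",
            "TableName": "loan_applications",
            "Name": "us_rows_no_ssn",
        }
    },
    Permissions=["SELECT"],
)
```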
So there is a whole slew of activity happening in the fine-grained scenario, and as a result there are stringent requirements on how you run fine-grained access control jobs on EMR clusters. Because the filtering itself happens on the EMR Spark engine, we have to protect the portion of the container that runs the filtering operation. That leads to some limitations: we don't allow RDDs or anything that could compromise the security of the filtering fleet, so there are some considerations when you are using fine-grained access control.
There are no RDDs, we have limited support for debugging, and custom JARs and UDFs are also not supported in this scenario. These are things to keep in mind when you want to use fine-grained access control, and they are limitations we want to address in future versions of EMR.
Integrated AI/ML Experience: SageMaker Notebooks, Spark Upgrade Agent, and Performance Enhancements
So with that, let me jump to a different topic, which is enriching our data architectures with integrated AI and ML experience. Now, earlier you have seen a few demos from SageMaker. Those were run in SageMaker Jupyter IDE. We also launched SageMaker Notebooks, which is an immersive experience for your data scientists, your data analysts, and your serverless fans who really want to get started within seconds. It intelligently plans and executes complex workflows. It has the capability to choose the language that is best suited for the task, and it allows you to scale your workloads effortlessly.
Let me talk a little bit about what this means. There's a Jupyter IDE experience which has the full set of features that an enterprise needs, which includes the Identity Center with its Trusted Identity Propagation, a full project construct that will allow you to organize your work, and a multitude of other features. But there are also a lot of customers who prefer to get started within minutes, and for those customers, SageMaker Notebooks is a fantastic tool to get started with. You can start quickly without having to pre-provision any data processing infrastructure.
The notebook gives your data engineers and data scientists a place to perform SQL queries, execute Python code, process large-scale data jobs, run machine learning workloads, and create visualizations without having to switch between tools. These notebooks are powered by Athena Spark, which is based on Spark version 3.5.6 and supports Spark Connect. Some of you are already familiar with the decoupled architecture of Spark Connect, where the client and server are separated: the client code can be part of your IDE itself, and your clients simply issue commands to a remote Spark running on a cluster such as EMR, or on an EMR Serverless application.
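As a point of reference, a minimal Spark Connect client looks like the sketch below; the remote endpoint and S3 path are placeholders, and SageMaker Notebooks wire up the connection for you rather than requiring this code.

```python
from pyspark.sql import SparkSession

# Minimal Spark Connect client: the notebook process holds only the lightweight
# client, while the DataFrame operations execute on the remote Spark server.
# The endpoint below is a placeholder; SageMaker Notebooks manage this for you.
spark = SparkSession.builder.remote("sc://<remote-spark-host>:15002").getOrCreate()

df = spark.read.parquet("s3://<bucket>/curated/mortgage/")
df.groupBy("state").count().show()
```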
Spark Connect lets your engineers get started with the notebooks very quickly and leverage the power of Spark to write different types of code in different cells. The notebooks also have a built-in AI agent, which accelerates development by generating code and SQL statements from natural language prompts while guiding your users through the process. Traditional tools typically offer code completion and suggestions but do not help with data discovery, and multi-step planning and reasoning are missing from some of them. With SageMaker Notebooks, as you can see on the right-hand side, there is an agent that already detects which tables you have in your Glue Data Catalog and the context of your notebook, and it can provide enough context to run your Python notebook.
Let me also talk a little bit about enriching your data architectures with a generative AI experience. In this area, we have introduced the Spark Upgrade Agent, a new feature that is available on EMR on EC2 and EMR Serverless for Python and Scala Spark jobs. It uses MCP-based tooling through a conversational interface to upgrade your Spark applications from, say, 2.4 to 3.5. It has multiple steps that take you from planning to code edits, with an error-driven validation loop and data quality guardrails to make the upgrade complete and correct.
The Spark upgrade agent workflow works as follows. There is a planning and orchestration phase, a build and dependencies phase, a code modification phase, and a data quality guardrail phase. Let's see this in action through Sam, who's been tasked with urgently upgrading their production Scala-based Spark applications from version 2.4 to 3.5 after a critical security vulnerability was discovered. Normally this would take weeks or months to complete. Let's watch how the agent upgrades it very quickly.
In the interest of time, I'm going to run through this very quickly. We are looking at a sample application that needs to be upgraded; what you see here is a Maven file targeting 2.4. Then we see sample code that uses a 2.4-based casting mechanism, and we'll see how the agent detects that Spark 2.4 behaves differently from 3.5 and automatically switches to the correct API.
I've posed the question to the Q agent. We have launched and activated the MCP server as before, and we have posed the question. As you can see, it very quickly builds a plan to walk through the steps of the upgrade. There are multiple steps here, including upgrading the dependencies, running all the unit and integration tests, validating on EMR on EC2, and generating the full upgrade summary.
Once it does that, you can see it running through each of those steps, first updating the build configuration in Maven. Once it completes that step, it shows you the major versions involved in each Spark version and the versions everything will be upgraded to when it goes to 3.5. As you can see, it shows the full breadth of the upgrades that are required, including highlighting the differences in the Maven build file.
Then it goes through a series of steps to make sure that not just the Spark version but also the underlying dependencies are met. Each of these dependency upgrades requires additional steps or tasks to be run. For example, Java requires additional arguments to be passed, so you will notice that it adds arguments where Java internal modules need to be accessed. All that missing configuration is added here.
Then it runs the validation tests, and when the build or the validation fails, the Spark Upgrade Agent goes through the errors step by step, fixing each one until it reaches a clean build. It runs a sample test that works on 3.5 exactly as it did on 2.4. We'll show more of this demo on our web pages and blogs, and you can catch it in many other sessions as well.
In the interest of time, I'm going to quickly close out with a couple of other features that we have launched, including the fastest Spark: we are now 4.4 to 4.5 times faster than open-source Apache Spark, and 2 times better for writes compared to open-source Spark with Iceberg. We also introduced EMR Serverless storage provisioning, which provides external remote shuffle storage so that your EMR jobs can eliminate disk bottlenecks and out-of-disk errors.
There are many features we have launched; this is just a glimpse of what we have launched this year. There's more on catalog federation and materialized views that you can check out in various other talks and our blog posts. With that, thank you all for coming, and I hope you'll give these features a try and share your feedback with us. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.