🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerating data engineering with AI Agents for AWS Analytics (ANT215)
In this video, Shubham Mehta, Product Manager for AWS Analytics, introduces AI agents designed to reduce the 60-70% of time data teams spend on undifferentiated tasks. He announces two major launches: the industry's first Apache Spark upgrade agent for Amazon EMR that automates upgrades from Spark 2.4 to 3.5 through planning, code modification, building, and data quality testing; and the SageMaker Data Agent, a role-specific agent aware of business data and catalogs that performs multi-step planning for tasks like building machine learning pipelines. The demo showcases building a customer lifetime value prediction model using linear regression with 81% R² score, demonstrating the agent's ability to handle SQL queries, data visualization, feature engineering with one-hot encoding, and polyglot notebook capabilities across 320 preinstalled packages. Both agents use MCP-based tooling for IDE integration.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Challenge: Data Teams Spending 60-70% of Time on Undifferentiated Tasks
Good afternoon, everyone. Today, we will be discussing AI agents for AWS Analytics. Let me start with a number. Data teams spend 60 to 70 percent of their time on undifferentiated tasks. Data engineers are spending months doing upgrades. Data scientists are spending a significant portion of their time preparing data for analysis. Platform teams are firefighting issues.
Now, today we do have AI assistants that can take you the first step and give you code. However, these AI coding assistants are often not aware of your data, your resources, or the work you have already done in your environment. That's why they frequently give you code that will not work, and you end up iterating on it, customizing it for your environment until it does. But today we are going to change that.
I'm Shubham Mehta, Product Manager for AWS Analytics, and I've been focused on bringing AI agents to analytics directly within your workflows. Today we are going to talk about AI agents that can actually solve the problems that you're looking to solve. They can help you upgrade your Spark code. They can help you build code faster.
Before we get into details, let's quickly look at the agenda. We are going to talk about the overarching strategy of how we are approaching AI agents in the analytics space. We're going to look into one of the agents, the SageMaker Data Agent, and then we are going to look into a Spark upgrade agent for Amazon EMR, which can help you upgrade your Spark code.
Now, before we get into details, let me first describe the problem a little bit further. When we talk with customers, we hear problems in three specific areas. First, we are seeing workflow complexity. Every task that you want to do in your organization requires multiple tools to be used, which means that if you're trying to build a machine learning model, you will end up having to use four or five tools for that simple task. If you're trying to write a Spark application or upgrade a Spark application, you need to understand your build system, you need to write the Spark code, you need to upgrade and change the Spark code, and so on.
Second is the knowledge gap. You don't have a single engineer who is an expert in everything. You have multiple engineers who are experts in different aspects of the entire workflow, where some might be good at Spark and some might be good at SQL. Lastly, we see the capacity crunch, where data is growing ten times every three years, but data teams are not growing at the same scale.
Now we think that AI agents can actually solve this problem because the agents that we are trying to build can orchestrate workflows end to end. They can actually embed deep expertise for Spark, SQL, and other complex engines within them, and they can scale infinitely.
AWS Analytics AI Agents Strategy: Domain-Specific, Role-Adapted, and MCP-Based Interoperability
Now, before we get into the agents, let's look at the guiding principles we have for AI agents in AWS Analytics. Four things guide our approach. First, domain-specific agents. Over the years we have learned a lot about the problems customers face with Spark and with SQL, and we are embedding all of that expertise into domain-specific agents.
Second, we want the agents to be adapted to your role and to your tools. If you are a data engineer or a data scientist, you get a different experience than you would as a software engineer. Third, a multi-agent ecosystem. We don't think one agent can solve all the problems. Just as no single engineer was ever enough to solve everything, we are building an ecosystem of agents that each solve certain aspects of the problem and then collaborate to handle the entire end-to-end analytical workflow.
Lastly, we believe in MCP-based interoperability, because we know you are used to your IDEs and your tools, and you want to use these agents where you're already working rather than going to the AWS console for everything.
Now that we have covered the high-level principles, let's look at the launches we have made around them, so you can see that these are not just principles: we have actually followed them and made them a reality.
Apache Spark Upgrade Agent: Industry's First Automated Spark Migration Tool for Amazon EMR
The first agent, which we launched just yesterday, is the Apache Spark upgrade agent, and we are proud to announce it as the industry's first automated Spark upgrade agent. It takes you through the complex process of a Spark upgrade, from planning to code edits to building your Spark application and running data quality tests across your application and its data.
This agent is available as of yesterday for Amazon EMR on EC2 and EMR Serverless, and it can take you from Spark 2.4 to Spark 3.5, with Spark 4.0 support coming soon. The entire agent is based on MCP tooling. We are launching a remote MCP server that you can configure in the IDE of your choice and use directly there. Once you have set up the MCP server, you simply say, "I want to upgrade my Spark application from this version to that version, and here is the application in this project," and it will read the code and go through all the steps.
We'll look at the steps in detail in a second. The agent takes you through four steps. The first is planning and orchestration: when you give it your project, it analyzes the project structure. It looks at how you do the Spark submit to EMR, because the agent's approach differs based on that. It checks what language you are using and whether your project has integration tests, and based on that it defines the steps it will take during the entire upgrade process.
Here you can give feedback such as, "No, I don't want to take this step, I want to skip the integration tests, can you update the upgrade plan?" and it will update the plan. Then it looks at your dependencies, whether you have a pom.xml or a requirements.txt, goes through them, identifies the dependencies that need to change for Spark 3.5, makes those changes, and then goes through your build process and actually builds your application.
Right now it supports Maven and SBT-based builds, but because it is MCP based, you can bring in your own MCP servers to hook into a custom build process and ask the agent to use that process to build the application. Once it has gone through all the dependencies, it gives you a list of updated dependencies that you can verify before moving to the next step.
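To make the dependency step concrete, here is a minimal sketch, in the spirit of what the agent automates, of scanning a pip-style requirements file for pins that look too old for Spark 3.5. The file contents, the version thresholds, and the comparison logic are illustrative assumptions, not the agent's actual knowledge base.

```python
import io
import re

# Illustrative requirements.txt content; in practice this comes from the
# project being upgraded.
requirements = io.StringIO(
    "pyspark==2.4.8\n"
    "pandas==0.25.3\n"
    "boto3>=1.20\n"
)

# Hypothetical minimum versions assumed for a Spark 3.5 runtime; the real
# agent derives these from its own upgrade knowledge base.
SPARK_35_MINIMUMS = {"pyspark": "3.5.0", "pandas": "1.0.5"}

pin = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(==|>=|<=|~=)\s*([\d.]+)")


def as_tuple(version: str) -> tuple:
    """Turn a dotted version string into a comparable tuple of integers."""
    return tuple(int(part) for part in version.split("."))


for line in requirements:
    match = pin.match(line.strip())
    if not match:
        continue
    name, _, version = match.groups()
    minimum = SPARK_35_MINIMUMS.get(name.lower())
    if minimum and as_tuple(version) < as_tuple(minimum):
        print(f"{name} {version} likely needs to move to >= {minimum} for Spark 3.5")
```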
In the next step, it looks at your code against all the breaking changes Spark has introduced and makes sure your code is not affected by any of them. This code modification process is based on an error-driven loop: we run the code on the EMR cluster you have provided for the target version and look at the errors your application hits. We have built a first-of-its-kind knowledge base of Spark breaking changes, and we use it to make the minimal changes needed for your code to run successfully on the target version.
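As a rough illustration of that error-driven loop, the sketch below runs a simulated job, reads the resulting error, looks up a matching rule in a toy knowledge base, applies a minimal patch, and retries. The single rule shown (the Spark 3.0 rename of the Arrow config key) is real, but the run/patch functions and the knowledge base format are stand-ins for the agent's EMR-driven process.

```python
import re
from typing import Optional

# Toy knowledge base entry: an error fingerprint, the pattern to patch, and the
# minimal replacement. The real agent's knowledge base of Spark breaking changes
# is far larger and is driven by actual errors from EMR runs.
KNOWLEDGE_BASE = [
    {
        "error_hint": "spark.sql.execution.arrow.enabled",
        "pattern": re.compile(r"spark\.sql\.execution\.arrow\.enabled"),
        "replacement": "spark.sql.execution.arrow.pyspark.enabled",
    },
]


def run_on_target_cluster(source: str) -> Optional[str]:
    """Stand-in for submitting the job to the target-version EMR cluster and
    capturing its failure message; None means the run succeeded."""
    if "spark.sql.execution.arrow.enabled" in source:
        return "config spark.sql.execution.arrow.enabled was renamed in Spark 3.0"
    return None


def upgrade(source: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        error = run_on_target_cluster(source)
        if error is None:
            return source  # the application now runs cleanly on the target version
        for rule in KNOWLEDGE_BASE:
            if rule["error_hint"] in error:
                source = rule["pattern"].sub(rule["replacement"], source)
                break
        else:
            raise RuntimeError(f"No known fix for: {error}")
    raise RuntimeError("Upgrade did not converge")


legacy_job = 'spark.conf.set("spark.sql.execution.arrow.enabled", "true")'
print(upgrade(legacy_job))
```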
The last step is making sure that the data the application produces after the upgrade is exactly the same as the data it produced on your prior Spark version. It's not just about running the application successfully; it's also about making sure that the output data matches what it was producing before.
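The idea behind that final validation can be illustrated with a few lines of PySpark. The S3 paths below are hypothetical placeholders for the outputs written by the pre-upgrade and post-upgrade runs; the agent's own data quality testing is more thorough, but the shape of the check is similar.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upgrade-data-quality-check").getOrCreate()

# Hypothetical output locations written by the same job before and after the upgrade.
before = spark.read.parquet("s3://my-bucket/output/spark-2.4/")
after = spark.read.parquet("s3://my-bucket/output/spark-3.5/")

# Cheap checks first: the schema and row counts should be unchanged.
assert before.schema == after.schema, "schema drift between versions"
assert before.count() == after.count(), "row count changed after upgrade"

# Row-level comparison in both directions; both differences should be empty.
assert before.exceptAll(after).count() == 0, "rows lost or altered after upgrade"
assert after.exceptAll(before).count() == 0, "unexpected new rows after upgrade"

print("The upgraded application reproduces the original output.")
```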
SageMaker Data Agent: Building Machine Learning Pipelines with Catalog-Aware Code Generation
The next release in this segment is the SageMaker Data Agent. The previous one was a domain-specific agent; in this case, we have built an agent that is specific to the data engineer, data scientist, and data analyst personas. What makes this agent different is that it is aware of your business data and catalog. It uses MCP-based tooling to get information from your catalog, such as what tables you have and their schemas, and then when you ask it to write a query, it writes one that runs without you modifying anything.
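Under the hood the agent reaches the catalog through MCP tooling, but the kind of metadata it works from can be fetched directly with the AWS Glue API, as in this short sketch. The database and table names are illustrative, loosely based on the demo.

```python
import boto3

glue = boto3.client("glue")

# Illustrative names loosely based on the demo; substitute your own database and table.
response = glue.get_table(DatabaseName="sagemaker_sample_db", Name="digital_wallet_ltv")

# Print each column name and type, the metadata an agent would use to ground a query.
for column in response["Table"]["StorageDescriptor"]["Columns"]:
    print(f"{column['Name']}: {column['Type']}")
```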
This agent can also do multi-step planning. It can take a complex task, say building a machine learning pipeline, divide it into five or six steps, and take you through each of them; we'll see this in action in a second. It helps you write the code for each step separately, verifying along the way whether things are working.
If you run into an issue, you can troubleshoot using a Spark troubleshooting agent that runs behind it. Say you are writing Spark code and hit an issue: in that case we rely on the Spark troubleshooting agent, a third agent that we have, to fix the Spark-specific problem.
And if it's not a Spark-specific issue, we resolve it without relying on the troubleshooting agent. The Data Agent also has security guardrails built in that prevent destructive actions in your account. If you ask it to write code that deletes a table, it gives you appropriate warnings and makes sure you are aware of the action you are taking.
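The guardrails are built into the agent itself, but the basic idea can be sketched as a pre-execution check that flags obviously destructive statements, as in this illustrative snippet (the keyword list is an assumption, not the agent's actual policy).

```python
import re

# Statement types a guardrail would typically treat as destructive.
DESTRUCTIVE = re.compile(r"^\s*(DROP|DELETE|TRUNCATE)\b", re.IGNORECASE)


def review(sql: str) -> str:
    """Return a warning for destructive statements, otherwise approve the query."""
    if DESTRUCTIVE.match(sql):
        return f"WARNING: destructive statement, please confirm before running: {sql.strip()}"
    return "OK to run"


print(review("DROP TABLE digital_wallet_ltv"))
print(review("SELECT * FROM digital_wallet_ltv LIMIT 10"))
```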
Now I'll go over the demo, in which I take on the role of a data scientist trying to build a machine learning model to predict customer lifetime value. The reason we want to predict lifetime value is so we can offer customers the right incentives early in their journey and make sure our business grows with them. Let's go over the demo quickly.
Now, in this case, we are in the notebook interface, with the SageMaker Data Agent on the right-hand side. We are simply going through the discovery phase, asking it: can you list all the tables I have in my database? It finds three tables in the database, and among them is a digital wallet LTV table.
The first thing I want to see is some sample data from the table. I ask it to use Athena SQL to write a query for the digital wallet LTV table, and the agent gives me a very simple query selecting from the SageMaker sample database. This is the new notebook we have launched, which renders data frames as interactive tables. You can see the columns I have: customer satisfaction score, support tickets, and preferred payment method.
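Outside the agent, the equivalent discovery and sampling steps can be reproduced with the AWS SDK for pandas (awswrangler). The database and table names below are assumptions based on the demo, so adjust them to your own catalog.

```python
import awswrangler as wr

# Illustrative database name based on the demo; adjust to your own catalog.
database = "sagemaker_sample_db"

# List the tables the agent would find during the discovery phase.
print(wr.catalog.tables(database=database))

# Pull a small sample of the LTV table through Athena into a pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM digital_wallet_ltv LIMIT 10",
    database=database,
)
print(df.head())
```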
Next I ask the agent that I want to explore more: can you help me analyze the impact of customer satisfaction score on LTV trends? The agent understands that you want to use the data already loaded in your notebook and builds on it. You can see the agent say that it will create a visualization and produce multiple charts. It reuses the same data frame that was loaded with SQL, and that data frame is now used in Python code.
These notebooks are polyglot in nature: the work you've done in SQL can be reused in Python. The agent comes up with graphs, and you can see that LTV increases with customer satisfaction score. The agent created this entire code; I didn't have to edit a single line, and it ran successfully. It also produced a high-level overview of the minimum, maximum, and median LTV for the different satisfaction scores.
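The agent generated its own plotting code in the demo; the sketch below is a self-contained approximation of the same analysis on a synthetic stand-in table, since the demo data isn't reproduced here. The column names (customer_satisfaction_score, ltv) are assumptions based on the demo.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the digital wallet LTV table; column names and the
# relationship between the columns are assumptions based on the demo.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_satisfaction_score": rng.integers(1, 6, size=500),
    "ltv": rng.gamma(shape=2.0, scale=300.0, size=500),
})
df["ltv"] += df["customer_satisfaction_score"] * 150  # LTV rises with satisfaction

# High-level overview: min, median, and max LTV per satisfaction score.
summary = df.groupby("customer_satisfaction_score")["ltv"].agg(["min", "median", "max"])
print(summary)

# Distribution of LTV by satisfaction score.
df.boxplot(column="ltv", by="customer_satisfaction_score")
plt.title("LTV by customer satisfaction score")
plt.suptitle("")
plt.xlabel("Customer satisfaction score")
plt.ylabel("LTV (synthetic)")
plt.show()
```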
Now we get to the final task: I want to build a linear regression model to predict LTV, with an 80/20 train-test split and one-hot encoding. The agent breaks the task into multiple steps and produces the code for each step; I'll go over the code in a second.
In the first piece of code, we figure out which features are categorical so that we know how to handle them in the actual run. It identified four or five categorical features, which will be handled with one-hot encoding. The agent uses one-hot encoding to expand these categorical features into multiple columns. In the end it produced code with 23 features: 18 numerical features, with the rest being categorical features created through one-hot encoding. It then created the model, which has an 81% R² score, meaning it can explain 81% of the variability.
Finally, it created a graph of predicted versus actual LTV, showing the actual LTV trend alongside the predicted values. We created a simple model here, but we could have gone on to create a more complex model as well.
Here we can see the feature importance: a low income level decreases LTV, a middle income level also decreases it, and a high customer satisfaction score increases it. The entire code was created by the agent end to end. It went through the train-test split, feature engineering, preparing the data for one-hot encoding, and then conducted the analysis and provided the feature importance results.
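To make the modeling steps concrete, here is a self-contained sketch of the same recipe (80/20 split, one-hot encoding of categorical features, linear regression, R², coefficient-based feature importance) using scikit-learn on synthetic data. The column names, the generated relationships, and the resulting scores are illustrative and will not match the demo's 81% R² exactly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the digital wallet LTV table; the column names and
# relationships are assumptions based on the demo, not the real dataset.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "customer_satisfaction_score": rng.integers(1, 6, size=n),
    "support_tickets": rng.poisson(2, size=n),
    "income_level": rng.choice(["low", "middle", "high"], size=n),
    "preferred_payment_method": rng.choice(["card", "wallet", "bank"], size=n),
})
income_effect = df["income_level"].map({"low": -300, "middle": -100, "high": 400})
df["ltv"] = (
    500
    + 200 * df["customer_satisfaction_score"]
    - 50 * df["support_tickets"]
    + income_effect
    + rng.normal(0, 150, size=n)
)

categorical = ["income_level", "preferred_payment_method"]
numerical = ["customer_satisfaction_score", "support_tickets"]

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    df[categorical + numerical], df["ltv"], test_size=0.2, random_state=42
)

# One-hot encode the categorical features, pass numerical features through,
# then fit a linear regression on the expanded feature matrix.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the held-out 20%: {r2:.2f}")

# Linear-model coefficients double as a rough feature-importance view.
feature_names = model.named_steps["encode"].get_feature_names_out()
coefficients = pd.Series(model.named_steps["regress"].coef_, index=feature_names)
print(coefficients.sort_values())
```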
You can use this same agent for building data pipelines. If you are building data pipelines or want to run queries on S3 tables or the AWS Glue catalog, the same agent can write Spark, DDB, or Polars code to run those data transformations using whichever engine you prefer.
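As a flavor of the kind of transformation code the agent can generate for a pipeline, here is a small, hypothetical Polars example; the S3 path and column names are placeholders, not part of the demo.

```python
import polars as pl

# Hypothetical input location and column names; the agent would generate the
# equivalent transformation against your actual tables.
transactions = pl.scan_parquet("s3://my-bucket/transactions/*.parquet")

# Aggregate each customer's spend by calendar month, lazily, then collect.
monthly_spend = (
    transactions
    .with_columns(pl.col("transaction_date").dt.truncate("1mo").alias("month"))
    .group_by("customer_id", "month")
    .agg(pl.col("amount").sum().alias("monthly_spend"))
    .sort("customer_id", "month")
    .collect()
)
print(monthly_spend.head())
```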
In this notebook, we have preinstalled around 320 packages, including all the essential packages you would need for complex analytical tasks, among them the SageMaker SDK. Of course, you have the freedom to install more packages if you prefer. If you want to learn more about these two agents, I highly recommend scanning these QR codes. They lead to the documentation, where you can see the Data Agent's other capabilities and how to set up the Spark upgrade agent.
In the Spark upgrade agent documentation, you can see the remote MCP server configuration we provide. I highly recommend you go in, use VS Code or whatever IDE you have, set up the MCP server, and try the agent on a sample application before using it to upgrade your production applications. That's a wrap on our lightning talk today on how you can accelerate data engineering. I really appreciate you taking the time to come and listen.
; This article is entirely auto-generated using Amazon Bedrock.