🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - FINRA: Accelerate Massive Data Processing with NVIDIA on AWS EMR (AIM279)
In this video, Felix from NVIDIA and Alain Menguy from FINRA discuss the RAPIDS Accelerator for Apache Spark, which enables GPU acceleration without code changes. FINRA, which processes 1.5 trillion market events daily and operates over 1,000 petabytes of storage in AWS, achieved a roughly 50% performance improvement and cost reduction on TPCDS benchmarks and production workloads. A join-heavy pipeline processing 11 terabytes, with 50 billion records per table, saw similar gains. However, success required collaboration with NVIDIA to optimize configurations and address GPU memory limitations and CPU fallback issues.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introducing the NVIDIA RAPIDS Accelerator for Apache Spark: GPU-Powered Data Processing
Welcome to our session. My name is Felix. I'm the product manager for the RAPIDS Accelerator for Apache Spark. Today we have a full agenda with lots to cover. We're going to introduce the NVIDIA RAPIDS Accelerator for Apache Spark first, and then my colleague from FINRA will share their story about leveraging this technology to accelerate their workloads.
Apache Spark is one of the most popular data processing frameworks available today. It's used by almost every enterprise and organization for a variety of needs, whether analytics, reporting, or machine learning. More importantly, data has been growing at an exceptional rate over the last couple of years, particularly with the rise of AI. To help enterprises and organizations handle this exceptional growth, we're introducing GPU acceleration.
What you see on the right-hand side is our stack. You have your Apache Spark workload and Spark application running as is. We're introducing a plug-in that you can install into the workflow with no code changes, leveraging the GPU acceleration layer running in the cloud or on-premises. This delivers cost savings and faster results. Although GPUs do have some cost associated with them, the acceleration is substantial enough that you get your data processed faster and cheaper overall.
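To make the "no code changes" point concrete, here is a minimal sketch of what enabling the plugin typically looks like: everything happens in configuration, assuming the spark-rapids jar is on the classpath (the app name and resource amounts below are illustrative, not a recommendation).

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the RAPIDS Accelerator is enabled purely through
// configuration; the application code that follows stays unchanged.
val spark = SparkSession.builder()
  .appName("existing-spark-job")
  // Load the RAPIDS SQL plugin (requires the spark-rapids jar on the classpath).
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Master switch; set to "false" to fall back to plain CPU Spark.
  .config("spark.rapids.sql.enabled", "true")
  // Standard Spark 3 GPU resource scheduling; values depend on the cluster.
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "0.25")
  .getOrCreate()
```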
To give you an idea of adoption, these are companies and enterprise organizations that have publicly discussed their use of this technology. In the last two years there are definitely more, but you can see a pattern across online services, retail, and financial services, which is particularly relevant today. Let me share an example around fraud detection. With fraud detection, you often need to process billions and billions of records, going back five, ten, or fifteen years to look for patterns. This pattern analysis uses time-series windowing operations, which is something GPUs excel at. As a result, we were able to achieve a 14x speedup on this particular workload with 90 percent cost savings. If you want to learn more about this, we actually partnered with the AWS FSI team last year to publish a post. You're welcome to use the QR code to open the webpage and see it yourself. We'd love to be in touch with you as well.
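The windowed time-series analysis mentioned here maps directly onto Spark window functions, which the accelerator can run on the GPU. A hedged sketch of that shape, assuming an active SparkSession `spark` (the dataset path, column names, and the 30-day window are hypothetical):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical transaction data; path and columns are illustrative only.
val trades = spark.read.parquet("s3://example-bucket/trades/")

// Rolling 30-day statistics per account: the windowed aggregation pattern
// fraud models use to look back over years of history.
val secondsPerDay = 86400L
val last30Days = Window
  .partitionBy("account_id")
  .orderBy(col("trade_ts").cast("long"))
  .rangeBetween(-30 * secondsPerDay, Window.currentRow)

val features = trades
  .withColumn("avg_amount_30d", avg("amount").over(last30Days))
  .withColumn("trade_count_30d", count(lit(1)).over(last30Days))
```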
FINRA's Journey: Processing 1.5 Trillion Market Events and Breaking the One-Hour Benchmark
Without further ado, I'd like to introduce our colleague from FINRA to talk about their experience working with this technology. I'm super excited about this because our presentation last year was actually what got them interested, and that is how we started working together. Perfect. Thank you, Felix, for the great introduction on Apache Spark RAPIDS running on NVIDIA GPUs. I'm Alain Menguy, Senior Director of Big Data and Performance Engineering at FINRA. I've been working at FINRA for more than twenty years, and I can tell you that I've been involved in many transformative and innovative projects, and this one, Spark RAPIDS on GPU, is absolutely one of them.
First, let me tell you who FINRA is. FINRA has been around for a while; we celebrated our eighty-fifth anniversary last year. Our mission is twofold: market integrity and investor protection. We ensure that all participants can trade with confidence and that the markets operate free from manipulation and fraud. Basically, we are protecting you as investors to ensure that your money is safe and sound. The nature of our business makes us an AI and big data company. We operate over one thousand petabytes of storage in the AWS cloud.
In March this year, we processed a peak volume of 1.5 trillion market events for a single day. Yes, I repeat, 1.5 trillion market events in a single day. Processing this volume requires a massive technological infrastructure and sophisticated systems.
Our business flow works as follows: after the market closes at 4 p.m. Eastern time, member firms have an obligation to report their trading activity before 8 o'clock the next day. From there, we have only a few hours to reconstruct the market activity. Then we run hundreds of pattern models looking for market manipulation and fraud. These models generate alerts when they find abnormalities, and those alerts are sent to our investigator team, who can then interactively query the system to better understand them.
Market volume keeps growing. Over the years it has grown dramatically, and the trend will continue upward over the next couple of years. Modern electronic trading platforms and high-frequency trading systems generate far more transactions than before. As volume scales up, so does the compute needed to handle it. Our challenge is not only business logic but also scale and cost efficiency, and that led us to evaluate the GPU acceleration option.
FINRA needs to constantly adopt new technologies. We cannot stand still. We are always looking for ways to develop new capabilities that will reduce costs, improve performance and concurrency, as well as strengthen our resiliency and security. Whatever we adopt must make performance and economic sense.
As you know, AWS and the open source community are constantly releasing new hardware, new services, and new execution models. For us, evaluating this innovation is not something we do once a year. It is part of our DNA. This is something we are constantly doing. It is not about chasing shiny objects. It is about deciding which EMR, Java, or Spark version we will be using. This ongoing benchmarking guides our architecture, roadmap, and platform decisions.
At FINRA, we rely on a benchmarking suite that blends industry-standard benchmarks with real FINRA production workloads. We use TeraSort to stress large-scale I/O and TPCDS to evaluate SQL planning, joins, and aggregations. We also benchmark 66 of our regulatory workloads with very large data sets, heavy shuffle, and complex join patterns.
For the sake of time I will skip that slide, but understand that we make a fair comparison: the same input data set, the same Spark configuration, and the same cluster configuration between runs. And of course we validate the output, making sure the results do not change from run to run.
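As a rough illustration of that validation step, a row-level equivalence check between two runs can be as simple as a symmetric difference. The paths here are hypothetical, and note that floating-point aggregates may need a tolerance-based comparison rather than exact equality:

```scala
// Hypothetical output locations for the two benchmark runs.
val cpuOut = spark.read.parquet("s3://example-bucket/results/cpu-run/")
val gpuOut = spark.read.parquet("s3://example-bucket/results/gpu-run/")

// If both runs produced the same rows, the symmetric difference is empty.
// exceptAll keeps duplicates, so row multiplicity is checked as well.
val onlyInCpu = cpuOut.exceptAll(gpuOut).count()
val onlyInGpu = gpuOut.exceptAll(cpuOut).count()
require(onlyInCpu == 0 && onlyInGpu == 0, "benchmark outputs diverged")
```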
This is really Moore's Law in action. Look at the performance improvement over the years just from the natural evolution of EMR, Spark, Java, and the underlying hardware. These are the results for our 9-terabyte TPCDS benchmark that we have been running through all those years. No code change, same data, same logic, just a newer Spark version, a newer engine, and newer instance families.
We went from 9 hours five or six years ago on EMR 5.24 down to 1 hour and 45 minutes on the latest EMR 7.10, released just a few weeks ago. That's a 5x performance improvement, and you can imagine the cost reduction from jumping onto those newer technologies and instance families. That's where Spark on GPU entered the story.
I saw a slide a few months ago from an NVIDIA team presentation, and it really caught my eye. We are early adopters in the cloud. Our first workload on the cloud was in 2013, back in the Hadoop era, when we were using Apache Hive to run our SQL queries. Then Apache Spark came along and changed the game with in-memory processing, giving us a major leap in speed and efficiency. When I saw that slide, I thought it was really speaking to me. So the question became: is GPU-accelerated Spark the next leap in technology? That's what we wanted to find out.
So let's use TPCDS at 9 terabytes: breaking the one-hour mark. We ran the exact same workload that took 1 hour and 45 minutes on EMR 7.10, this time on the latest GPU instance type, and it completed in exactly 53 minutes. That's roughly a 50% reduction in processing time, and roughly 50% in cost savings as well, because the cluster runs for a shorter period of time. Faster and cheaper, without a code change.
Real Production Workloads: Collaboration, Optimization, and the Path Forward for GPU Acceleration
That's the moment we said, well, let's do it on some real production workload. This is not synthetic. This is a real production workload. The total input of this dataset is 11 terabytes, and this pipeline is really join-heavy and shuffle-intensive. We are joining two very large tables with 50 billion records in each. One has 125 columns and the other has 25. We join on processing date and a row ID, which is part of the business logic in our application. The data is partitioned by trading date.
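A minimal sketch of that pipeline's shape, with hypothetical table locations and key names (the actual schema and business logic are FINRA's and are not shown here):

```scala
// Hypothetical inputs standing in for the two 50-billion-row tables.
val wide   = spark.read.parquet("s3://example-bucket/wide-table/")   // ~125 columns
val narrow = spark.read.parquet("s3://example-bucket/narrow-table/") // ~25 columns

// Join on the composite key described above; with the RAPIDS plugin
// enabled, this join and the shuffle around it are candidates for the GPU.
val joined = wide.join(narrow, Seq("processing_date", "row_id"))

// The output keeps the original partitioning scheme by trading date.
joined.write
  .partitionBy("trading_date")
  .parquet("s3://example-bucket/joined-output/")
```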
We didn't change the code. This application is written in Scala, and we ran it on both CPU and GPU. Again, we saw a 50% reduction in processing time and around 45% lower cost. No code change, and this is really where we started to sell ourselves on it: there was something we really needed to explore, because this is mind-blowing as a performance and cost-saving opportunity.
One of the key takeaways I want you to remember is that we didn't just flip a switch to get that massive performance gain. On our first try, the GPU run took 5 hours. We contacted the NVIDIA team, and we had very good collaboration between the two engineering teams to identify the bottleneck. We explored the plan, saw that the GPU was sometimes falling back to the CPU, identified a number of bottlenecks, and the NVIDIA team produced a new build of the plugin based on those findings for this particular workload. So a key takeaway: yes, no code change, but it's not always plug and play.
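One concrete way to see that fallback, for anyone reproducing this kind of investigation: the plugin has a real setting that reports why operators stayed on the CPU. A small sketch, where the query itself is just a stand-in:

```scala
// Assumes an active SparkSession `spark` with the RAPIDS plugin loaded.
// Ask the plugin to report every operator it could NOT place on the GPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Any query works as a probe; when the plan is built, the plugin writes
// the fallback reasons (unsupported expression, data type, and so on)
// to the driver log.
val probe = spark.range(1000000L)
  .selectExpr("id", "id % 100 AS bucket")
  .groupBy("bucket")
  .count()
probe.collect()
```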
Then another workload: data extraction. I've been telling you that member firms submit their files to us, and they send them as BZ2-compressed CSV files.
We need to decompress them, read them, parse them into typed columns, and then convert them to Parquet using Snappy compression. Since firms send us hundreds of thousands of files on a daily basis, this application runs quite a few times over a single day, making it an iterative process.
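A hedged sketch of that ingest step. The schema, columns, and paths are invented for illustration; Spark decompresses .bz2 inputs transparently:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema; supplying it explicitly avoids a costly inference
// pass over hundreds of thousands of small files.
val schema = StructType(Seq(
  StructField("firm_id",  StringType),
  StructField("trade_ts", TimestampType),
  StructField("amount",   DoubleType)
))

spark.read
  .schema(schema)
  .option("header", "true")
  .csv("s3://example-bucket/incoming/*.csv.bz2") // bz2 handled natively
  .write
  .option("compression", "snappy") // Snappy-compressed Parquet output
  .parquet("s3://example-bucket/curated/")
```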
Initially, CPU and GPU performance were even. We collaborated with the NVIDIA team to identify the bottleneck, and we discovered that the original application was using the Dataset API, which is not a good fit for the GPU. We moved to the DataFrame API, and through multiple iterations we were able to reduce the time by 2x and eventually by 10x, a major improvement. This strong collaboration with the NVIDIA team was essential to achieving these results.
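The reason the Dataset API hurt here is that typed lambdas are opaque JVM functions the plugin cannot translate, while DataFrame Column expressions can be compiled for the GPU. An illustrative before/after, with an invented domain type rather than FINRA's actual code:

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a stable SparkSession named `spark`

// Invented domain type for illustration only.
case class Trade(firmId: String, amount: Double)

val trades = spark.read.parquet("s3://example-bucket/trades/")

// Before: Dataset API. The lambdas are black boxes to the optimizer and
// to the RAPIDS plugin, forcing row-by-row execution on the CPU.
val dsResult = trades.as[Trade]
  .filter(t => t.amount > 1000.0)
  .map(t => (t.firmId, t.amount * 0.01))

// After: DataFrame API. The same logic as Column expressions, which the
// plugin can recognize and run on the GPU.
val dfResult = trades
  .filter(col("amount") > 1000.0)
  .select(col("firmId"), (col("amount") * 0.01).as("fee"))
```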
Looking at industry benchmarks and some FINRA production workloads, we found we could get 2x faster joins and lower costs on some of the workloads we tested. RAPIDS integrates very well with Spark and DataFrame pipelines. In some cases no code changes were needed, and the runtime and cost savings were consistent once we tuned the configuration. However, GPU memory has its limits, and spill to disk needs to be managed very carefully. Some operators fall back to the CPU, creating mixed execution plans, and the RAPIDS configuration is not straightforward. Open source can be finicky: a single parameter change can produce a dramatic performance improvement or, sometimes, a degradation. Additionally, GPU instances were not always easy to obtain, especially when we needed several at once, so we established a relationship with NVIDIA to secure capacity for these workloads.
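For readers who want to experiment, here are a couple of the real spark-rapids knobs behind those trade-offs, with illustrative values only; the right settings depend on the GPU model, its memory, and the workload:

```scala
import org.apache.spark.sql.SparkSession

val tunedSpark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // How many tasks may share one GPU concurrently: higher values improve
  // utilization but raise GPU memory pressure and the risk of spill.
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  // Pinned host memory pool for faster host<->GPU transfers; also where
  // data lands when the GPU has to spill.
  .config("spark.rapids.memory.pinnedPool.size", "2g")
  .getOrCreate()
```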
This marks the end of the first proof of concept we conducted with GPU and Apache Spark RAPIDS. Not every workload gets an advantage, and you may need to spend considerable time identifying the bottleneck. We now have a process to validate CPU versus GPU execution time, so we can test more workloads. Based on the performance benefits and cost reductions we have seen, we are hopeful that GPU will evolve from an experiment into a strategic performance lever. This is an evolution for us, and we are going to add GPU and Spark RAPIDS to our big data stack. Today, CPU remains our default, but GPU will certainly be our accelerator path going forward.
Thank you, and we are happy to take questions. You're welcome to try it out: visit the URL provided, and feel free to contact us at the email address shown as well.
This article is entirely auto-generated using Amazon Bedrock.