🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Illumina DRAGEN pipelines on F2 instances with Nextflow & AWS Batch (CMP353)
In this video, Marissa Powers from AWS and Sean O'Dell from AstraZeneca Centre for Genomics Research discuss migrating genomics pipelines from F1 to F2 FPGA instances. The migration achieved a 62% performance speedup and 71% cost reduction while maintaining exact equivalence in results. They demonstrate the architecture using AWS Batch, Illumina DRAGEN, and Seqera Nextflow for processing thousands of samples with dynamic resource provisioning. Key technical details include F2.6xlarge specifications with 24 vCPUs and improved cores per chip, concordance testing methodology, and storage recommendations using local NVMe versus EBS volumes.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: AWS and AstraZeneca Collaboration on Genomics Pipelines
Hi everybody, welcome. Thank you for your patience. We are ready to go. This is a 20-minute lightning session. In the next 20 minutes or so, we are going to talk about work that we at AWS have done with AstraZeneca running their genomics pipelines on the cloud, specifically on our latest generation of F2 instances. I am Marissa Powers, a high performance computing specialist solutions architect focused on life sciences. I spend most of my time working on drug discovery R&D with our largest pharma customers like AstraZeneca. We are fortunate to be joined by Sean O'Dell from the AstraZeneca Centre for Genomics Research, and this is work that we did together.
We are going to cover why we did this work, why we run these genomics pipelines on latest generation F2 instances on AWS. Sean is going to cover that piece. We are going to cover what we discovered when we compared to previous generation instances, and because this is re:Invent and we like to go into the details, we will talk about how we built it, the architecture, and we will end with a demo. Before we go through why, what, and how, let us cover who. I am Marissa, and this is Sean.
Manu Pillai is a fellow HPC Life Sciences solutions architect based in the UK. He built out the AWS Batch and Nextflow infrastructure that we use to run these pipelines. If you do not know what those things are, that is okay; we will cover it in the next few slides. We also had lots of help from Hijun and Shamal from Illumina, Natalia, Eric, and Omar from AWS, and Gabriel Hernandez at AstraZeneca, who did the concordance testing that Sean will cover.
AstraZeneca's Genomic Processing Requirements and F1 to F2 Migration Results
With that, I will hand it to Sean to talk about how AstraZeneca runs these pipelines on AWS. Great, thank you for that, Marissa, and thank you for joining us today. At AstraZeneca, we think about genomic processing in three categories. First, we develop standardized workflows and pipelines across different modalities; for example, for an exome we run a slightly different workflow than we would for a whole genome. It is really important that when we do a migration or upgrade these workflows, we are able to demonstrate that the results after the upgrade are exactly the same as the results before the upgrade. This matters in genomics processing, where some of the results might be used to inform drug discovery or clinical trials, so we need to be able to go back at any given time and reproduce those results.
The next thing we look at is that this type of processing produces massive amounts of data, like petabytes of data and tens of millions of files. We need to be able to store and retrieve that in a cost-performant manner. We need to have some type of catalog that keeps track of all those files, and then eventually we need to take the insight that we extract from the data and make that available to scientists in downstream analytics. The third point I wanted to make is that this workload is very spiky. We receive large batches of samples during the course of a year, for example. It is not a steady-state type of processing.
When we receive these large batches of samples, we want to get through them as quickly as possible using DRAGEN from Illumina, so that we get the benefit of the cost we have invested in it. When we think about this migration from F1 to F2, we need some kind of test case, because we need to prove that the results on F1 match the results on F2. In this case, it is very simple: we take samples from each modality, like exome sequences and whole genome sequences, as a known set of data. We run exactly the same DRAGEN command line on F1 as we do on F2, and we use the exact same reference files. It is a straightforward thing to conceptualize: F1 should give us exactly the same results as F2, but we are hoping for performance gains and cost reduction.
When we did this testing, you can see here that we have up to a 62 percent speed up, and the only thing that is different is moving from F1 to F2. The DRAGEN software version is exactly the same, as I mentioned, the samples are exactly the same, and the reference files are exactly the same. You can see that this performance speed up is consistent whether it is an exome or a genome, so we are quite happy about that. We can do the same amount of work in half the time, so to speak. From a cost point of view, we see a 71 percent reduction in cost, which is quite significant. It means we can take that money and reinvest it in different parts of the business to drive different aspects of the science. That is the speed up we got and the cost reduction we achieved.
We also ran equivalence tests to make sure the results from F1 exactly match the results from F2. I'm happy to say we can confirm that: using bioinformatics tools, we verified that the variants we see when we run on F1 and on F2 are identical, and all the metrics that DRAGEN produces are exactly the same, apart from header fields that depend on the run date and so forth.
Technical Architecture: AWS Batch, Illumina DRAGEN, and Seqera Nextflow Integration
With that, I'm going to turn it back to Marissa and she's going to go through how this was done and the architecture of AWS. As Sean mentioned, we see a 62% speed up with next gen FPGA-based instances and a 71% cost reduction. Let's talk about why that is.
These are the specifications for the two instance types. The first row is the F1, which is prior generation. The second row is F2, which is latest generation. We want to point out a few key factors here. The first one is the number of vCPUs. The F2.6xlarge has 24 vCPUs as opposed to the prior generation which has 16. That's 50% more cores.
If you look at cores per chip, you'll notice that the F1 has two FPGA chips while the F2 has one. That works out to a 200% increase in cores per chip: 16 vCPUs across two chips is 8 per chip on the F1, versus 24 vCPUs on a single chip on the F2. These DRAGEN processes run on a single FPGA chip, so the number of cores per chip is the largest contributor to the performance increase that we see. Additionally, the F2 instances have up to 16 gigabytes of high bandwidth memory.
As we mentioned before, there's a 62% performance speed up, but an even proportionally larger cost reduction. Why is that? It's because the latest generation instances actually have a lower per-hour cost. So even if they had the same performance, you'd see cost savings; since they also perform better, the cost savings are proportionally higher.
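As a rough worked example with illustrative round numbers (actual instance pricing varies by region and purchasing model, so treat this as a sketch of the arithmetic, not published pricing): per-sample cost is roughly hourly price times runtime. If a sample that took 10 hours on F1 finishes in about 6 hours on F2, and the F2 instance's hourly price is about half of the F1's, the cost ratio is about 0.6 × 0.5 = 0.3, a reduction of roughly 70 percent, in line with the 71 percent figure reported above.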
This is helpful for capacity planning. It's helpful to know when you have to process tens of thousands of samples a year as they do, to understand what the cost implications of that are going to be. Let's talk through in more detail how we did this testing and how many of our customers architect for running genomics pipelines at scale.
The first piece we'll talk about is AWS Batch. If you're not familiar, AWS Batch is our managed service for HPC orchestration for containerized jobs. When you think about Batch, think about containerized jobs at massive scalability. When we say massive scalability, what do we mean?
This is a chart from a re:Invent talk a few years ago. It's a virtual screening run we did with Dana Farber on Batch. What you see on the x-axis here is time in hours, and the y-axis is number of vCPUs. We were able to scale up to 2.2 million vCPUs and back down to zero over the course of 4 hours, and that was done on Batch. So again, Batch is for containerized jobs at massive scalability.
We talked about Batch. We know that we ran these jobs on FPGA-based instances. Two other key components for this testing and this work is Illumina DRAGEN and Seqera Nextflow. DRAGEN is a set of FPGA-accelerated tools and pipelines. It's commercial software published by Illumina. It allows you to run in 35 minutes the same pipeline that would take over 8 hours with commonly used open source tools on x86 hardware, and it's available on AWS as a marketplace machine image.
For Seqera, for folks who aren't familiar, Seqera Nextflow is an open source workflow orchestrator. I'll talk through in the demo how it integrates with Batch, but it has advanced container support for reproducible workflows. As Sean mentioned, that's really important for not only AstraZeneca, but also for many of our pharma customers and just scientific customers in general. We also use Seqera Platform, so that's a really nice web UI that they have for submitting and monitoring and managing jobs, and I'll show you that in the demo as well.
From an architecture standpoint, within AWS Batch there is a construct called compute environments. A compute environment is where you specify the compute that you want your jobs to have access to; this is where we specify that we want the FPGA instance types, specifically those F2.6xlarge instances. We have two compute environments because, as Sean mentioned, we are running two different versions of DRAGEN: exomes run on 4.3.6, so that compute environment uses a 4.3.6 AMI, and genomes run on 3.7.8 with a 3.7.8 AMI. We then have a Batch queue mapped to the exome compute environment and another queue mapped to the genome compute environment.
From your local machine, if you're running it this way with Nextflow, you can just submit a single command: nextflow run, pointing to the run script, in this case main.nf. In Nextflow, your top-level pipeline is defined in main.nf, and then you specify some parameters that you want to run with. In our case, we have a flag for whole exomes that tells Nextflow to submit to the Batch whole-exome queue, so that those jobs land on the 4.3.6 AMI we need to run on. We also specify -with-tower and -with-report, which send metadata to the Seqera web UI that I mentioned. So I just submit this single command.
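To make that concrete, a submission sketch is below. Only main.nf and the -with-tower and -with-report flags come from the talk; the --seq_type parameter name, the batch profile name, and the queue naming are illustrative assumptions.

```bash
# Hypothetical submission command; parameter and profile names are assumed.
# --seq_type picks the Batch queue (wes -> exome queue on the DRAGEN 4.3.6 AMI).
nextflow run main.nf \
    --seq_type wes \
    -profile batch \
    -with-tower \
    -with-report
```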
Nextflow is actually going to create an AWS Batch job definition and job, and it's going to submit that job to the queue as specified with the flag. On the back end, Batch is going to dynamically provision an EC2 instance in my account. The data sets are staged in S3, so once the instance is provisioned, it pulls a container down from the container registry. Input data sets are copied to that instance, the pipeline runs, and the output files are first copied back to S3, and then the instance is dynamically terminated. So it's all pay-as-you-go pricing. This type of automation and dynamic provisioning and termination is nice when you're running one pipeline at a time, and it's really important when you're running tens of thousands of samples as they are at AstraZeneca.
Live Demo Walkthrough and Key Learnings from the Implementation
In parallel, we're sending metadata to the Seqera Platform Launchpad so that we can monitor jobs through that UI. When I want to submit a genome sample, I just change the flag to type WGS, and again it just submits to that queue. Now we'll go through a quick demo video to show what this looks like in action. From my local terminal, I have my single command nextflow run main.nf. I'm going to specify that I want to submit it to batch.
We're just going to go over to the console and confirm there are no FPGA instances currently in the console in my account. I'm going to submit the job. Nextflow is going to output some information about the run first, my Nextflow version, and then notably it's going to output this unique string identifier, in this case stoic_nobel. When we go over to the web UI, this is the Seqera Launchpad web UI. There are no jobs. I hit refresh and you'll see the stoic_nobel tagged job running. When we click on it, this is just a quick snapshot of some of the metadata you can see for the job. It shows you the container that the job is running based on. This is really helpful for that reproducibility component.
It also shows you, for example, the paths to your input and output data sets for that job. Again, this is all just built in with Seqera and Nextflow functionality. If I go over to Batch in the console, I can see there's a Batch job submitted. It's in a runnable state, which means that the resources are being provisioned on the back end. The instance is still not there. It's going to take 2 minutes to provision. In the meantime, we can go over and look at what these run scripts look like.
There are two main scripts: main.nf and nextflow.config. main.nf is where you actually call DRAGEN from. We specify the parameters that we want to run the pipeline with, including the input and output data set paths. Then, within the script section, you can see where we call DRAGEN and specify the tools that we want to run. main.nf also points to some key reference files; in this case, these come from Genome in a Bottle, which is an open data set.
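For orientation, here is a minimal sketch of what a main.nf along these lines could look like. It is not AstraZeneca's production pipeline: the parameter names, S3 paths, and the specific DRAGEN flags are illustrative, and it assumes the dragen binary is present on the instance's AMI.

```nextflow
// Minimal illustrative main.nf: one process that runs DRAGEN on a paired-end sample.
nextflow.enable.dsl = 2

// Hypothetical parameters; a real run would point at the staged S3 data sets.
params.fastq_r1 = 's3://example-bucket/sample_R1.fastq.gz'
params.fastq_r2 = 's3://example-bucket/sample_R2.fastq.gz'
params.ref_dir  = 's3://example-bucket/dragen-reference/'  // DRAGEN hash-table reference
params.prefix   = 'sample01'

process DRAGEN_CALL {
    input:
    path r1
    path r2
    path ref

    output:
    path 'out/*'

    script:
    """
    mkdir -p out
    dragen -f \\
        -r ${ref} \\
        -1 ${r1} \\
        -2 ${r2} \\
        --output-directory out \\
        --output-file-prefix ${params.prefix} \\
        --enable-map-align true \\
        --enable-variant-caller true
    """
}

workflow {
    DRAGEN_CALL(file(params.fastq_r1), file(params.fastq_r2), file(params.ref_dir))
}
```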
nextflow.config is where you specify the Batch infrastructure components: these are the configurations I want to submit to Batch with. You can have different profiles, so if you want to run your pipelines on Batch versus a different orchestrator, or even on premises, you can configure all of that in nextflow.config. This is also where we set up the logic for that parameter flag: if I say WES, submit to the Batch queue for exomes (a sketch of this routing follows below). Now if we hit refresh, we can see the EC2 instance was provisioned dynamically just after submitting the job. Batch has a couple of different back ends. This run uses ECS, our Elastic Container Service, which provisions EC2 instances in your account; if you used a serverless option like Fargate, it would take seconds to actually start the job. So there are different options for the back end, and Kubernetes is also supported as a back end for Batch.
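Pulling those pieces together, a sketch of that routing logic in nextflow.config might look like the following. The queue names, bucket, and region are placeholders, and the seq_type parameter is assumed; the exome/genome queue split and the per-queue DRAGEN AMI versions are the parts described in the talk.

```nextflow
// Illustrative nextflow.config: route jobs to the AWS Batch queue whose
// compute environment carries the AMI with the matching DRAGEN version.
params.seq_type = 'wes'  // 'wes' (exome, DRAGEN 4.3.6) or 'wgs' (genome, DRAGEN 3.7.8)

profiles {
    batch {
        process.executor = 'awsbatch'
        process.queue = params.seq_type == 'wes' ? 'dragen-wes-queue' : 'dragen-wgs-queue'
        workDir = 's3://example-bucket/nextflow-work/'  // Batch stages inputs/outputs via S3
        aws.region = 'us-east-1'
        aws.batch.cliPath = '/usr/local/bin/aws'  // AWS CLI location on the custom AMI
    }
}
```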
So in summary, that is a quick overview of how these pipelines were executed and of the dynamic provisioning of resources. It's nifty tooling if you're doing testing, and it's crucial tooling if you're doing production runs at scale with tens of thousands of samples, as AstraZeneca does. Next we'll cover a few learnings from this engagement, and I'll hand it back to Sean.
Thank you, Marissa. I have two lessons, or two points, I wanted to pass on, which you probably know already, especially since anyone coming to this session is probably in the genomics or DRAGEN world. First, equivalence testing builds trust with the scientists, or with whoever the end users are. It's really important that we not only show this is faster and costs less, but that we have exact equivalence. That's what builds trust with the end users.
The second point is that it's very important to keep data and compute in the same region. There are regulatory and data residency reasons to do that, and there are also egress costs: you don't want to incur a lot of data egress or S3 transfer charges. In my view, you could easily apply the architecture that Marissa demonstrated to a multi-region setup: add a flag at the beginning indicating where your data is, so if your data is in Virginia, your processing starts in Virginia as well. I think the architecture that was demonstrated facilitates that.
From the AWS side, we have a couple of learnings. If you go through the Illumina DRAGEN documentation for running these pipelines, you'll see a recommendation to set up a 2-terabyte RAID 0 configuration using four 500-gigabyte EBS (Elastic Block Store) volumes. We did some comparison testing of that versus local NVMe storage and found the two are very similar, almost exactly the same from a performance standpoint.
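For reference, assembling that kind of scratch volume on the instance takes only a few commands. This is a hypothetical sketch: the NVMe device names are assumed (they vary by instance type and attachment order), and the mount point is arbitrary.

```bash
# Hypothetical sketch: stripe four attached 500 GB EBS volumes into a single
# 2 TB RAID 0 array for DRAGEN scratch space, then format and mount it.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /scratch
sudo mount /dev/md0 /scratch
```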
Talking to Illumina, the guidance is essentially to use the local NVMe storage. There's almost 1 terabyte of local NVMe storage on the F2.6xlarge instance type, so if your data set fits within that capacity, use the local NVMe; if it's larger, set up the EBS volumes. We also found the publicly available guidance helpful, and in a nod to that, and to contribute back, there's a blog post coming out in the next couple of weeks with all the detail on the benchmarking and on the concordance testing performed by AstraZeneca.
I want to thank Sean for the collaboration. It's been a joy, and thank you all for joining the session. Please fill out the survey in the mobile app.
This article is entirely auto-generated using Amazon Bedrock.