Overview
This blog walks you through how a simple DNA alignment project can evolve across three labs of cloud-powered improvement:
- [Lab 1] Local Setup: Everything runs locally; manual, limited, and prone to errors.
- [Lab 2] AWS Lift & Shift (“Pa-cloud na ‘yan!” / “Move it to the cloud!”): You’ll migrate your script to an EC2 instance and use S3 for storage.
- [Lab 3] Serverless Alignment (“Walang tulog si Lambda” / “Lambda doesn’t sleep”): You’ll build a fully serverless architecture using Lambda, S3, and DynamoDB; no servers to manage.
Scenario
You’re part of Project GenomePH, a small research group at a state university in the Philippines studying genetic markers for diseases among Filipinos.
Your current team setup consists of:
- A borrowed, legacy lab PC that malfunctions when it senses your anxiety
- A few external hard drives owned by some of your members
- Spreadsheet software used to track sequence file names (without versioning)
- A weekly round of “Wait, sino may latest copy ng FASTA file?” (“Wait, who has the latest copy of the FASTA file?”)
You are their new recruit, and you didn’t expect this chaotic scenario! Modernizing this setup would help, but you are all still students who want to finish your research on time.
In short, you will need the following:
- Cost-Optimized architecture (because you are all still students!)
- High-Performing architecture (so no need to wait for hours hoping the process won’t fail)
- Resilient architecture (no need to feel anxious every time you run your sequencing script)
- Secure architecture (of course, it’s best practice to ensure security of your analyses!)
Note:
This lab is a situational example designed for learning purposes only.
Real-world labs typically rely on established bioinformatics tools such as BLAST, BWA, or Bowtie, often orchestrated through HPC clusters or specialized workflow managers. The intent here is to simulate the process conceptually to demonstrate how cloud architecture can scale and automate scientific workloads.
For this POC, we use a simple tech stack: a Python alignment script plus the AWS services you’ll meet along the way (EC2, S3, Lambda, and DynamoDB).
To give you a head start, I have prepared a repository for the labs. You can access the GitHub repository here.
Lab 1: Simple deployment [without cloud] - “Pa-demo muna!” (Demo first!)
Goal: Run your DNA alignment app locally to simulate your current scenario in your lab.
To Do:
- Clone the repo.
- Use a Python script that aligns two sample DNA sequences (e.g., with the Needleman-Wunsch or Smith-Waterman algorithm; see the sketch after this list).
- Save results locally as .txt or .csv.
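To make this concrete, here is a minimal sketch of such a script. This is an illustrative pure-Python Needleman-Wunsch scorer, not necessarily the exact code in the repo; the scoring values, sample sequences, and output file name are my own assumptions.

```python
# nw_align.py - minimal Needleman-Wunsch sketch (illustrative only)

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score aligning the prefix a[:i] with the prefix b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap          # a[:i] aligned against gaps
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap          # gaps aligned against b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

if __name__ == "__main__":
    score = needleman_wunsch("GATTACA", "GCATGCT")
    with open("result.txt", "w") as f:          # save results locally, as in Lab 1
        f.write(f"alignment score: {score}\n")
```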
This is all well and good for small research teams, and testing locally first makes sense, but you’ll soon realize the following:
- It only works if your laptop is online and running
- Large datasets slow everything down
- Files are stored locally (at risk of being lost!)
Also, remember that the research lab has only one lab PC? With large FASTA files, analyzing sequences will take even longer!
How about the cost?
AWS Bill: $0.00
True Cost (Total Cost of Ownership): This is the most expensive model.
- Capital Expenditure (CapEx): The lab PC itself is a large, upfront cost.
- Risk: Data loss from a single failed hard drive or a malfunctioning PC means losing weeks or months of research.
- Opportunity Cost: A slow, manual process that is "prone to errors" means researchers are spending time managing IT instead of doing science.
Lab 2: AWS Lift & Shift - “Pa-cloud na ‘yan!” (Move it to the cloud!)
Goal: Host the same app on an EC2 instance and store the results in an S3 bucket.
This is where we start involving the cloud. For now, we can rely on the AWS Free Tier and use t3.micro as our instance size. Be careful with larger sizes; you will be billed while the instance is running!
Before diving in, let’s understand the key cloud concepts behind this setup.
- Lift & Shift Concept
- What is an EC2 instance, and which one should you use?
- What is an S3 bucket?
- What is an IAM role?
The lift-and-shift concept is the simplest form of cloud migration. You will be “lifting” your app from your device and “shifting” it to the cloud without changing much of the code. You can read more here!
Why should we do this? Because benefits like better uptime, easier scaling, and remote access become available while you keep using the same app you already know.
How about an EC2 instance? Amazon EC2 (Elastic Compute Cloud) is basically a virtual computer in an AWS data center. It’s like renting a PC or a powerful virtual server in the cloud, and you can customize it according to your use case. AWS manages the underlying physical hardware, and you manage the virtual machine, i.e., the instance.
Things to note about Amazon EC2:
- You can SSH into it, install Python, Biopython, and your alignment script.
- Choose an instance type based on your needs. In this lab, we’ll use t3.micro (Free Tier eligible!).
- You’re billed only while it’s running, so remember to stop it when you’re done.
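As a small insurance policy against surprise bills, you can even stop the instance from a script with boto3. A hedged one-liner; the instance ID below is a placeholder for your own.

```python
# stop_when_done.py - stop the lab instance so Free Tier hours last longer
import boto3

ec2 = boto3.client("ec2")
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID
```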
Next, we have Amazon S3 (Simple Storage Service). This is where you’ll store your files: input FASTA files, output results, logs, anything.
Things to note about Amazon S3:
- You can upload/download data anytime.
- It’s highly durable (your files are stored across multiple facilities).
- It’s pay-as-you-go and perfect for scientific data.
- You can enable bucket versioning, which keeps older versions of your objects (no more guessing who has the latest FASTA file!).
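To make that concrete: once your alignment finishes on the instance, pushing files to S3 takes only a few lines of boto3. A minimal sketch; the bucket name and keys are placeholders for whatever you set up in this lab.

```python
# upload_results.py - minimal boto3 sketch (bucket/key names are placeholders)
import boto3

s3 = boto3.client("s3")  # on EC2, credentials come from the instance's IAM role
BUCKET = "genomeph-lab2-bucket"  # hypothetical bucket name

# Upload an input FASTA file and the alignment result
s3.upload_file("sample1.fasta", BUCKET, "inputs/sample1.fasta")
s3.upload_file("output/result.txt", BUCKET, "outputs/result.txt")

# List the bucket contents to confirm the uploads
for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"])
```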
Lastly, think about the security aspect as well! We have the IAM (Identity and Access Management) service for this. For this lab, the key piece is the IAM role: your EC2 instance assumes a role to securely access S3, with no credentials stored on the machine. But of course, let’s tackle the other components of IAM to understand it further.
IAM Components are:
- Users - individual identities for people or apps; they can log in with their own credentials.
- Groups - collections of users that share the same permissions.
- Roles - temporary permissions that can be assumed by users, services, or applications.
- Policies - rules that define which actions are allowed or denied.
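To tie these together, here is a hedged sketch of the kind of inline policy you might attach to the instance’s role so it can reach your lab bucket. The role, policy, and bucket names are hypothetical, and it assumes the role already exists. (You can do the same from the IAM console if you prefer clicking over scripting.)

```python
# attach_s3_policy.py - sketch of granting the EC2 role access to the lab bucket
import json
import boto3

policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        "Resource": [
            "arn:aws:s3:::genomeph-lab2-bucket",      # hypothetical bucket
            "arn:aws:s3:::genomeph-lab2-bucket/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="GenomePH-EC2-Role",     # hypothetical role attached to the instance
    PolicyName="S3LabAccess",
    PolicyDocument=json.dumps(policy_doc),
)
```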
Furthermore, for this lab, you can use the default VPC (Virtual Private Cloud) that AWS provides:
- It already comes with subnets, an internet gateway, and route tables.
- Security Groups act as virtual firewalls to control traffic (like allowing SSH from your IP).
However, ideally you should create a custom VPC for production or larger projects. Read more here.
What happened here?
You’ll run your DNA script on an EC2 machine instead of the lab computer. Once it finishes aligning sequences, the result file will be uploaded to S3 for safekeeping.
Let’s compare this to Lab 1 and see where it has improved.
Now for the cost:
AWS Bill (Estimated): $0.00 (if managed within the AWS Free Tier)
Cost Model: Pay-as-you-go. You pay only for what you use, reducing idle costs compared to the fixed cost of the lab PC.
Service Breakdown:
- Amazon EC2: We used a t3.micro instance, which is eligible for the AWS Free Tier (750 hours per month for the first 12 months). As long as you stop the instance when you're done, you won't be billed for compute.
- Amazon S3: The Free Tier includes 5 GB of standard storage. Our tiny FASTA files are mere kilobytes, so storage costs are effectively zero. You pay a very small amount for data transfer (per GB), but it will be fractions of a cent.
Cost Risk: The main risk is forgetting to stop the EC2 instance. If you leave it running 24/7, you will be billed after you exhaust the 750 free hours.
After you SSH into the EC2 instance, this is how it looks:

Make sure the necessary files are copied to the instance, along with the required packages (in case user_data.sh didn’t install them yet).
Upload two sample FASTA files to your S3 bucket through the console:

Run the script, and this is the result (I added an output folder for the results):



Open that result with Notepad (or any text editor) and you are now done with Lab 2!
Lab 3: Serverless Alignment - “Walang tulog si Lambda” (Lambda doesn’t sleep)
Goal: Turn your DNA alignment script into a fully serverless workflow using AWS Lambda, DynamoDB, and S3. You don’t have any servers to manage, scaling is automatic, and you pay only for usage.
Let’s run through the services and concepts used!
- Amazon S3
- AWS Lambda
- Amazon DynamoDB
You already know about Amazon S3, so let’s tackle the other ones which will help you do this lab in a serverless way.
Things To Do:
- Store DNA input files (like your FASTA files) in an S3 bucket.
When a new file is uploaded, it will automatically trigger your workflow.
Why: S3 is cheap, durable, and perfect for file-based pipelines.
- Run your alignment logic using AWS Lambda.
Lambda gets triggered by the S3 event.
It reads the input file, performs the DNA sequence alignment (sample logic provided in Python; see the handler sketch after this list), and produces an output.
Why: Lambda scales automatically; even if you upload 100 files, AWS will run them in parallel without you managing any EC2 instances.
- Save alignment results (e.g., JSON or CSV) back to S3 under an /outputs/ folder.
Why: Keeps all results organized, durable, and accessible for downstream analysis or sharing.
- Record job status and metadata in DynamoDB (e.g., filename, alignment score, timestamp).
Why: DynamoDB acts as your lightweight job tracker, so you always know what’s been processed and can monitor progress or build dashboards later.
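Putting the pieces together, here is a hedged sketch of what the S3-triggered Lambda handler can look like. The table name, output prefix, and FASTA parsing are my own assumptions, and the scoring line is a stand-in for where the repo’s version calls parasail.

```python
# lambda_function.py - sketch of the S3-triggered alignment handler
import json
import time
import urllib.parse

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("AlignmentJobs")  # hypothetical table name

def lambda_handler(event, context):
    # 1. The S3 event tells us which file was uploaded
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # 2. Read the FASTA file and pull out the two sequences
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    seqs = ["".join(chunk.splitlines()[1:]) for chunk in body.split(">") if chunk]

    # 3. Align the pair (placeholder identity score; swap in parasail here)
    score = sum(x == y for x, y in zip(seqs[0], seqs[1]))

    # 4. Save the result to S3 and record the metadata in DynamoDB
    out_key = "outputs/" + key.rsplit("/", 1)[-1] + ".json"
    s3.put_object(Bucket=bucket, Key=out_key,
                  Body=json.dumps({"file": key, "score": score}).encode())
    table.put_item(Item={"filename": key, "score": score,
                         "timestamp": int(time.time())})
    return {"statusCode": 200, "body": f"scored {key}: {score}"}
```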
In this lab, you have transformed your cloud DNA alignment setup into a fully managed, event-driven architecture. No servers, no maintenance, no idle costs anymore!
But the cost?
AWS Bill (Estimated): $0.00 (and almost guaranteed to stay that way)
Cost Model: Pay-per-request. This is the ultimate in cost efficiency. You are only charged when an alignment executes. There is no idle compute cost whatsoever.
Service Breakdown:
- AWS Lambda: Has a perpetual Free Tier of 1 million free requests and 400,000 GB-seconds of compute time per month. For a student project, it is practically impossible to exceed this.
- Amazon DynamoDB: Also has a perpetual Free Tier (25 GB of storage, 25 Read Capacity Units, 25 Write Capacity Units). This is more than enough to store metadata for millions of alignment jobs.
- Amazon S3: Same as Lab 2.
Cost Risk: Virtually zero. This architecture can handle zero alignments or 1,000 alignments in parallel, and its cost scales smoothly from $0. This is the ideal model for event-driven workloads like a research project.
Upload one FASTA file containing the two sequences to be compared:

After the Lambda function is invoked, you will have a table in DynamoDB. Check its contents and you will see a new item: the filename with its corresponding alignment score.

What Happened to the Detailed Report?
In Lab 2, our EC2 instance ran the needle program, which generated a detailed, human-readable .txt report. This was great, but it had a big drawback: if we wanted to find the score for one file, we'd have to download and open every single text file. This DOES NOT scale.
The goal of Lab 3 was to build a high-performance, scalable, event-driven pipeline. In this serverless architecture, we're now treating the alignment result as a piece of data to be indexed.
Our Lambda function was designed to do two things:
- Run the alignment using parasail to get the final score (see the snippet after this list).
- Write this score (the most important piece of metadata) to a DynamoDB table.
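For reference, the core parasail call is only a few lines. A hedged sketch: the gap penalties and substitution matrix below are illustrative choices, not necessarily the ones in the repo.

```python
# parasail_score.py - illustrative global alignment with parasail
import parasail

matrix = parasail.matrix_create("ACGT", 2, -1)   # match = +2, mismatch = -1
result = parasail.nw("GATTACA", "GATCACA", 10, 1, matrix)  # gap open = 10, extend = 1
print(result.score)
```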
So, the "report" for Lab 3 is the new item in DynamoDB.
By storing the score in DynamoDB, we can instantly find the score for any filename or build a dashboard that shows all jobs that have been processed.
In a real-world application, you could have the Lambda function do both: save the score to DynamoDB (for the dashboard) and save a detailed report to an "output" S3 folder (for scientific review). But for this project, getting the score into DynamoDB proves the entire serverless architecture is now a success! :>
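As a quick illustration of that instant lookup, here's a hedged sketch that follows the same hypothetical table schema as the handler above:

```python
# get_score.py - look up an alignment score by filename in DynamoDB
import boto3

table = boto3.resource("dynamodb").Table("AlignmentJobs")  # hypothetical table name
item = table.get_item(Key={"filename": "inputs/sample.fasta"}).get("Item")
print(item["score"] if item else "not processed yet")
```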
A Tip for Lab 3: Solving the parasail Dependency Problem
When I moved from Lab 2 (EC2) to Lab 3 (Serverless Lambda), I hit a major roadblock. The emboss package was easy to install on an EC2 instance with apt-get, but you can't do that in AWS Lambda.
The modern Python replacement, parasail, is a powerful library, but it's not just pure Python. It contains C code that must be compiled.
The Problem: My laptop (Windows/macOS) runs a different operating system than AWS Lambda (which runs on Amazon Linux). If I ran pip install parasail on my machine and zipped the folder, I'd be zipping binaries compiled for Windows. When I deployed it, Lambda would fail because it couldn't execute binaries built for the wrong platform; this is what caused my No module named 'parasail' errors. I was stressed throughout the process and figured, why not try Docker here?
The Solution: Use Docker as a “Build Environment”. I used Docker as a disposable, clean-room build environment.
I used an official AWS Docker image that perfectly matches the Lambda runtime: public.ecr.aws/lambda/python:3.9.
By running a single Docker command (sketched after this list), I told it to:
- Start a temporary Amazon Linux container.
- Mount my local lambda_function folder into the container's /var/task directory.
- Run pip install -r /var/task/requirements.txt -t /var/task/ inside the container.
- Save the resulting compiled libraries (parasail, numpy, etc.) back to my local folder.
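Here is a hedged reconstruction of that build step, wrapped in Python so this post sticks to one language; you can just as well run the docker command directly in your shell. The folder layout is an assumption based on my project.

```python
# build_deps.py - run pip inside the Lambda base image to compile dependencies
import os
import subprocess

project = os.path.join(os.getcwd(), "lambda_function")  # folder with requirements.txt

subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{project}:/var/task",            # mount the local folder
        "--entrypoint", "pip",                   # run pip instead of the runtime
        "public.ecr.aws/lambda/python:3.9",      # image matching the Lambda runtime
        "install", "-r", "/var/task/requirements.txt",
        "-t", "/var/task/",                      # install into the mounted folder
    ],
    check=True,
)
```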
This gave me a lambda_function folder filled with libraries that were perfectly compiled for AWS Lambda.
The Takeaway: This is a perfect example of using Docker as a build-time utility to solve dependency issues. A container was used to ensure our build was repeatable and correct for our target platform.
And you’re done!
You have learned how to move your thinking from simply “it works” to “it works and is sustainable”. To see this in a more practical way, you can check the following bioinformatics/genomics systems that use AWS:
- ElasticBLAST - A cloud-based wrapper / orchestration service for BLAST which helps accelerate sequence searches using the cloud
- GATK on Cloud - Genome Analysis Toolkit pipelines for variant calling, etc.
- AWS HealthOmics - of course, AWS’ own genomics and health platform.
Hold on a moment…
Our Lab 3 serverless pipeline is a fantastic, cost-effective solution for fast, event-driven tasks. But what happens when we analyze a whole genome? An alignment job can take hours, not seconds.
Our Lambda function will fail. It has a 15-minute timeout.
This reveals something we should take note of in large-scale cloud computing: we must separate orchestration from computation.
We can package our heavy tools (BWA, GATK, etc.) into a portable container that can run for hours and has all the right dependencies.
A service like AWS Batch is built to run these containers. But the most powerful, flexible, and cloud-agnostic platform for managing containers at scale is Kubernetes.
In my next project, I'll dive into that world. We'll build a production-grade, container-based pipeline on Amazon EKS (Elastic Kubernetes Service). We can also explore the existing workflow managers such as Nextflow or Snakemake. We’ll see, so stay tuned! Beyond the Vinculum! :>