Chillar Anand

Posted on Aug 23, 2024 • Originally published at avilpage.com on Jul 28, 2024

Mastering Kraken2 - Part 1 - Initial Runs

#devops #kraken2 #metagenomics

Mastering Kraken2

Part 1 - Initial Runs (this post)

Part 2 - Performance Optimisation

Part 3 - Custom Indices (upcoming)

Introduction

Kraken2¹ is widely used for metagenomics taxonomic classification, and it has pre-built indexes for many organisms. In this series, we will learn

How to set up kraken2, download pre-built indices
Run kraken2 (8GB RAM) at ~0.19 Mbp/m (million base pairs per minute)
Learn various ways to speed up the classification process
Run kraken2 (128GB RAM) at ~1200 Mbp/m
Build custom indices

Installation

Let's start with an 8 GB RAM machine. We can install kraken2 using the install_kraken2.sh script as per the manual².

$ git clone https://github.com/DerrickWood/kraken2
$ cd kraken2
$ ./install_kraken2.sh /usr/local/bin
# ensure kraken2 is in the PATH
$ export PATH=$PATH:/usr/local/bin

If you already have conda installed, you can install kraken2 from conda as well.

$ conda install -c bioconda kraken2

Download pre-built indices

Building kraken2 indices take a lot of time and resources. For now, let's download and use the pre-built indices. In the final post, we will learn how to build the indices.

Genomic Index Zone³ provides pre-built indices for kraken2. Let's download the standard database. It contains Refeq archaea, bacteria, viral, plasmid, human1, & UniVec_Core.

$ wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240605.tar.gz
$ mkdir k2_standard
$ tar -xvf k2_standard_20240605.tar.gz -C k2_standard

The extracted directory contains three files - hash.k2d, opts.k2d, taxo.k2d which are the kraken2 database files.

$ ls -l *.k2d
.rw-r--r-- 83G anand 13 Jul 12:34 hash.k2d
.rw-r--r-- 64 anand 13 Jul 12:34 opts.k2d
.rw-r--r-- 4.0M anand 13 Jul 12:34 taxo.k2d

Classification

To run the taxonomic classification, let's use ERR10359977 human gut meta genome from NCBI SRA.

$ wget https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/077/ERR10359977/ERR10359977.fastq.gz
$ kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz > output.txt

By default, the machine I have used has 8GB RAM and an additioinal 8GB swap. Since kraken2 needs entire db(~80GB) in memory, when the process tries to consume more than 16GB memory, the kernel will kill the process.

$ time kraken2 --db k2_standard --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz > output.txt
Loading database information...Command terminated by signal 9
0.02user 275.83system 8:17.43elapsed 55%CPU

To prevent this, let's increase the swap space to 128 GB.

# Create an empty swapfile of 128GB
sudo dd if=/dev/zero of=/swapfile bs=1G count=128

# Turn swap off - It might take several minutes
sudo swapoff -a

# Set the permissions for swapfile
sudo chmod 0600 /swapfile

# make it a swap area
sudo mkswap /swapfile  

# Turn the swap on
sudo swapon /swapfile

We can time the classification process using the time command.

$ time kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz > output.txt

If you have a machine with large RAM, the same scenario can be simulated using systemd-run. This will limit the memory usage of kraken2 to 6.5GB.

$ time systemd-run --scope -p MemoryMax=6.5G --user time kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz > output.txt

Depending on the CPU performance, this will take around ~40 minutes to complete.

Loading database information... done.
95064 sequences (14.35 Mbp) processed in 1026.994s (5.6 Kseq/m, 0.84 Mbp/m).
  94939 sequences classified (99.87%)
  125 sequences unclassified (0.13%)
  4.24user 658.68system 38:26.78elapsed 28%CPU

If we try gut WGS(Whole Genome Sequence) sample like SRR6915097 which contains ~3.3 Gbp, it will take weeks to complete.

$ time systemd-run --scope -p MemoryMax=6G --user time kraken2 --db k2_standard --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz > output.txt

I tried running this on 8 GB machine. Even after 10 days, it processed only 10% of the data.

If we have to process a large number of such samples, it takes months and this is not a practical solution.

Conclusion

In this post, we ran kraken2 on an 8GB machine and learned that it is not feasible to run kraken2 on large samples.

In the next post, we will learn how to speed up the classification process and run classification at 1200 Mbp/m.

Next : Part 2 - Performance Optimisation

DEV Community

Mastering Kraken2 - Part 1 - Initial Runs

Mastering Kraken2

Introduction

Installation

Download pre-built indices

Classification

Conclusion

Top comments (0)

Read next

Creating an Automation Pipeline with AWS CodeBuild for Robot Framework

It’s cybersecurity’s kryptonite: Why are you still holding it?

Automated Load Testing in CI/CD Pipelines: Lessons from a Custom Bash Script

PHP 8.4 is here ⚡️upgrade your projects today!