<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chillar Anand</title>
    <description>The latest articles on DEV Community by Chillar Anand (@chillaranand).</description>
    <link>https://dev.to/chillaranand</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F18842%2F46af4a1c-bedc-43c2-8e22-40529b6092ad.jpg</url>
      <title>DEV Community: Chillar Anand</title>
      <link>https://dev.to/chillaranand</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chillaranand"/>
    <language>en</language>
    <item>
      <title>Free DockerHub Alternative - ECR Public Gallery</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sun, 09 Feb 2025 16:08:34 +0000</pubDate>
      <link>https://dev.to/chillaranand/free-dockerhub-alternative-ecr-public-gallery-b1n</link>
      <guid>https://dev.to/chillaranand/free-dockerhub-alternative-ecr-public-gallery-b1n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffarcwcef81jp62cd9oen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffarcwcef81jp62cd9oen.png" alt="docker-rate-limits" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DockerHub started rate limiting&lt;sup id="fnref:rate"&gt;&lt;a href="https://avilpage.com/2025/02/free-dockerhub-alternative-ecr-gallery.html#fn:rate" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt; anonymous docker pulls. When testing out a new CI/CD setup, I hit the rate limit and had to wait for an hour to pull the image. This was a good time to look for alternatives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gallery.ecr.aws/" rel="noopener noreferrer"&gt;AWS ECR Public Gallery&lt;/a&gt;&lt;sup id="fnref:ecr"&gt;&lt;a href="https://avilpage.com/2025/02/free-dockerhub-alternative-ecr-gallery.html#fn:ecr" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt; is a good alternative to DockerHub as of today(2025 Feb). It is free and does not have rate limits even for anonymous users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq5hibudhz0wkmbdc501.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq5hibudhz0wkmbdc501.png" alt="public-ecr-gallery" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we find the required image from the gallery, we can simply change the image name in the &lt;code&gt;docker pull&lt;/code&gt; command to pull the image from ECR Gallery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull public.ecr.aws/ubuntu/ubuntu

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;Dockerfile&lt;/code&gt;, we can use the image from ECR Gallery as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM public.ecr.aws/ubuntu/ubuntu

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a quick way to avoid DockerHub rate limits.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.docker.com/docker-hub/usage/" rel="noopener noreferrer"&gt;DockerHub Limits&lt;/a&gt; &lt;a href="https://avilpage.com/2025/02/free-dockerhub-alternative-ecr-gallery.html#fnref:rate" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://gallery.ecr.aws" rel="noopener noreferrer"&gt;AWS ECR Public Gallery&lt;/a&gt; &lt;a href="https://avilpage.com/2025/02/free-dockerhub-alternative-ecr-gallery.html#fnref:ecr" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Postman - Auto Login &amp; Renew OAuth2 Token</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Fri, 31 Jan 2025 17:20:17 +0000</pubDate>
      <link>https://dev.to/chillaranand/postman-auto-login-renew-oauth2-token-1j83</link>
      <guid>https://dev.to/chillaranand/postman-auto-login-renew-oauth2-token-1j83</guid>
      <description>&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;When using Postman to interact with APIs behind OAuth2 authentication, we need to log in and renew the token manually. This can be automated with the following steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set credentials in environment variables&lt;/li&gt;
&lt;li&gt;Create a pre-request script to login and renew the token&lt;/li&gt;
&lt;li&gt;Use the token in the request headers&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Automating Login &amp;amp; Renewal
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var e = pm.environment;
var isSessionExpired = true;

var loginTimestamp = e.get("loginTimestamp");
var expiresInSeconds = pm.environment.get("expiresInSeconds") || 86400;

if (loginTimestamp) {
  var loginDuration = Date.now() - loginTimestamp;
  isSessionExpired = loginDuration &amp;gt;= expiresInSeconds;
}

if (isSessionExpired) {
  pm.sendRequest({
    url: e.get('host') + "/auth/connect/token",
    method: 'POST',
    header: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Accept': 'application/json'
    },
    body: {
        mode: 'urlencoded',
        urlencoded: [
          { key: "username", value: e.get('username') },
          { key: "password", value: e.get('password') },
          { key: "grant\_type", value: "password" },
          { key: "client\_id", value: e.get("client\_id") }
        ]
    }
  }, function (err, res) {
    jsonData = res.json();

    e.set("access\_token", jsonData.access\_token);

    if(res.json().expires\_in){
        expiresInSeconds = res.json().expires\_in \* 1000;
    }
    e.set("expiresInSeconds", expiresInSeconds);
    e.set("loginTimestamp", Date.now())
  });
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can copy this script to the pre-request script of the collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lvavxa5ase8b2acitur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lvavxa5ase8b2acitur.png" alt="Cockpit" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of the script is self-explanatory: it checks whether the session has expired and, if so, sends a request to the token endpoint to get a new token. The token is stored in an environment variable and used in the request headers.&lt;/p&gt;
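&lt;p&gt;To use the token, we can add an &lt;code&gt;Authorization&lt;/code&gt; header at the collection level (or on individual requests). Assuming the &lt;code&gt;access_token&lt;/code&gt; environment variable set by the script above, the header would be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Bearer {{access_token}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;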

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;This is a one-time setup for a Postman collection, and it saves a lot of time in the long run. The script can be modified to handle different grant types and token renewal strategies.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Install Cockpit on Remote Linux VM</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Mon, 30 Dec 2024 22:54:07 +0000</pubDate>
      <link>https://dev.to/chillaranand/install-cockpit-on-remote-linux-vm-3pah</link>
      <guid>https://dev.to/chillaranand/install-cockpit-on-remote-linux-vm-3pah</guid>
      <description>&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcfiif1l9jqmurwd1s41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcfiif1l9jqmurwd1s41.png" alt="Cockpit" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cockpit is an easy-to-use web-based interface (like cPanel) for managing Linux servers. When we want to provide access to non-developers or people who are new to Linux, it is a good idea to get them started with Cockpit. It provides a user-friendly interface to manage services, containers, storage, logs, and more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup
&lt;/h4&gt;

&lt;p&gt;Let's create a new Ubuntu VM and install Cockpit on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
. /etc/os-release
sudo apt install -t ${VERSION_CODENAME}-backports cockpit

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the installation is complete, we can get the public IP of the VM and access the Cockpit web interface running on port 9090.&lt;/p&gt;

&lt;p&gt;It is difficult to remember the public IP of the VM, so let's create a DNS record for it: add an &lt;code&gt;A&lt;/code&gt; record in the DNS settings to point &lt;code&gt;cockpit.avilpage.com&lt;/code&gt; to the public IP of the VM.&lt;/p&gt;
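&lt;p&gt;In zone-file form, the record would look like this (&lt;code&gt;203.0.113.10&lt;/code&gt; is a placeholder for the VM's public IP):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cockpit.avilpage.com.    300    IN    A    203.0.113.10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;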

&lt;h4&gt;
  
  
  Reverse Proxy
&lt;/h4&gt;

&lt;p&gt;Let's set up a reverse proxy to access the Cockpit web interface using a subdomain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install caddy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the below configuration to &lt;code&gt;/etc/caddy/Caddyfile&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cockpit.avilpage.com {
    reverse_proxy localhost:9090
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to add &lt;code&gt;Origins&lt;/code&gt; to the Cockpit configuration at &lt;code&gt;/etc/cockpit/cockpit.conf&lt;/code&gt; to allow requests from the subdomain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WebService]
Origins = https://cockpit.avilpage.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart both services and open &lt;a href="https://cockpit.avilpage.com" rel="noopener noreferrer"&gt;https://cockpit.avilpage.com&lt;/a&gt; in browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl restart cockpit
sudo systemctl restart caddy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;The Cockpit web UI is a great tool for managing Linux servers, even for non-developers. Users can browse and manage logs, services, and more. It also provides a terminal to run commands on the server.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Cube &amp; Cubicle</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Thu, 31 Oct 2024 03:35:37 +0000</pubDate>
      <link>https://dev.to/chillaranand/cube-cubicle-5cde</link>
      <guid>https://dev.to/chillaranand/cube-cubicle-5cde</guid>
      <description>&lt;h4&gt;
  
  
  Rubik's Cube
&lt;/h4&gt;

&lt;p&gt;When I was in college, I was traveling to a friend's place and missed the bus at midnight. The next bus was at 4 AM. Bored while waiting, I found a Rubik's Cube in a shop.&lt;/p&gt;

&lt;p&gt;I scrambled the cube and spent the next 4 hours trying to solve it. I managed to solve one color, but when I tried to solve the next one, the pieces in the previous layer got scrambled again.&lt;/p&gt;

&lt;p&gt;Even after spending a lot of time on it over the next 3 weeks, I couldn't solve it and gave up.&lt;/p&gt;

&lt;p&gt;A couple of years later, when I "learnt" about the internet, I searched and found simple algorithms to solve the cube. Within a few days, I was able to solve it in a minute.&lt;/p&gt;

&lt;h4&gt;
  
  
  Office Cubicles
&lt;/h4&gt;

&lt;p&gt;In the final year of college, placements began. While preparing my résumé, I included "I can solve a Rubik's Cube in a minute" in it.&lt;/p&gt;

&lt;p&gt;During the interview, the interviewer asked me if I could really solve the cube in a minute and told me to bring my cube and show him during the lunch break. I did. Luckily, I got hired.&lt;/p&gt;

&lt;p&gt;Even though I was hired by Wipro, I didn't join. I went to Bangalore and started applying for start-up jobs.&lt;/p&gt;

&lt;p&gt;I went for an interview at a web development company in Malleswaram, Bangalore. The CEO looked at my résumé, took out a cube from his desk. He handed the cube to me, showed an empty cubicle behind me and said, "If you solve the cube in a minute, that cubicle is yours."&lt;/p&gt;

&lt;p&gt;Just by learning the cube, I was able to land a job at an MNC (multinational company) as well as at a startup.&lt;/p&gt;

</description>
      <category>musings</category>
      <category>rubikscube</category>
    </item>
    <item>
      <title>tailscale: Resolving CGNAT (100.x.y.z) Conflicts</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sat, 07 Sep 2024 07:20:05 +0000</pubDate>
      <link>https://dev.to/chillaranand/tailscale-resolving-cgnat-100xyz-conflicts-56lf</link>
      <guid>https://dev.to/chillaranand/tailscale-resolving-cgnat-100xyz-conflicts-56lf</guid>
      <description>&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;In an earlier blog post, I wrote about using tailscale to remotely access any device&lt;sup id="fnref:ap"&gt;&lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fn:ap" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt;. Tailscale uses 100.64.0.0/10 subnet&lt;sup id="fnref:ts100"&gt;&lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fn:ts100" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt; to assign unique IP addresses to each device.&lt;/p&gt;

&lt;p&gt;When a tailscale node joins another campus network&lt;sup id="fnref:pn"&gt;&lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fn:pn" rel="noopener noreferrer"&gt;3&lt;/a&gt;&lt;/sup&gt; (schools, universities, offices) that uses the same subnet, it will face conflicts. Let's see how to resolve this.&lt;/p&gt;

&lt;h4&gt;
  
  
  Private Network
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5w4clfovv83mgb07oa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5w4clfovv83mgb07oa9.png" alt="tailscale dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above scenario, node C1 can connect to C2 &amp;amp; C3, as they are in the same network.&lt;/p&gt;

&lt;p&gt;Once we start tailscale on node C1, it gets a 100.x.y.z IP address from the tailscale subnet. Now, node C1 will no longer be able to connect to nodes C2 &amp;amp; C3.&lt;/p&gt;

&lt;p&gt;To avoid conflicts with the existing network, we can configure tailscale to use a smaller subnet using "ipPool"&lt;sup id="fnref:ip"&gt;&lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fn:ip" rel="noopener noreferrer"&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "acls": [
        "..."
    ],
    "nodeAttrs": [
        {
            "target": [
                "autogroup:admin"
            ],
            "ipPool": [
                "100.100.96.0/20"
            ]
        }
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is configured, tailscale will start assigning IP addresses from the new subnet. Even though IP allocation is now limited to this range, we still can't access nodes in the other subnets due to a bug&lt;sup id="fnref:tsb"&gt;&lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fn:tsb" rel="noopener noreferrer"&gt;5&lt;/a&gt;&lt;/sup&gt; in tailscale.&lt;/p&gt;

&lt;p&gt;As a workaround, we can manually update the iptables rules so that traffic from the campus subnet is no longer dropped.&lt;/p&gt;

&lt;p&gt;Let's look at the iptables rules added by tailscale by stopping it and then starting it again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxvw1gewibgfb8vwer1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxvw1gewibgfb8vwer1.png" alt="tailscale iptables rules" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kyev84haf20fzhp24kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kyev84haf20fzhp24kn.png" alt="tailscale iptables rules" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The highlighted rule drops any incoming packet that doesn't originate from the tailscale0 interface but has a source IP in 100.64.0.0/10 (100.64.0.0 to 100.127.255.255).&lt;/p&gt;

&lt;p&gt;Let's delete this rule and add a new rule to restrict the source IP to 100.100.96.0/20 (100.100.96.1 to 100.100.111.254).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo iptables --delete ts-input --source 100.64.0.0/10 ! -i tailscale0 -j DROP
$ sudo iptables --insert ts-input 3 --source 100.100.96.0/20 ! -i tailscale0 -j DROP

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
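&lt;p&gt;To sanity-check which source addresses the narrower rule matches, here is a pure-bash CIDR containment check (an illustrative sketch; tools like &lt;code&gt;ipcalc&lt;/code&gt; can do the same):&lt;/p&gt;

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Succeed when IP $1 falls inside network $2 with prefix length $3.
# Two addresses share a /N block when they agree after dividing by the block size.
in_cidr() {
  local size=$(( 2 ** (32 - $3) ))
  [ $(( $(ip_to_int "$1") / size )) -eq $(( $(ip_to_int "$2") / size )) ]
}

if in_cidr 100.100.96.5 100.100.96.0 20; then echo "matched by the new rule"; fi
if ! in_cidr 100.64.0.1 100.100.96.0 20; then echo "outside the restricted pool"; fi
```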



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspqisr90v6r4lom90jnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspqisr90v6r4lom90jnh.png" alt="tailscale iptables rules" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;By configuring tailscale to use a smaller subnet, we can avoid conflicts with existing networks. Even though there is a bug in tailscale, we can manually update the iptables rules as a workaround.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://avilpage.com/2023/09/tailscale-remote-ssh-raspberry-pi.html" rel="noopener noreferrer"&gt;tailscale: Remotely access any device&lt;/a&gt; &lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fnref:ap" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tailscale.com/kb/1015/100.x-addresses" rel="noopener noreferrer"&gt;https://tailscale.com/kb/1015/100.x-addresses&lt;/a&gt; &lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fnref:ts100" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Private_network" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Private_network&lt;/a&gt; &lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fnref:pn" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tailscale.com/kb/1304/ip-pool" rel="noopener noreferrer"&gt;https://tailscale.com/kb/1304/ip-pool&lt;/a&gt; &lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fnref:ip" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/tailscale/tailscale/issues/1381" rel="noopener noreferrer"&gt;https://github.com/tailscale/tailscale/issues/1381&lt;/a&gt; &lt;a href="https://avilpage.com/2024/09/tailscale-cgnat-conflicts-resolution.html#fnref:tsb" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>networking</category>
      <category>tailscale</category>
    </item>
    <item>
      <title>Mastering Kraken2 - Part 4 - Build FDA-ARGOS Index</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sat, 24 Aug 2024 09:58:00 +0000</pubDate>
      <link>https://dev.to/chillaranand/mastering-kraken2-part-4-build-fda-argos-index-bfb</link>
      <guid>https://dev.to/chillaranand/mastering-kraken2-part-4-build-fda-argos-index-bfb</guid>
      <description>&lt;h4&gt;
  
  
  Mastering Kraken2
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html" rel="noopener noreferrer"&gt;Part 1 - Initial Runs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html" rel="noopener noreferrer"&gt;Part 2 - Classification Performance Optimisation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html" rel="noopener noreferrer"&gt;Part 3 - Build custom database indices&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html" rel="noopener noreferrer"&gt;Part 4 - Buil FDA-ARGOS index&lt;/a&gt; (this post)&lt;/p&gt;

&lt;p&gt;Part 5 - Regular vs Fast Builds (upcoming)&lt;/p&gt;

&lt;p&gt;Part 6 - Benchmarking (upcoming)&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;In the previous post, we learnt how to build a custom index for Kraken2.&lt;/p&gt;

&lt;p&gt;FDA-ARGOS&lt;sup id="fnref:argos"&gt;&lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fn:argos" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt; is a popular database with quality reference genomes for diagnostic usage. Let's build an index for FDA-ARGOS.&lt;/p&gt;

&lt;h4&gt;
  
  
  FDA-ARGOS Kraken2 Index
&lt;/h4&gt;

&lt;p&gt;The FDA-ARGOS database is available at NCBI&lt;sup id="fnref:ncbi"&gt;&lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fn:ncbi" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt;, from which we can download the assembly details file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cnyxh3r9n5asqpzu92v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cnyxh3r9n5asqpzu92v.png" alt="FDA-ARGOS NCBI" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can extract the accession numbers from the assembly details file and then download the genomes for these accession IDs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep -e "^#" -v PRJNA231221_AssemblyDetails.txt | cut -d$'\t' -f1 &amp;gt; accessions.txt

$ wc accessions.txt
 1428 1428 22848 accessions.txt

$ ncbi-genome-download --section genbank --assembly-accessions accessions.txt --progress-bar bacteria --parallel 40

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It took ~8 minutes to download all the genomes, and the downloaded file size is ~4GB.&lt;/p&gt;

&lt;p&gt;We can use the kraken-db-builder&lt;sup id="fnref:kdb"&gt;&lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fn:kdb" rel="noopener noreferrer"&gt;3&lt;/a&gt;&lt;/sup&gt; tool to build an index from these GenBank genome files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kraken-db-builder needs this to convert gbff to fasta format
$ conda install -c bioconda any2fasta

$ kraken-db-builder --genomes-dir genbank --threads 36 --db-name k2_argos

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It took ~30 minutes to build the index.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;We have built a Kraken2 index for the FDA-ARGOS database on 2024-Aug-24.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ChillarAnand/avilpage.com/tree/master/scripts/kraken2_argos" rel="noopener noreferrer"&gt;FDA-ARGOS Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://drive.google.com/file/d/1PbwriW3i3pkXJMFF5nq9OK_EqrwPiLWr/view" rel="noopener noreferrer"&gt;Kraken2 Gzipped Index file&lt;/a&gt; (gzip size: 2.6GB, index size: 3.8GB, md5sum: 1dd946d2e405dfec35ed3e319e9dfeac)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ChillarAnand/avilpage.com/tree/master/scripts/kraken2_argos" rel="noopener noreferrer"&gt;Kraken2 Inspect file&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next post, we will look at the differences between regular and fast builds.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nature.com/articles/s41467-019-11306-6" rel="noopener noreferrer"&gt;https://www.nature.com/articles/s41467-019-11306-6&lt;/a&gt; &lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fnref:argos" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.ncbi.nlm.nih.gov/bioproject/231221" rel="noopener noreferrer"&gt;https://www.ncbi.nlm.nih.gov/bioproject/231221&lt;/a&gt; &lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fnref:ncbi" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://avilpage.com/kdb.html" rel="noopener noreferrer"&gt;https://avilpage.com/kdb.html&lt;/a&gt; &lt;a href="https://avilpage.com/2024/08/mastering-kraken2-fda-argos-index.html#fnref:kdb" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Midnight Coding for Narendra Modi &amp; Ivanka Trump</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sun, 18 Aug 2024 00:25:43 +0000</pubDate>
      <link>https://dev.to/chillaranand/midnight-coding-for-narendra-modi-ivanka-trump-4pjf</link>
      <guid>https://dev.to/chillaranand/midnight-coding-for-narendra-modi-ivanka-trump-4pjf</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4L89Z6X3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/midnight-modi-trump.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4L89Z6X3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/midnight-modi-trump.png" alt="GES 2017, modi trump mitra" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;In 2017, the GES (Global Entrepreneurship Summit) was held in Hyderabad, India. Narendra Modi (the Prime Minister of India) &amp;amp; Ivanka Trump (daughter of the then US President Donald Trump) were the chief guests.&lt;/p&gt;

&lt;p&gt;At that time, I was part of the Invento team, and we decided to develop a new version of the Mitra robot for the event.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Challenge
&lt;/h4&gt;

&lt;p&gt;We had to develop the new version of the Mitra robot in a short span of time. The entire team worked day and night to meet the deadline.&lt;/p&gt;

&lt;p&gt;We traveled from Bangalore to Hyderabad a few days early to prepare. Before the event, we cleared multiple security checks and did demos for various people.&lt;/p&gt;

&lt;p&gt;A day before the event, around 9 PM, we discovered a critical bug in the software. Because of it, the robot's motors were running at full speed, which was dangerous: if the robot hit someone at full speed, it could cause serious injuries.&lt;/p&gt;

&lt;p&gt;I spent a few hours debugging and even tried rolling back a few versions. Still, I couldn't pinpoint the issue.&lt;/p&gt;

&lt;p&gt;Since we needed only a small set of robot features, we decided to create a new version of the software with just those features. I spent the next few hours creating the new release.&lt;/p&gt;

&lt;p&gt;After that, we spent several more hours testing extensively to make sure there were no bugs in the new version.&lt;/p&gt;

&lt;p&gt;It was almost morning by the time we were done. We quickly went to the hotel to get some rest and be back early for the event.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Mitra robot welcoming Modi &amp;amp; Trump went very well. You can read about Balaji Viswanathan's experience at GES 2017 on Quora&lt;sup id="fnref:quora"&gt;&lt;a href="https://avilpage.com/2024/08/midnight-coding-narendra-modi-ivanka-trump.html#fn:quora" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9-nPcjjB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/midnight-modi-trump-anand.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9-nPcjjB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/midnight-modi-trump-anand.jpg" alt="GES 2017, modi trump mitra anand" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.quora.com/How-was-Balaji-Viswanathans-overall-experience-attending-the-Global-Entrepreneurship-Summit-2017-held-in-Hyderabad-Where-his-Inventos-Mitra-is-launched" rel="noopener noreferrer"&gt;Answer on Quora&lt;/a&gt; &lt;a href="https://avilpage.com/2024/08/midnight-coding-narendra-modi-ivanka-trump.html#fnref:quora" rel="noopener noreferrer"&gt;↩&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>musings</category>
    </item>
    <item>
      <title>How (and when) to use systemd timer instead of cronjob</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Mon, 05 Aug 2024 07:37:50 +0000</pubDate>
      <link>https://dev.to/chillaranand/how-and-when-to-use-systemd-timer-instead-of-cronjob-15d</link>
      <guid>https://dev.to/chillaranand/how-and-when-to-use-systemd-timer-instead-of-cronjob-15d</guid>
      <description>&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* * * * * bash demo.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just a single line of code is sufficient to schedule a cron job. However, there are some scenarios where I find systemd timer more useful than cronjob.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to use systemd timer
&lt;/h4&gt;

&lt;p&gt;We need to create a service file (containing the command to run) and a timer file (containing the schedule).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# demo.service
[Unit]
Description=Demo service

[Service]
ExecStart=bash demo.sh


# demo.timer
[Unit]
Description=Run myscript.service every 1 minutes

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=multi-user.target

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can copy these files to &lt;code&gt;/etc/systemd/system/&lt;/code&gt; and enable the timer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo cp demo.service demo.timer /etc/systemd/system/

$ sudo systemctl daemon-reload

$ sudo systemctl enable --now demo.timer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use &lt;code&gt;systemctl&lt;/code&gt; to see when the task was last executed and when it will run next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo systemctl list-timers --all

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AfCfx-Jk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/systemd-timer-cronjob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AfCfx-Jk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/systemd-timer-cronjob.png" alt="systemd timer" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Singleton - In the above example, let's say &lt;code&gt;demo.sh&lt;/code&gt; takes ~10 minutes to run. With a cron job, after ten minutes we would have up to 10 instances of &lt;code&gt;demo.sh&lt;/code&gt; running concurrently, which is not ideal. A systemd timer ensures only one instance of &lt;code&gt;demo.sh&lt;/code&gt; runs at a time, since the service is not triggered again while it is still active.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On-demand runs - If we want to test out the script/job, systemd lets us run it immediately with the usual &lt;code&gt;systemctl start demo&lt;/code&gt;, without needing to run the script manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timer - With cron, scheduling precision is limited to one minute. A systemd timer can schedule tasks down to &lt;code&gt;second&lt;/code&gt;-level precision.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Timer]
OnCalendar=*-*-* 15:30:15

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to that, we can run tasks based on system events. For example, we can run a script 15 minutes after boot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Timer]
OnBootSec=15min

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
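For comparison, the singleton behaviour described in the use cases above is usually bolted onto plain cron with flock(1). The sketch below is illustrative, not from the original post: the lock file path and the `sleep` stand-in for `demo.sh` are assumptions.

```shell
# Sketch: skipping overlapping cron runs with flock(1).
# In a crontab the entry would look like:
#   * * * * * flock -n /tmp/demo.lock bash demo.sh
# flock -n exits immediately if the lock is already held, so a new run
# is skipped while the previous one is still active -- roughly what a
# systemd timer gives you for free.

lock=$(mktemp)                   # illustrative lock file
flock -n "$lock" sleep 1 &       # first "run" holds the lock for 1 second
sleep 0.2
if flock -n "$lock" true; then   # second "run" tries the same lock
  echo "started"
else
  echo "skipped: previous run still active"
fi
wait
```

Unlike systemd, cron itself never checks whether the previous invocation has finished; flock adds that guarantee externally.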



&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Systemd timer is a powerful tool that can replace cronjob in many scenarios, as it provides more control and flexibility than cronjob. However, cronjob is still a good choice for simple scheduling tasks.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Mastering Kraken2 - Part 3 - Build Custom Database</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Thu, 01 Aug 2024 05:22:30 +0000</pubDate>
      <link>https://dev.to/chillaranand/mastering-kraken2-part-3-build-custom-database-3gkb</link>
      <guid>https://dev.to/chillaranand/mastering-kraken2-part-3-build-custom-database-3gkb</guid>
      <description>&lt;h4&gt;
  
  
  Mastering Kraken2
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html" rel="noopener noreferrer"&gt;Part 1 - Initial Runs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html" rel="noopener noreferrer"&gt;Part 2 - Classification Performance Optimisation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html" rel="noopener noreferrer"&gt;Part 3 - Building custom databases&lt;/a&gt; (this post)&lt;/p&gt;

&lt;p&gt;Part 4 - Regular vs Fast Builds (upcoming)&lt;/p&gt;

&lt;p&gt;Part 5 - Benchmarking (upcoming)&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;In the previous post, we learned how to improve kraken2&lt;sup id="fnref:k2"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fn:k2" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt; classification performance. So far we have downloaded &amp;amp; used pre-built genome indices (databases).&lt;/p&gt;

&lt;p&gt;In this post, let's build a custom database for kraken2. For simplicity, let's use only refseq archaea genomes&lt;sup id="fnref:rag"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fn:rag" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt; for building the index.&lt;/p&gt;

&lt;h4&gt;
  
  
  Building Custom Database
&lt;/h4&gt;

&lt;p&gt;First, we need to download the taxonomy files. We can use the &lt;code&gt;k2&lt;/code&gt; script provided by kraken2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k2 download-taxonomy --db custom_db

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes ~30 minutes depending on the network speed. The taxonomy files are downloaded to the &lt;code&gt;custom_db/taxonomy&lt;/code&gt; directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls custom_db/taxonomy
citations.dmp division.dmp gencode.dmp merged.dmp nodes.dmp
nucl_wgs.accession2taxid delnodes.dmp gc.prt 
images.dmp names.dmp nucl_gb.accession2taxid readme.txt

$ du -hs custom_db/taxonomy
43G custom_db/taxonomy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's download the archaea RefSeq genomes. We can use the &lt;code&gt;k2 download-library&lt;/code&gt; command to fetch them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k2 download-library --library archaea --db custom_db

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This download runs on a single thread. Instead, we can use the &lt;code&gt;ncbi-genome-download&lt;/code&gt;&lt;sup id="fnref:ngd"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fn:ngd" rel="noopener noreferrer"&gt;3&lt;/a&gt;&lt;/sup&gt; tool to download the genomes. It provides much more granular control over the download process. For example, we can download only &lt;code&gt;--assembly-levels complete&lt;/code&gt; genomes, and we can download multiple genomes in parallel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install ncbi-genome-download

$ conda install -c bioconda ncbi-genome-download

$ ncbi-genome-download -s refseq -F fasta --parallel 40 -P archaea
Checking assemblies: 100%|███| 2184/2184 [00:19&amp;lt;00:00, 111.60entries/s]
Downloading assemblies: 100%|███| 2184/2184 [02:04&amp;lt;00:00, 4.54s/files]
Downloading assemblies: 2184files [02:23, 2184files/s]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In just 2 minutes, it has downloaded all the files. Let's gunzip them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find refseq -name "\*.gz" -print0 | parallel -0 gunzip

$ du -hs refseq
5.9G refseq

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's add all FASTA genome files to the custom database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time find refseq -name "\*.fna" -exec kraken2-build --add-to-library {} --db custom_db \;
667.46s user 90.78s system 106% cpu 12:54.80 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kraken2-build&lt;/code&gt; doesn't use multiple threads for adding genomes to the database. It also doesn't check whether a genome is already present in the database.&lt;/p&gt;

&lt;p&gt;Let's use &lt;code&gt;k2&lt;/code&gt; for adding genomes to the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export KRAKEN\_NUM\_THREADS=40

$ find . -name "\*.fna" -exec k2 add-to-library --files {} --db custom_db \;
668.37s user 88.44s system 159% cpu 7:54.40 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took only half the time compared to &lt;code&gt;kraken2-build&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's build the index from the library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2-build --db custom_db --build --threads 36
Creating sequence ID to taxonomy ID map (step 1)...
Found 0/125783 targets, searched through 60000000 accession IDs...
Found 59923/125783 targets, searched through 822105735 accession IDs, search complete.
lookup_accession_numbers: 65860/125783 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [2m1.950s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 5340021028 bytes
Capacity estimation complete. [23.875s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 59911 sequences, 3572145823 bp
Writing data to disk... complete.
Database files completed. [12m3.368s]
Database construction complete. [Total: 14m29.666s]
kraken2-build --db custom_db --build --threads 36 24534.98s user 90.50s system 2831% cpu 14:29.75 total

$ ls -ll
.rw-rw-r-- 5.3G anand 1 Aug 16:35 hash.k2d
drwxrwxr-x - anand 1 Aug 12:32 library
.rw-rw-r-- 64 anand 1 Aug 16:35 opts.k2d
.rw-rw-r-- 1.5M anand 1 Aug 16:22 seqid2taxid.map
.rw-rw-r-- 115k anand 1 Aug 16:23 taxo.k2d
lrwxrwxrwx 20 anand 1 Aug 12:31 taxonomy
.rw-rw-r-- 1.2M anand 1 Aug 16:22 unmapped.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We were able to build an index for ~6GB of input files in ~15 minutes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;We learnt some useful tips to speed up the custom database creation process. In the next post, we will learn about regular vs. fast builds.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ccb.jhu.edu/software/kraken2/" rel="noopener noreferrer"&gt;Kraken2&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fnref:k2" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/" rel="noopener noreferrer"&gt;RefSeq Archaea genomes&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fnref:rag" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kblin/ncbi-genome-download" rel="noopener noreferrer"&gt;https://github.com/kblin/ncbi-genome-download&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-build-custom-db.html#fnref:ngd" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>kraken2</category>
      <category>metagenomics</category>
    </item>
    <item>
      <title>Mastering kraken2 - Part 2 - Performance Optimisation</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sun, 28 Jul 2024 05:21:30 +0000</pubDate>
      <link>https://dev.to/chillaranand/mastering-kraken2-part-2-performance-optimisation-1ln9</link>
      <guid>https://dev.to/chillaranand/mastering-kraken2-part-2-performance-optimisation-1ln9</guid>
      <description>&lt;h4&gt;
  
  
  Mastering Kraken2
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html" rel="noopener noreferrer"&gt;Part 1 - Initial Runs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 2 - Performance Optimisation (this post)&lt;/p&gt;

&lt;p&gt;Part 3 - Custom Indices (upcoming)&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;In the previous post, we learned how to set up kraken2&lt;sup id="fnref:k2"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fn:k2" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt;, download pre-built indices, and run kraken2. In this post, we will learn various ways to speed up the classification process.&lt;/p&gt;

&lt;h4&gt;
  
  
  Increasing RAM
&lt;/h4&gt;

&lt;p&gt;The Kraken2 standard database is ~80GB in size. It is recommended to have at least as much RAM as the database size to run kraken2 efficiently&lt;sup id="fnref:ksr"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fn:ksr" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt;. Let's use a 128GB RAM machine and run kraken2 with the ERR10359977&lt;sup id="fnref:err"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fn:err" rel="noopener noreferrer"&gt;3&lt;/a&gt;&lt;/sup&gt; sample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz &amp;gt; output.txt
Loading database information... done.
95064 sequences (14.35 Mbp) processed in 2.142s (2662.9 Kseq/m, 402.02 Mbp/m).
  94816 sequences classified (99.74%)
  248 sequences unclassified (0.26%)
kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz &amp;gt; 1.68s user 152.19s system 35% cpu 7:17.55 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the time taken has come down from 40 minutes to 7 minutes. The classification speed has also increased from 0.19 Mbp/m to 402.02 Mbp/m.&lt;/p&gt;

&lt;p&gt;The previous sample had only a few reads, and the speed is not a good indicator. Let's run kraken2 with a larger sample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz &amp;gt; output.txt
Loading database information... done.
Processed 14980000 sequences (2972330207 bp) ...
17121245 sequences (3397.15 Mbp) processed in 797.424s (1288.2 Kseq/m, 255.61 Mbp/m).
  9826671 sequences classified (57.39%)
  7294574 sequences unclassified (42.61%)
kraken2 --db k2_standard --report report.txt --paired &amp;gt; output.txt 526.39s user 308.24s system 68% cpu 20:23.86 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took almost 20 minutes to classify ~3 Gbp of data. Of those 20 minutes, 13 were spent on classification; the rest went to loading the db into memory.&lt;/p&gt;

&lt;p&gt;Let's use the k2_plusPF&lt;sup id="fnref:k2p"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fn:k2p" rel="noopener noreferrer"&gt;4&lt;/a&gt;&lt;/sup&gt; db, which is roughly twice the size of k2_standard, and run kraken2 again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_plusfp --report report.txt --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz &amp;gt; output.txt
Loading database information...done.
17121245 sequences (3397.15 Mbp) processed in 755.290s (1360.1 Kseq/m, 269.87 Mbp/m).
  9903824 sequences classified (57.85%)
  7217421 sequences unclassified (42.15%)
kraken2 --db k2_plusfp/ --report report.txt --paired SRR6915097_1.fastq.gz &amp;gt; 509.71s user 509.51s system 55% cpu 30:35.49 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took ~30 minutes to complete, but the classification itself took only ~13 minutes, similar to k2_standard. The remaining time was spent loading the db into memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Preloading db into RAM
&lt;/h4&gt;

&lt;p&gt;We can use vmtouch&lt;sup id="fnref:vmt"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fn:vmt" rel="noopener noreferrer"&gt;5&lt;/a&gt;&lt;/sup&gt; to preload the db into RAM. kraken2 provides a &lt;code&gt;--memory-mapping&lt;/code&gt; option to use the preloaded db.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ vmtouch -vt k2_standard/hash.k2d k2_standard/opts.k2d k2_standard/taxo.k2d
           Files: 3
     Directories: 0
   Touched Pages: 20382075 (77G)
         Elapsed: 434.77 seconds

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's run kraken2 with &lt;code&gt;--memory-mapping&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --report report.txt --memory-mapping --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz &amp;gt; output.txt
Loading database information... done.
17121245 sequences (3397.15 Mbp) processed in 532.486s (1929.2 Kseq/m, 382.79 Mbp/m).
  9826671 sequences classified (57.39%)
  7294574 sequences unclassified (42.61%)
  kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq.gz &amp;gt; 424.20s user 11.76s system 81% cpu 8:54.98 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the classification took only ~10 minutes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multithreading
&lt;/h4&gt;

&lt;p&gt;kraken2 supports multiple threads. I am using a machine with 40 threads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz --memory-mapping --threads 32 &amp;gt; output.txt
Loading database information... done.
17121245 sequences (3397.15 Mbp) processed in 71.675s (14332.5 Kseq/m, 2843.81 Mbp/m).
  9826671 sequences classified (57.39%)
  7294574 sequences unclassified (42.61%)
kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq.gz 556.58s user 22.85s system 762% cpu 1:16.02 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 32 threads, the classification took only 1 minute. Beyond 32 threads, the classification time did not decrease significantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimising input files
&lt;/h4&gt;

&lt;p&gt;So far we have used gzipped input files. Let's use unzipped input files and run kraken2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gunzip SRR6915097_1.fastq.gz
$ gunzip SRR6915097_2.fastq.gz

$ time kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq SRR6915097_2.fastq --memory-mapping --threads 30 &amp;gt; output.txt
Loading database information... done.
17121245 sequences (3397.15 Mbp) processed in 34.809s (29512.0 Kseq/m, 5855.68 Mbp/m).
  9826671 sequences classified (57.39%)
  7294574 sequences unclassified (42.61%)
kraken2 --db k2_standard --report report.txt --paired SRR6915097_1.fastq 30 565.03s user 17.12s system 1530% cpu 38.047 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the classification time has come down to 40 seconds.&lt;/p&gt;

&lt;p&gt;Since the input fastq files are paired, kraken2 also spends time interleaving them on the fly. Let's interleave the files beforehand and run kraken2.&lt;/p&gt;

&lt;p&gt;To interleave the files, let's use the &lt;code&gt;seqfu&lt;/code&gt; tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ conda install -y -c conda-forge -c bioconda "seqfu&amp;gt;1.10"

$ seqfu interleave -1 SRR6915097_1.fastq.gz -2 SRR6915097_2.fastq.gz &amp;gt; SRR6915097.fq

$ time kraken2 --db k2_standard --report report.txt --memory-mapping SRR6915097.fq --threads 32 &amp;gt; output.txt
Loading database information... done.
34242490 sequences (3397.15 Mbp) processed in 20.199s (101714.1 Kseq/m, 10090.91 Mbp/m).
  17983321 sequences classified (52.52%)
  16259169 sequences unclassified (47.48%)
kraken2 --db k2_standard --report report.txt --memory-mapping SRR6915097.fq 32 618.96s user 18.24s system 2653% cpu 24.013 total

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the classification time has come down to 24 seconds.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In terms of classification speed, we have come a long way from under 1 Mbp/m to over 10,000 Mbp/m. In the next post, we will learn how to optimise the creation of custom indices.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ccb.jhu.edu/software/kraken2/" rel="noopener noreferrer"&gt;Kraken2&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fnref:k2" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#system-requirements" rel="noopener noreferrer"&gt;Kraken System Requirements&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fnref:ksr" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/077/ERR10359977/ERR10359977.fastq.gz"&gt;ERR10359977.fastq.gz&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fnref:err" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://benlangmead.github.io/aws-indexes/k2" rel="noopener noreferrer"&gt;Genomic Index Zone - k2&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fnref:k2p" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://hoytech.com/vmtouch/" rel="noopener noreferrer"&gt;https://hoytech.com/vmtouch/&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html#fnref:vmt" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>kraken2</category>
      <category>metagenomics</category>
    </item>
    <item>
      <title>Mastering Kraken2 - Part 1 - Initial Runs</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sun, 28 Jul 2024 05:14:25 +0000</pubDate>
      <link>https://dev.to/chillaranand/mastering-kraken2-part-1-initial-runs-ajl</link>
      <guid>https://dev.to/chillaranand/mastering-kraken2-part-1-initial-runs-ajl</guid>
      <description>&lt;h4&gt;
  
  
  Mastering Kraken2
&lt;/h4&gt;

&lt;p&gt;Part 1 - Initial Runs (this post)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html" rel="noopener noreferrer"&gt;Part 2 - Performance Optimisation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 3 - Custom Indices (upcoming)&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction
&lt;/h4&gt;

&lt;p&gt;Kraken2&lt;sup id="fnref:Kraken2"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fn:Kraken2" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt; is widely used for metagenomics taxonomic classification, and it has pre-built indexes for many organisms. In this series, we will learn&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to set up kraken2, download pre-built indices&lt;/li&gt;
&lt;li&gt;Run kraken2 (8GB RAM) at ~0.19 Mbp/m (million base pairs per minute)&lt;/li&gt;
&lt;li&gt;Learn various ways to speed up the classification process&lt;/li&gt;
&lt;li&gt;Run kraken2 (128GB RAM) at ~1200 Mbp/m&lt;/li&gt;
&lt;li&gt;Build custom indices&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Installation
&lt;/h4&gt;

&lt;p&gt;Let's start with an 8 GB RAM machine. We can install kraken2 using the &lt;code&gt;install_kraken2.sh&lt;/code&gt; script as per the manual&lt;sup id="fnref:install_kraken2"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fn:install_kraken2" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/DerrickWood/kraken2
$ cd kraken2
$ ./install_kraken2.sh /usr/local/bin
# ensure kraken2 is in the PATH
$ export PATH=$PATH:/usr/local/bin

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you already have conda installed, you can install kraken2 from conda as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ conda install -c bioconda kraken2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Download pre-built indices
&lt;/h4&gt;

&lt;p&gt;Building kraken2 indices takes a lot of time and resources. For now, let's download and use the pre-built indices. In the final post, we will learn how to build the indices.&lt;/p&gt;

&lt;p&gt;Genomic Index Zone&lt;sup id="fnref:GenomicIndexZone"&gt;&lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fn:GenomicIndexZone" rel="noopener noreferrer"&gt;3&lt;/a&gt;&lt;/sup&gt; provides pre-built indices for kraken2. Let's download the standard database. It contains RefSeq archaea, bacteria, viral, plasmid, human, &amp;amp; UniVec_Core sequences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240605.tar.gz
$ mkdir k2_standard
$ tar -xvf k2_standard_20240605.tar.gz -C k2_standard

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extracted directory contains three files - &lt;code&gt;hash.k2d&lt;/code&gt;, &lt;code&gt;opts.k2d&lt;/code&gt;, &lt;code&gt;taxo.k2d&lt;/code&gt; which are the kraken2 database files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l *.k2d
.rw-r--r-- 83G anand 13 Jul 12:34 hash.k2d
.rw-r--r-- 64 anand 13 Jul 12:34 opts.k2d
.rw-r--r-- 4.0M anand 13 Jul 12:34 taxo.k2d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Classification
&lt;/h4&gt;

&lt;p&gt;To run the taxonomic classification, let's use the &lt;code&gt;ERR10359977&lt;/code&gt; human gut metagenome from NCBI SRA.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ wget https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/077/ERR10359977/ERR10359977.fastq.gz
$ kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz &amp;gt; output.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, the machine I have used has 8GB RAM and an additional 8GB of swap. Since kraken2 needs the entire db (~80GB) in memory, the kernel will kill the process when it tries to consume more than 16GB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz &amp;gt; output.txt
Loading database information...Command terminated by signal 9
0.02user 275.83system 8:17.43elapsed 55%CPU 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To prevent this, let's increase the swap space to 128 GB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an empty swapfile of 128GB
sudo dd if=/dev/zero of=/swapfile bs=1G count=128

# Turn swap off - It might take several minutes
sudo swapoff -a

# Set the permissions for swapfile
sudo chmod 0600 /swapfile

# make it a swap area
sudo mkswap /swapfile  

# Turn the swap on
sudo swapon /swapfile

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
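After turning the swap on, it's worth confirming that the kernel actually sees it. A minimal check (a sketch; it assumes a Linux machine where the swapfile steps above were run as root):

```shell
# List active swap areas; the new /swapfile should be listed here
swapon --show

# SwapTotal in /proc/meminfo reflects the enlarged swap space
grep -i swap /proc/meminfo
```

If `/swapfile` is missing from the output, re-check the `mkswap`/`swapon` steps before launching kraken2.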



&lt;p&gt;We can time the classification process using the &lt;code&gt;time&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz &amp;gt; output.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have a machine with large RAM, the same scenario can be simulated using &lt;code&gt;systemd-run&lt;/code&gt;. This will limit the memory usage of kraken2 to 6.5GB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time systemd-run --scope -p MemoryMax=6.5G --user time kraken2 --db k2_standard --report report.txt ERR10359977.fastq.gz &amp;gt; output.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the CPU performance, this will take around 40 minutes to complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loading database information... done.
95064 sequences (14.35 Mbp) processed in 1026.994s (5.6 Kseq/m, 0.84 Mbp/m).
  94939 sequences classified (99.87%)
  125 sequences unclassified (0.13%)
  4.24user 658.68system 38:26.78elapsed 28%CPU 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we try a gut WGS (Whole Genome Sequencing) sample like &lt;code&gt;SRR6915097&lt;/code&gt;, which contains ~3.3 Gbp, it will take weeks to complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time systemd-run --scope -p MemoryMax=6G --user time kraken2 --db k2_standard --paired SRR6915097_1.fastq.gz SRR6915097_2.fastq.gz &amp;gt; output.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tried running this on the 8 GB machine. Even after 10 days, it had processed only 10% of the data.&lt;/p&gt;

&lt;p&gt;If we have to process a large number of such samples, it would take months, which is not practical.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this post, we ran kraken2 on an 8GB machine and learned that it is not feasible to run kraken2 on large samples.&lt;/p&gt;

&lt;p&gt;In the next post, we will learn how to speed up the classification process and run classification at 1200 Mbp/m.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt; : &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html" rel="noopener noreferrer"&gt;Part 2 - Performance Optimisation&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ccb.jhu.edu/software/kraken2/" rel="noopener noreferrer"&gt;Kraken2&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fnref:Kraken2" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#installation" rel="noopener noreferrer"&gt;Kraken2 - Manual - Install&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fnref:install_kraken2" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://benlangmead.github.io/aws-indexes/k2" rel="noopener noreferrer"&gt;Genomic Index Zone - k2&lt;/a&gt; &lt;a href="https://avilpage.com/2024/07/mastering-kraken2-initial-runs.html#fnref:GenomicIndexZone" rel="noopener noreferrer"&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>kraken2</category>
      <category>metagenomics</category>
    </item>
    <item>
      <title>Headlamp - k8s Lens open source alternative</title>
      <dc:creator>Chillar Anand</dc:creator>
      <pubDate>Sun, 23 Jun 2024 20:18:02 +0000</pubDate>
      <link>https://dev.to/chillaranand/headlamp-k8s-lens-open-source-alternative-4do1</link>
      <guid>https://dev.to/chillaranand/headlamp-k8s-lens-open-source-alternative-4do1</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JYizoOsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/headlamp-k8s-lens-open-source-alternative.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JYizoOsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/headlamp-k8s-lens-open-source-alternative.png" alt="headlamp - Open source Kubernetes Lens alternator" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Lens is not open source, I tried out monokle, octant, k9s, and headlamp&lt;sup id="fnref:headlamp"&gt;&lt;a href="https://avilpage.com/2024/06/headlamp-k8s-lens-open-source-alternative.html#fn:headlamp"&gt;1&lt;/a&gt;&lt;/sup&gt;. Among them, headlamp's UI &amp;amp; features are closest to Lens.&lt;/p&gt;

&lt;h4&gt;
  
  
  Headlamp
&lt;/h4&gt;

&lt;p&gt;Headlamp is a CNCF sandbox project that provides a cross-platform desktop application for managing Kubernetes clusters. It auto-detects clusters and shows cluster-wide resource usage by default.&lt;/p&gt;

&lt;p&gt;It can also be installed inside the cluster and can be accessed using a web browser. This is useful when we want to access the cluster from a mobile device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm repo add headlamp https://headlamp-k8s.github.io/headlamp/

$ helm install headlamp headlamp/headlamp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a token &amp;amp; port-forward the service to access it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl create token headlamp

# we can do this via headlamp UI as well
$ kubectl port-forward service/headlamp 8080:80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can access the headlamp UI at &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yczmjwp_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/headlamp-k8s-lens-open-source-alternative2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yczmjwp_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://avilpage.com/images/headlamp-k8s-lens-open-source-alternative2.png" alt="headlamp - Open source Kubernetes Lens alternator" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;If you are looking for an open source alternative to Lens, headlamp is a good choice. It provides a UI &amp;amp; features similar to Lens, and it is accessible from mobile devices as well.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://headlamp.dev/"&gt;https://headlamp.dev/&lt;/a&gt; &lt;a href="https://avilpage.com/2024/06/headlamp-k8s-lens-open-source-alternative.html#fnref:headlamp"&gt;↩&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
