Playing with JavaScript performances and DNA

#javascript #programming #performance

Biology is an interesting world and bioinformatic is where computer science meets biology,
Today I will describe a simple yet interesting bioinformatic problem from an algorithmic prospective.: Calculate the reverse complement of DNA bases using Javascript.

In this article, I am using a bioinformatic problem because is fun and interesting, but I will be mostly talking about JavaScript performance.
We will

Start describing how DNA works (with some big simplifications… Eih! I am not a biologist!),
Propose some implementations, and then
try to archive the best time performance, comparing the time for completing the task.

heads-up: A basic knowledge of JavaScript language is required to follow along.

What’s the reverse complement?

Before explaining it, bear with me for a small tour of how DNA looks like.

Let start with some concepts,

The DNA helix is composed of two strands like in the image above.
A strand is a long sequence of this for letters ATGC ( each letter is a specific nucleotide Adenine, Thymidine, Guanidine, Cytidine ) in some order.
There is a specific relation between what is the first strand and what there is on the second strand: for each A in the first sequence there is a T on the other strand and vice versa, and for each G a C will be on the other strand.

The conversion from map DNA strand to complementary strang would be something like:

'A': 'T',
'G': 'C',
'T': 'A',
'C': 'G'

Here is an example:

I often hear these two sequences named 5’ to 3’’ ( 3’ end) and the second string is named 3’ to 5’ (5’’ end). The direction of reading is in both from 5’ to 3’’ and this means that a sequence is read from left to right but the other one (the complementary) is read from right to left.

In most of the file formats and web APIs that I worked with since the complementary DNA strand can be calculated from the first strand sequence, only one DNA strand is provided ( 3’ end) and is up to us to calculate the complementary.

Now, we have enough for our small challenge:

How can I generate a complementary strand?

Given an input:
TTATACGACTCACTATAGGGAGACTTAAGAAG
The expected output should look like this:
CTTCTTAAGTCTCCCTATAGTGAGTCGTATAA

Remember: we are reading the complementary in reverse order so the DNA sequence starts TT the complementary will end with AA.

Input:

TT ATACGACTCACTATAGGGAGACTTAAGAAG
Output:

CTTCTTAAGTCTCCCTATAGTGAGTCGTAT AA

Ok, let make the code talks for us:

Let start with a modern approach, a map

const map = (sequence) => {
    const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return sequence.split('').reverse().map(bp => map[bp]).join("")
}

This is “easy” to read and the steps are :

We take the input
“TACGA”
We separate each char and create an array
[ ‘T’ , ’A’ , ’C’ , ’G’ , ’A’]
Then map each char into his complementary
[ ‘A’ , ’T’ , ’G’ , ’C’ , ’T’]
We reverse
[ ‘T’ , ’C’ , ’G’ , ’T’ , ’A’]
And the join in a string
“TCGTA”

That’s it… right?

In most cases, yes, but today we are a bit more stubborn and we will try to find the best performance time for this job.
Why? Well even a small bacterial DNA can range in size anywhere from 130 kbp to over 14 Mbp ( a bp is a single letter/Nucleotide) so being fast could be important.

Ok, we have the rules, now let me present our players :

Player1:
We just saw the map implementation, let call map,

const map = (sequence) => {
    const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return sequence.split('')
                   .reverse()
                   .map(bp => map[bp])
                   .join("")
}

Player2: for loop and with if:

const forLoop = (sequence) => {
    let complement = ''
    for (let idx = 0; idx < sequence.length; idx++) {
        if (sequence[idx] === 'A') {
            complement = 'T' + complement
        } else if (sequence[idx] === 'T') {
            complement = 'A' + complement
        } else if (sequence[idx] === 'G') {
            complement = 'C' + complement
        } else if (sequence[idx] === 'C') {
            complement = 'G' + complement
        }
    }
    return complement
}

Player3: A for with a switch case:

const forSwitch = (sequence) => {
    let complement = '';

    for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
        switch (sequence[idx]) {
            case 'A':
                complement = 'T' + complement
                break;
            case 'T':
                complement = 'A' + complement
                break;
            case 'G':
                complement = 'C' + complement
                break;
            case 'C':
                complement = 'G' + complement
                break;
        }
    }
    return complement
}

We will run these implementations ( and some small variations, github for more details ), 10000 times on a 35752 long DNA sequence and record the best time, the worst time, and the overall average time.

Ready go!

This graph is not that easy to read, let me provide a table ordered by

Code	Average	Best	Worst
For (optimized) with switch case	0.9446	0.4836	99258,00
For with multiple if	21564,00	0.5540	867263,00
For (optimized) with each if	11737,00	0.6480	98886,00
For with dictionary	15038,00	11097,00	83742,00
ForEach with dictionary	23381,00	17202,00	70510,00
Big Map with regular expression	29884,00	23477,00	103878,00
Map with dictionary	34595,00	26937,00	137978,00
Replace with dictionary	237074,00	51751,00	3951461,00

It looks like “replace with dictionary” is the worst in timing, and “optimized switch case” is the best.

Wrapping up,
In this implementation I can see that:

The regular expression and the dictionary are slower than if and switch case
For is the faster loop
switch case wins on if else if
The optimized of ‘for loop’ give some small improvements

Bonus, (what optimized for means):

Maybe you already noted the ‘switch case’ implementation. During my review of this topic, I fell on this website ( https://browserdiet.com/ ) and learned something interesting about the for loop that I didn’t know.

for ([initialization]; [condition]; [final-expression]){
        Statement
}

Every time a ‘statement’ gets executed the condition block runs again.
This sounds clear, but also sequence.length will re-calculate each interaction, consuming more time, and this is bad!

And there is a simple solution,

We can instance a variable with the value of sequence.length
in the initialization block:

for (let idx = 0; idx < sequence.length; idx++) {
    // sequence.length is calculated every interaction
}

for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
    // sequence.length is calculated only 1 time
}