DEV Community

Jennifer Meneghin
Jennifer Meneghin

Posted on

Calculating tetranucleotide (k-mer) frequencies

Last week I posted about a Perl script I wrote that calculates %GC content for genome sequences. This week, I’m generalizing this idea to k-mers. In bioinformatics, k-mers are substrings of length k in biological sequences.

In 2016 I wrote a Perl script for the Reysenbach lab that calculates k-mer frequencies, for all k-mers in a given sequence. It was made public soon after, and you can find get_kmer_frequencies.pl on my Github page.

Typically, in the lab, we were interested in tetranucleotide frequencies. Tetranucleotides are k-mers of length four. Tetranucleotide frequencies have been used in the analysis of microbial genomes for many years [1], and they are often used in the process of binning contigs in metagenomics [2].

Originally, I wrote a script called get_tetranucleotide_frequencies.pl that was used in this PhD dissertation:

Buerger, P. 2017. Viruses: contributors to and mitigators of black band disease in corals. James Cook University, Queensland, Australia.

The newer get_kmer_frequencies.pl script was used to calculate tetranucleotide frequencies in this Master’s thesis:

St. John, E. 2019. Symbiosis in Archaea: Functional and Phylogenetic Diversity of Marine and Terrestrial Nanoarchaeota and their Hosts.

The new script allows you to calculate k-mers of any length. This script takes three parameters; a fasta file, k and a file name prefix:

get_kmer_frequencies.pl infile.fasta k fileprefix

It returns a tab delimited file called fileprefix_kmer.txt, where fileprefix is the prefix you specified in the parameters. The columns are fasta records, and rows are k-mer counts. Notice all k-mer counts are given here — so if your data contains any other symbols besides A, C, G, and T, those will be included in the k-mer combinations. Extra rows are an indication of dirty data, and I would remove these rows for analysis purposes.

The newer get_kmer_frequencies.pl script also takes into consideration that reverse complimented k-mers should be summed together.

"DNA occurs as a double strand where each A is paired with a T and vice versa, and each C is paired with a G and vice versa. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the reverse complement of ACCTGAG is CTCAGGT.” From A Computer Scientist’s Dictionary for Genomics

The fasta sequences we look at on the computer only contain the nucleotides of a single strand of DNA, while actual DNA is double stranded. So, if our fasta sequence DNA has the k-mer AAAA 20 times, for example, then the other strand in actual reality should have TTTT 20 times. But, that second strand also contains AAAA, say 12 times for example, and then the first strand should also contain TTTT 12 times. So, in double stranded reality, AAAA/TTTT shows up 32 times on the double stranded string (for this example), sometimes with AAAA on the first strand, sometimes on the second strand. To take this into consideration, we sum the reverse complements.

For tetranucleotides, this means instead of 256 possible combinations (4 to the 4th power — given there are 4 nucleotides and a k-mer sequence length of 4), there are actually only 136. Why not 128? Because 16 tetranucleotide sequences are reverse complements of themselves (e.g., AATT and CATG). So, we have (256-16)/2 = 120 tetranucleotides with reverse complements, and then we add back in the tetranucleotides without reverse complements to get 120+16 = 136 total tetranucleotides.

These calculations are similar for any length k-mer. We only find k-mers with self-reverse-complements with even length k-mers. So, for length 3, we have 4 to the 3rd power divided by 2, or 64/2 = 32 possible 3-mers. For length 1, we have 4 to the 1st power divided by 2, or 2 1-mers, A/T and C/G. Bringing this full circle, if we express our 1-mers as a percentage, we only need to look at A/T or C/G. If we choose C/G, that’s what’s known as %GC content.

In 2020 I rewrote several of my Perl scripts in Python, including get_kmer_frequencies.py, which can be run as follows.

get_kmer_frequencies.py -i infile.fasta -k k -p fileprefix

This script is available for download on my GitHub page. To test for speed, I ran these two scripts five times each with two different sequence examples; a full genome file containing one long fasta record, and a Metagenome Assembled Genome (MAG) file containing 1708 shorter fasta records. Python was 2.9 times faster than Perl for the full genome, and 2.1 times faster than Perl for the MAG. I’ve recently become aware of the optimizing compiler RPerl, which allows Perl to be compiled into “ultra-fast and fully-compatible” C++, so I’m excited to try this out and see how the results compare to both Perl and Python.

[1] Noble PA, Citek RW, Ogunseitan OA. Tetranucleotide frequencies in microbial genomes. Electrophoresis. 1998 Apr;19(4):528-35. doi: 10.1002/elps.1150190412. PMID: 9588798.

[2] Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.

Top comments (2)

Collapse
 
duffee profile image
Boyd Duffee

What were the average execution times for the scripts, or (what I really want to know is) how important is speed? Are we talking seconds, minutes or hours and how do the trials compare with a typical day-to-day workload?

Cheers, and good luck with the RPerl. I've heard the talk and am curious but never given it a spin.

Collapse
 
duffee profile image
Boyd Duffee

I finally got 'round to posting my reply Hope you like it ;)