Calculating tetranucleotide (k-mer) frequencies

Jennifer Meneghin — Thu, 01 Sep 2022 19:01:02 +0000

Last week I posted about a Perl script I wrote that calculates %GC content for genome sequences. This week, I’m generalizing this idea to k-mers. In bioinformatics, k-mers are substrings of length k in biological sequences.

In 2016 I wrote a Perl script for the Reysenbach lab that calculates k-mer frequencies, for all k-mers in a given sequence. It was made public soon after, and you can find get_kmer_frequencies.pl on my Github page.

Typically, in the lab, we were interested in tetranucleotide frequencies. Tetranucleotides are k-mers of length four. Tetranucleotide frequencies have been used in the analysis of microbial genomes for many years [1], and they are often used in the process of binning contigs in metagenomics [2].

Originally, I wrote a script called get_tetranucleotide_frequencies.pl that was used in this PhD dissertation:

Buerger, P. 2017. Viruses: contributors to and mitigators of black band disease in corals. James Cook University, Queensland, Australia.

The newer get_kmer_frequencies.pl script was used to calculate tetranucleotide frequencies in this Master’s thesis:

St. John, E. 2019. Symbiosis in Archaea: Functional and Phylogenetic Diversity of Marine and Terrestrial Nanoarchaeota and their Hosts.

The new script allows you to calculate k-mers of any length. This script takes three parameters; a fasta file, k and a file name prefix:

get_kmer_frequencies.pl infile.fasta k fileprefix

It returns a tab delimited file called fileprefix_kmer.txt, where fileprefix is the prefix you specified in the parameters. The columns are fasta records, and rows are k-mer counts. Notice all k-mer counts are given here — so if your data contains any other symbols besides A, C, G, and T, those will be included in the k-mer combinations. Extra rows are an indication of dirty data, and I would remove these rows for analysis purposes.

The newer get_kmer_frequencies.pl script also takes into consideration that reverse complimented k-mers should be summed together.

"DNA occurs as a double strand where each A is paired with a T and vice versa, and each C is paired with a G and vice versa. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the reverse complement of ACCTGAG is CTCAGGT.” From A Computer Scientist’s Dictionary for Genomics

The fasta sequences we look at on the computer only contain the nucleotides of a single strand of DNA, while actual DNA is double stranded. So, if our fasta sequence DNA has the k-mer AAAA 20 times, for example, then the other strand in actual reality should have TTTT 20 times. But, that second strand also contains AAAA, say 12 times for example, and then the first strand should also contain TTTT 12 times. So, in double stranded reality, AAAA/TTTT shows up 32 times on the double stranded string (for this example), sometimes with AAAA on the first strand, sometimes on the second strand. To take this into consideration, we sum the reverse complements.

For tetranucleotides, this means instead of 256 possible combinations (4 to the 4th power — given there are 4 nucleotides and a k-mer sequence length of 4), there are actually only 136. Why not 128? Because 16 tetranucleotide sequences are reverse complements of themselves (e.g., AATT and CATG). So, we have (256-16)/2 = 120 tetranucleotides with reverse complements, and then we add back in the tetranucleotides without reverse complements to get 120+16 = 136 total tetranucleotides.

These calculations are similar for any length k-mer. We only find k-mers with self-reverse-complements with even length k-mers. So, for length 3, we have 4 to the 3rd power divided by 2, or 64/2 = 32 possible 3-mers. For length 1, we have 4 to the 1st power divided by 2, or 2 1-mers, A/T and C/G. Bringing this full circle, if we express our 1-mers as a percentage, we only need to look at A/T or C/G. If we choose C/G, that’s what’s known as %GC content.

In 2020 I rewrote several of my Perl scripts in Python, including get_kmer_frequencies.py, which can be run as follows.

get_kmer_frequencies.py -i infile.fasta -k k -p fileprefix

This script is available for download on my GitHub page. To test for speed, I ran these two scripts five times each with two different sequence examples; a full genome file containing one long fasta record, and a Metagenome Assembled Genome (MAG) file containing 1708 shorter fasta records. Python was 2.9 times faster than Perl for the full genome, and 2.1 times faster than Perl for the MAG. I’ve recently become aware of the optimizing compiler RPerl, which allows Perl to be compiled into “ultra-fast and fully-compatible” C++, so I’m excited to try this out and see how the results compare to both Perl and Python.

[1] Noble PA, Citek RW, Ogunseitan OA. Tetranucleotide frequencies in microbial genomes. Electrophoresis. 1998 Apr;19(4):528-35. doi: 10.1002/elps.1150190412. PMID: 9588798.

[2] Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.

Get GC Content

Jennifer Meneghin — Wed, 24 Aug 2022 19:21:43 +0000

I wrote a Perl script that calculates GC content for all the sequences in a fasta file back in 2010 for the Reysenbach Lab, and it was made publicly available soon after. You can find get_gc_content.pl on GitHub (https://github.com/jmeneghin/perl-for-reysenbach-lab)

Here is a published article that uses my script:
Schilling, T., Hoppert, M., and Hertel, R. 2018. Genomic Analysis of the Recent Viral Isolate vB_BthP-Goe4 Reveals Increased Diversity of φ29-Like Phages. Viruses, 10(11), 624, doi: 10.3390/v10110624.

And also a PhD Dissertation:
Quistad, S. D. 2015. Viruses, Corals and the Origin of Metazoans. UC San Diego, San Diego, CA.

From a programming prospective, A DNA sequence is a sequence of As, Gs, Cs and Ts. The GC content of a sequence is the number of Gs + the number of Cs divided by the total number of As, Gs, Cs and Ts in the sequence (i.e. the length of the sequence), multiplied by 100% to get a percentage.

Actual DNA is double stranded, but when we look at the sequence on a computer, we don’t need to look at the second strand because base pairs only form naturally between G and C, and A and T. So if you have a G on one strand you know you have a C on the other at the same location.

If you are counting both strands, the total number of Gs and Cs should be the same, overall. Similarly, the total number of As and Ts should be the same, overall. And since there are only two groups here, if I give you the percentage of Gs and Cs on the first strand, it will be the same as the second strand, and also equal to 1 minus the percentage of As and Ts on the first and second strand. So it’s a nice summary statistic.

This script takes a fasta file as its first (and only) parameter. It returns a tab delimited file (gc_out.txt): column 1 = header ID (everything between “>” and the first space in the header), column 2 = gc content for the fasta entry, column 3 = the total nucleotide count, and columns 4, 5, 6, and 7 = the number of A’s, G’s, C’s and T’s for each fasta record. Any other extraneous symbols (such as spaces or Ns that stand for unknown) are not included in the calculation.

Many articles have been published about the role of GC content in the evolution of genomes, including studies on bacteria, plants, and animals including humans.

DEV Community: Jennifer Meneghin

Calculating tetranucleotide (k-mer) frequencies

Get GC Content