Get GC Content

#perl #bioinformatics #genomics

I wrote a Perl script that calculates GC content for all the sequences in a fasta file back in 2010 for the Reysenbach Lab, and it was made publicly available soon after. You can find get_gc_content.pl on GitHub (https://github.com/jmeneghin/perl-for-reysenbach-lab)

Here is a published article that uses my script:
Schilling, T., Hoppert, M., and Hertel, R. 2018. Genomic Analysis of the Recent Viral Isolate vB_BthP-Goe4 Reveals Increased Diversity of φ29-Like Phages. Viruses, 10(11), 624, doi: 10.3390/v10110624.

And also a PhD Dissertation:
Quistad, S. D. 2015. Viruses, Corals and the Origin of Metazoans. UC San Diego, San Diego, CA.

From a programming prospective, A DNA sequence is a sequence of As, Gs, Cs and Ts. The GC content of a sequence is the number of Gs + the number of Cs divided by the total number of As, Gs, Cs and Ts in the sequence (i.e. the length of the sequence), multiplied by 100% to get a percentage.

Actual DNA is double stranded, but when we look at the sequence on a computer, we don’t need to look at the second strand because base pairs only form naturally between G and C, and A and T. So if you have a G on one strand you know you have a C on the other at the same location.

If you are counting both strands, the total number of Gs and Cs should be the same, overall. Similarly, the total number of As and Ts should be the same, overall. And since there are only two groups here, if I give you the percentage of Gs and Cs on the first strand, it will be the same as the second strand, and also equal to 1 minus the percentage of As and Ts on the first and second strand. So it’s a nice summary statistic.

This script takes a fasta file as its first (and only) parameter. It returns a tab delimited file (gc_out.txt): column 1 = header ID (everything between “>” and the first space in the header), column 2 = gc content for the fasta entry, column 3 = the total nucleotide count, and columns 4, 5, 6, and 7 = the number of A’s, G’s, C’s and T’s for each fasta record. Any other extraneous symbols (such as spaces or Ns that stand for unknown) are not included in the calculation.

Many articles have been published about the role of GC content in the evolution of genomes, including studies on bacteria, plants, and animals including humans.

Top comments (1)

E. Choroba • Aug 30 '22 • Edited

Interestingly, the script would be written something like this nowadays:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $infile = shift // usage();
my $outfile = shift // 'gc_out.txt';

open my $in,  '<', $infile  or die "Cannot open $infile: $!";
open my $out, '>', $outfile or die "Cannot open $outfile: $!";

say {$out} join "\t",
    'ID', '% GCContent', 'Total Count',
    'G Count', 'C Count', 'A Count', 'T Count';

my $seq = "";
while (<$in>) {
    chomp;
    if (/^>/) {
        process($seq) if length $seq;
        $seq = "";
        my ($id) = /^>(\S+)/;
        print {$out} "$id\t";
    } else {
        $seq .= $_;
    }
}
process($seq);

sub process {
    my ($seq) = @_;
    my %count;
    for (split //, $seq) {
        ++$count{TOTAL} if /[a-z]/i;
        ++$count{+uc}   if /[atgc]/i;
        ++$count{GC}    if /[gc]/i;
    }
    my $gc_content = $count{TOTAL} ? (100 * $count{GC}) / $count{TOTAL} : 0;
    say {$out} join "\t", $gc_content, map $count{$_} // 0, qw( TOTAL G C A T );
}

sub usage {
    print << '__USAGE__';
Get GC Content;
Usage: get_gc_content.pl <fasta file> [<output file>];
  This program takes a fasta file as it's first parameter.
  The optional second parameter is the name of the output file, gc_out.txt
      by default.
  The output file is tab delimited: column 1 = header ID (everything between >
 and the first space in the header), and column 2 = gc content for the fasta entry.
  Author: Jennifer Meneghin
__USAGE__
    exit
}