In this post we analyze severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from first principles.
At a high level, we can think of DNA as the blueprint, RNA as the instructions, and proteins as the functions. From high school biology we may remember that DNA is built from four nucleotide bases: guanine, adenine, cytosine, and thymine. Transcription copies DNA into RNA, which uses uracil in place of thymine. There are 20 common amino acids, each encoded by one or more codons, where a codon is a set of three mRNA nucleotides (A, U, G, and C). A protein is a chain of amino acids that folds into a specific shape, and that shape determines its function.
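To make the DNA to RNA to protein pipeline concrete, here is a minimal sketch in plain Python. It assumes the input is a coding-strand DNA sequence already in the correct reading frame, and it uses the standard genetic code. The function names `transcribe` and `translate` and the short example sequence are illustrative choices, not taken from the notebook.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the conventional
# U, C, A, G order (third base varies fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence for illustration only, not part of the viral genome.
dna = "ATGGCCATTGTAATGGGCCGCTGA"
mrna = transcribe(dna)
print(mrna)             # AUGGCCAUUGUAAUGGGCCGCUGA
print(translate(mrna))  # MAIVMGR
```

The table maps all 64 codons to 20 amino acids plus a stop signal, which is why several codons can encode the same amino acid.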
Let's summarize what we learned from the interactive notebook above. We obtained both the genome and the protein structure from the NCBI data bank. First we analyzed the genome annotation and discovered that its taxonomy is virus and that its molecule type is single-stranded RNA. From there we explored the genome and walked through the transcription and translation process to obtain its amino acid sequence. To expedite the process, we used the metadata to retrieve its protein coding sequences (CDS).
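The idea of using annotation metadata to pull out coding sequences can be sketched as follows. This is a simplified illustration, not the notebook's code: the `CDS` tuple, the `extract_cds` helper, the toy genome, and the coordinates are all hypothetical. It assumes each feature is a single contiguous range given as 1-based inclusive positions, the convention used in GenBank feature tables. Real records can also have compound locations, which this sketch ignores.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    gene: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def extract_cds(genome: str, features: list[CDS]) -> dict[str, str]:
    """Slice each annotated coding region out of the genome string."""
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return {f.gene: genome[f.start - 1:f.end] for f in features}


# Toy genome and made-up annotations, not real SARS-CoV-2 coordinates.
genome = "CCATGGCCTAAGGATGAAATGACC"
features = [CDS("geneA", 3, 11), CDS("geneB", 14, 22)]

for gene, sequence in extract_cds(genome, features).items():
    print(gene, sequence)
# geneA ATGGCCTAA
# geneB ATGAAATGA
```

Each extracted sequence starts with a start codon and ends with a stop codon, so it can be translated directly without scanning the whole genome for open reading frames.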
The 10 main CDS are A Chain of Proteins (ORF1ab), Spike Protein (S), Escape Artist (ORF3a), Envelope Protein (E), Membrane Protein (M), Signal Blocker (ORF6), Virus Liberator (ORF7a), Mystery Protein (ORF8), Nucleocapsid Protein (N), and Mystery Protein (ORF10). The two important proteins to remember are ORF1ab, which functions as the payload, and S, which is the exploit. The rest of the proteins assist in making everything happen. Lastly, we visualized a 3D model of the main protein structure.
For a visual walkthrough, visit The New York Times infographic.