Programs as proteins - an artificial genetic code

In the previous two posts of this series, we've developed two Joy programs. The first of these acts as a ribosome and is capable of translating any mRNA sequence into a corresponding polypeptide sequence. The second program acts as a chaperone that helps the polypeptide sequence to fold into its final and functional conformation. However, the situation is somewhat confusing, because the ribosome makes use of the real genetic code. And while it would be possible to consider the output as a valid Joy program, the functions of this program would correspond to real amino acids and therefore would not have any meaning in Joy.

In this post, we define an artificial genetic code so that the output of the ribosome composed with the chaperone would produce quoted Joy programs made up of conventional Joy functions. This will allow the ribosome (with the help of the chaperone) to produce quoted versions of itself and of the chaperone.

Biology

Code biology is a research field centred on the notion that so-called organic codes are ubiquitous in biology and in fact underpin a new mechanism of evolution: adaptation by natural convention.

An organic code is a mapping by which a biological sign (usually in the form of a molecule or molecular sequence) is made to correspond to a biological meaning (either a molecule or a molecular effect) in a fixed yet arbitrary way.

Much of code biology is based on the work of Marcello Barbieri, who proposed the following criteria by which organic codes can be identified:

The code must join two otherwise independent "worlds"
The code must be realized by an adapter that brings these worlds together
The code must be arbitrary as demonstrated by the ability to modify it experimentally

The genetic code was the first to be recognized as an organic code. It joins the otherwise independent worlds of RNA and polypeptides. Aminoacyl-tRNA molecules function as the adapters that realize the code. Genetic engineering of aminoacyl-tRNA synthetases allows for introducing changes to the standard genetic code. The genetic code therefore satisfies all three criteria.

Here is an example of a mapping that does not satisfy these criteria. DNA is transcribed to mRNA during the process of transcription. DNA and mRNA are both nucleic acids and do not as such necessarily represent independent worlds, but even if we give them the benefit of the doubt in this regard, there is no adapter molecule that maps DNA bases to RNA bases. And finally, the mapping depends on the chemically deterministic pairing of pyrimidines to purines and cannot be altered experimentally without changing the laws of chemistry. The process of transcription, while catalysed by enzymes, is therefore not governed by an organic code, but rather by the chemistry of nucleic acids.

Many other organic codes have been identified or proposed since the discovery of the genetic code. These include the histone code, epigenetic codes, metabolic codes, sugar codes, and many others. It is important to remember that, since these codes are arbitrary, the rules of the code must be contained and maintained inside the cell and ultimately in the genome. That is, the genetic code itself (and any other organic codes) is encoded in the genomes of organisms, allowing different species to use slightly different codes.

As a refresher, the genetic code maps codons (mRNA subsequences that are three bases long) onto amino acids (the building blocks of polypeptides/proteins). Since each of the three bases in a codon can have one of 4 values (A, C, G, U), there are 4 × 4 × 4 = 64 possible codons. There are however only 20 amino acids and as such the genetic code is a degenerate code - multiple codons map onto the same amino acid.
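The codon count can be checked with a few lines of Python. This is only an illustrative sketch; the names `BASES` and `codons` are mine and do not come from the Joy programs in this series.

```python
from itertools import product

BASES = "acgu"  # lowercase, matching the function names used later in this post

# Every codon is an ordered triple of bases, so there are 4 ** 3 of them.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

print(len(codons))       # 64
print(codons[:5])        # ['aaa', 'aac', 'aag', 'aau', 'aca']

# 64 codons spread over 20 amino acids: on average more than three codons
# per amino acid (ignoring stop codons), hence the degeneracy.
print(len(codons) / 20)  # 3.2
```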

In the artificial life system that we are developing, the roles of amino acids and nitrogenous bases alike are played by Joy functions. An artificial genetic code in our system would therefore map functions that act as nitrogenous bases onto functions that act as amino acids. The code will be realized by the ribosome (which for the time being also plays the role of aminoacyl-tRNA). Our code will not yet be modifiable because it is implemented in the primitive function translate. Future posts will address these shortcomings.

Code

The following is a list of all the Joy functions that we have required in order to program our artificial ribosome and chaperone. We may need a few additions later on and, because there is some redundancy, could even omit a few.

bra
ket
dup
pop (or zap)
swap
dip
i
cons
unit
cat
equal
ifte
a
c
g
u
translate

Technically we never made use of c, and while a, u, and g were present in the ribosome implementation, they only occurred in quoted form and were never executed. Nevertheless, since they were present at all, we are forced to treat them as "amino acids", despite the fact that semantically they are nucleic acid bases. We include c for the sake of symmetry.

We also regrettably include translate, which embodies the actual mapping. We are treating translate as a primitive function. That is, the implementation is considered to be opaque. I have mentioned before that this amounts to cheating, but we permit it for now, because it is in fact easy, though tedious, to represent translate in terms of the other functions that are already on the list. However, the real reason for keeping translate around is that we will get rid of it in a more comprehensive way later that mimics biology more closely.

We now have a list of 17 functions that we will treat as amino acids. With four bases, 2-base codons can distinguish only 4 × 4 = 16 values, so a codon needs at least 3 bases, as is also the case in the real genetic code. We could trivially reduce the number of amino acids to below 17, which would allow us to make use of 2-base codons (as in the typogenetics system discussed in a previous post). We stick to a 3-base code, however, because the redundancy of the genetic code is a key mechanism by which it absorbs mutations and allows for drift without necessarily altering function.
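The arithmetic behind the codon length can be sketched as follows. The helper `min_codon_length` is a hypothetical name introduced only for this illustration.

```python
def min_codon_length(n_meanings: int, n_bases: int = 4) -> int:
    """Return the smallest k such that n_bases ** k >= n_meanings."""
    length = 1
    while n_bases ** length < n_meanings:
        length += 1
    return length

print(min_codon_length(17))      # 3: the 16 two-base codons cannot cover 17 functions
print(min_codon_length(16))      # 2
print(min_codon_length(17 + 1))  # 3: the same holds once a stop signal is counted
```

Note that a 2-base code with a dedicated stop codon would leave room for at most 15 functions, so the list would have to shrink by at least two.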

Here then is an artificial genetic code:

Base 1	Base 2	Base 3	Amino acid
a	a	a	a
a	a	c	i
a	a	g	dip
a	a	u	u
a	c	a	dip
a	c	c	dip
a	c	g	dip
a	c	u	pop
a	g	a	bra
a	g	c	ket
a	g	g	bra
a	g	u	ket
a	u	a	ifte
a	u	c	ifte
a	u	g	ifte
a	u	u	equal
c	a	a	dup
c	a	c	dup
c	a	g	dup
c	a	u	pop
c	c	a	dip
c	c	c	c
c	c	g	g
c	c	u	i
c	g	a	cons
c	g	c	cons
c	g	g	cons
c	g	u	cons
c	u	a	equal
c	u	c	equal
c	u	g	equal
c	u	u	ifte
g	a	a	pop
g	a	c	pop
g	a	g	pop
g	a	u	swap
g	c	a	i
g	c	c	i
g	c	g	i
g	c	u	dip
g	g	a	unit
g	g	c	c
g	g	g	g
g	g	u	unit
g	u	a	translate
g	u	c	translate
g	u	g	bra
g	u	u	ket
u	a	a	swap
u	a	c	swap
u	a	g	swap
u	a	u	dip
u	c	a	cons
u	c	c	cons
u	c	g	cons
u	c	u	cons
u	g	a	cat
u	g	c	cat
u	g	g	cat
u	g	u	cat
u	u	a	a
u	u	c	{stop}
u	u	g	{stop}
u	u	u	u

The code is completely arbitrary (otherwise it wouldn't be a code), but I did build in some themes. aaa, for instance, maps onto a, ccc maps onto c, and so on. Furthermore, amino acids with similar functions were grouped together so that mutations generally convert amino acids into other amino acids with similar functions. The broad functions that I've considered are the following:

stack manipulation
unquoting / interpreting
quotation / list manipulation
conditionals
structural
misc
RNA
translation
stop codon
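To make the table above easier to experiment with, here is a sketch that loads it into a Python dictionary and counts how many codons map onto each amino acid. This is not the Joy primitive translate; `GENETIC_CODE`, `translate_sequence` and the decision to halt at the first stop codon are my own assumptions for the purpose of illustration.

```python
from collections import Counter
from itertools import product

BASES = "acgu"
STOP = "{stop}"

# The table above, one line per (base 1, base 2) pair in the order
# aa, ac, ag, au, ca, ..., uu; the four entries on each line are
# ordered by base 3 = a, c, g, u.
TABLE_ROWS = """
a i dip u
dip dip dip pop
bra ket bra ket
ifte ifte ifte equal
dup dup dup pop
dip c g i
cons cons cons cons
equal equal equal ifte
pop pop pop swap
i i i dip
unit c g unit
translate translate bra ket
swap swap swap dip
cons cons cons cons
cat cat cat cat
a {stop} {stop} u
"""

codons = ["".join(triple) for triple in product(BASES, repeat=3)]
GENETIC_CODE = dict(zip(codons, TABLE_ROWS.split()))

def translate_sequence(mrna):
    """Map a sequence of bases onto amino acids.

    Assumption: translation halts at the first stop codon, and trailing
    bases that do not form a full codon are ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE["".join(mrna[i:i + 3])]
        if amino_acid == STOP:
            break
        protein.append(amino_acid)
    return protein

degeneracy = Counter(GENETIC_CODE.values())
print(len(GENETIC_CODE))                   # 64
print(len(degeneracy))                     # 18 (17 functions plus stop)
print(degeneracy.most_common(2))           # [('cons', 8), ('dip', 7)]
print(translate_sequence("ggaugauucaaa"))  # ['unit', 'cat']
```

The counts show how uneven the redundancy is: cons is reachable from 8 codons and dip from 7, while a, c, g, u, unit, translate and the stop signal each have only 2.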

Let us consider the example Joy program (protein) from the previous post:

[[[] swap dup] i]

Its primary structure is:

[ bra bra ket swap dup ket i]
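The relationship between the primary structure and the folded program can be sketched in Python. This is a stand-in for illustration only, not the Joy chaperone from the previous post; the function name `fold` and the use of nested Python lists for quotations are my assumptions.

```python
def fold(primary):
    """Turn a flat sequence containing bra/ket into nested lists."""
    stack = [[]]  # stack[-1] is the quotation currently being built
    for token in primary:
        if token == "bra":
            stack.append([])              # open a new quotation
        elif token == "ket":
            if len(stack) == 1:
                raise ValueError("ket without a matching bra")
            quotation = stack.pop()       # close the innermost quotation
            stack[-1].append(quotation)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("bra without a matching ket")
    return stack[0]

primary = "bra bra ket swap dup ket i".split()
print(fold(primary))  # [[[], 'swap', 'dup'], 'i']
```

The printed nested list corresponds to the quoted program [[[] swap dup] i].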

We can now finally map the primary amino acid sequence back to mRNA. One possibility could be:

[ a g a a g a a g c g a u c a a a g c a a c]
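As a sanity check, the sketch below translates this mRNA back into the primary structure using only the five codons it needs, copied from the table above. It also counts how many different mRNA sequences would encode the same protein. The names `CODE_SUBSET` and `CODONS_PER_AMINO_ACID` are hypothetical, and the codon counts were tallied by hand from the table.

```python
from math import prod

# Only the codons used in this example, copied from the table above.
CODE_SUBSET = {"aga": "bra", "agc": "ket", "gau": "swap", "caa": "dup", "aac": "i"}

# Number of codons in the full table that map onto each amino acid used here.
CODONS_PER_AMINO_ACID = {"bra": 3, "ket": 3, "swap": 4, "dup": 3, "i": 5}

mrna = "a g a a g a a g c g a u c a a a g c a a c".split()

# Group the bases into codons of three, then look each codon up.
codons = ["".join(mrna[i:i + 3]) for i in range(0, len(mrna), 3)]
primary = [CODE_SUBSET[codon] for codon in codons]

print(codons)   # ['aga', 'aga', 'agc', 'gau', 'caa', 'agc', 'aac']
print(primary)  # ['bra', 'bra', 'ket', 'swap', 'dup', 'ket', 'i']

# Each position can independently use any synonymous codon.
alternatives = prod(CODONS_PER_AMINO_ACID[amino_acid] for amino_acid in primary)
print(alternatives)  # 4860
```

So the sequence shown is just one of 4860 synonymous mRNAs for this small protein, which gives a sense of how much room the code leaves for silent mutations.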

We are also in a position to derive the mRNA sequences for both our ribosome and chaperone implementation. However, these sequences are too long to be useful here.

Let us instead reflect on the role of context in cells. The genetic code is realized by the aminoacyl-tRNA synthetases that are present in cells. However, these very aminoacyl-tRNA synthetases can only be active when folded up into a functional tertiary structure. Successful protein folding, on the other hand, relies on chaperones and on cellular conditions that are conducive to proper folding. These conditions, the intracellular milieu, are in turn maintained by active transporters. And so we see that there is a circular dependency among cellular agents, all represented in the genome, that rely on one another. How does one bootstrap such a system?

Protolife forms perhaps relied much more on the environment and on the laws of chemistry and were at the mercy of fluctuations in these conditions. In time, life forms internalized the maintenance of the conditions under which they thrive and so started to actively fabricate not only themselves, but also the conditions that they require to survive. As soon as they could take these new conditions for granted, they could build on them and develop extraordinary degrees of complexity by means of adaptation by natural convention: multiple layers of organic codes on top of the genetic code.
