## DEV Community is a community of 788,201 amazing developers

We're a place where coders share, stay up-to-date and grow their careers.

Danie Palm

Posted on

# Proteins as programs - to mock a matchingbird

In this instalment of Elixir of Life we devise a mechanism by which chemical affinity and binding can be realized in our artificial life system. As a refresher, we are trying to build an artificial life system using an artificial chemistry that is based on programs that interact with one another. These programs are written in a restricted version of the stack-based Joy programming language in which we permit only functions and quotations. This restricted version of Joy is implemented in Elixir, and as a side-effect we explore several features of Elixir along the way.

One of the Elixir features that we encounter in this post is the idea of using mocks to fulfil explicit contracts when testing code using the `mox` library. Coincidentally, Joy has strong ties to combinatory logic, its applicative cousin. Combinatory logic was invented by Moses Schönfinkel (German for something like "pretty finch"), elaborated on by Haskell Curry, and popularized by Raymond Smullyan in the delightful little book To Mock a Mockingbird.

In this book Smullyan likens function application in combinatory logic to birds (some of which are finches) calling out the name of a bird upon hearing the name of a bird. One of the popular birds in the book is the Mockingbird, which when hearing the bird name x will make the same call that x would have made hearing its own name. When a mocking bird hears its own name, it results in one of the most elegant self-reproducing "programs" or quines in existence.

If functions are birds, then this post will introduce a couple of matchingbirds. That kind of explains the title.

## Chemical recognition, affinity, and binding

The enzymatic complexes that initiate or catalyse the processes of transcription and translation often rely on recognizing patterns within the substrate sequences on which they operate. Examples of such patterns are the TATA box sequence, a stretch of DNA that indicates where transcription should begin, and the Shine-Dalgarno sequence, which lies upstream of the start codon AUG in bacterial and archaeal mRNA. We could even consider the start and stop codons of the genetic code to fall into this category of "sequences with meanings that need to be recognized".

While we could implement functionality that allows enzymes to recognize such sequences
with a predicate function like `equal` (as we have done up to now), the strict matching that we
get in such an implementation precludes any mistakes and variations from creeping in. Moreover, in biochemistry, sequences often match one another in a complementary way. Consider, for instance, how purines (A and G) strongly prefer to pair with pyrimidines (C and T/U) and weakly, if at all, with themselves. At first glance, such strong complementarity might not be obvious elsewhere in biochemistry, but consider how enzymatic substrate binding sites are really complementary to their substrates. Also consider how a single binding site could bind multiple related substrates, perhaps with different affinities or strengths.

So we turn to something fuzzy instead.

## The `match` predicate function

The `match` function is a predicate that returns true with a probability that is proportional to the degree to which a candidate `C` matches a given pattern `P`, taking complementary binding into account as appropriate. In other words:

``````C P match == B
``````

where `B` is a boolean value or, when using Church-encoding, `[pop i]` for true and `[swap pop i]` for false. In other words, `match` is more likely to return `[pop i]` for values of `C` that strongly match `P`. Let us consider a few concrete examples involving nucleotide bases. Note that the results are probabilistic, I'm providing the most probable answer:

``````# individual RNA bases
a u match == true
a g match == false

# sequences of bases
[] [] match == true
[] [a] match == false
[a u g] [u a c] match == true
``````

But how does `match` determine how well the candidate matches the pattern?

Recall the second post of this series in which we've implemented the Needleman-Wunsch global sequence alignment algorithm and hinted at using (complementary) sequence alignment scores as mechanisms by which binding affinity could be determined. In the meantime, I've extended that work to include the Smith-Waterman local sequence alignment algorithm, and modularized the code to allow pluggable implementations of the similarity score. These alignment algorithms are available in the `belign` library which I hope to open-source at some point.

This will allow us to determine a similarity or rather "match" score for any two molecules in our system.

Let's have a look at the implementation of `match` in our base language Elixir:

``````def match([pattern, candidate | tail] = _stack) do
# Determine the match score between the pattern and the candidate
score = Belign.Similarity.Global.similarity(pattern, candidate)
# Calculate a binding probability from the score
binding_probability = Float.pow(score * 1.0, 10)
# Pick a random pivot
pivot = Slyk.Random.random()

# Consider it a match if the binding probability is larger than the pivot.
# @truthy equals [pop i] and @falsy equals [swap pop i] as per a previous
# post on Church-encoding of booleans
if binding_probability > pivot do
[@truthy | tail]
else
[@falsy | tail]
end
end
``````

The comments explain the code fairly well. The heavy lifting is of course done in `Belign.Similarity.Global.similarity/2`. Without going into the details, you can imagine that function as representing a similarity matrix that has rows and columns for each molecule (function or quotation) in our artificial life system. The value at `i,j` in this matrix represents how well the candidate `j` matches the pattern `i`.

One aspect to highlight is that the similarity function can also deal with cases in which the pattern or candidate is not just a single function, but a list of functions (a quotation). This is where the Needleman-Wunsch algorithm comes in. When dealing with quotations, the similarity function delegates to
`Belign.align_global(pattern, candidate, opts)`. One of the options that is passed as the final argument is
this: `similarity: Belign.Similarity.Global` which tells the Needleman-Wunsch algorithm to use `Belign.Similarity.Global.similarity/2` as it's similarity score function. This recursive relationship between `Belign.Similarity.Global.similarity/2` and `Belign.align_global/3` means that we can determine the degree to which molecules, containing arbitrary levels of nesting, match one another.

### Testing the `match` predicate function

We have not discussed testing in any of the posts so far. However, a solid suite of tests at every level of abstraction
is going to be indispensable as we go forward. For most of the functions we've come across so far, unit tests are trivial to implement. The pure nature of these functions means that a given input will always lead to a given output. So we just have to think of all the cases we can hit, and we're done.

Here is a first approach at testing the `match` function.

``````test "match/1 correctly matches complementary mRNA sequences" do
assert ~J"[a u g] [u a c] match" == ~J"true"
assert ~J"[a u u] [u a c] match" == ~J"false"
assert ~J"[a u c] [u a c] match" == ~J"false"
assert ~J"[a u a] [u a c] match" == ~J"false"
# ... and many more
end
``````

However, the test only passes for a fraction of the times that we run it. This is because the `match` function is not a pure function. It has a random component to it. This is where `mox` comes in. `Slyk.Random.random/0` delegates to an underlying implementation. Under normal operating
conditions, it delegates to a function that calculates a uniform random floating point number between 0 and 1. An implementation of which looks like this:

``````defmodule Slyk.Random.Uniform do
@behaviour Slyk.Random.Behaviour

@impl Slyk.Random.Behaviour
@spec random :: float
def random() do
:rand.uniform()
end
end
``````

When we test however, we want a controlled environment, which we can obtain by using a mock implementation of `Slyk.Random`. We add the following to our `test/test_helpers.exs` file:

``````Mox.defmock(Slyk.Random.Mock, for: Slyk.Random.Behaviour)
``````

And then modify the test like this:

``````test "match/1 correctly matches complementary mRNA sequences" do
# Tell Slyk.Random to delegate to Slyk.Random.Mock
Application.put_env(:slyk, Slyk.Random, implementation: Slyk.Random.Mock)

# Provide a case-specific stub implementation of Slyk.Random.Mock
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)

# The score is always above 0.8 for the first case
assert ~J"[a u g] [u a c] match" == ~J"true"
# The score is always below 0.8 for the last three cases
assert ~J"[a u u] [u a c] match" == ~J"false"
assert ~J"[a u c] [u a c] match" == ~J"false"
assert ~J"[a u a] [u a c] match" == ~J"false"
end
``````

## A primordial soup, a slimy pool

We are now finally getting close to a point where we have all the basic components required to realize our artificial life system. One of the last hurdles that we need to overcome is to devise a mechanism by which molecules (programs) in our system can bind, react, and dissociate.

For illustration purposes, the ribosome has been used like this so far:

``````mrna ribosome # leaves an unfolded protein on the stack
``````

This of course doesn't tell us how the mRNA ended up on the stack in the first place. What we really want is for molecules, and specifically enzymes like the ribosome, to bind their substrates from a pool of possibilities according to an inherent affinity. And to explicitly release any products to the pool, rather than leaving them on the stack.

Enter `getl`, `getg`, and `put`. The former two are similar to Joy's built-in `get` function which waits for input from the standard input. But they are actually closer to Elixir's `IO.gets/1` which takes a prompt as argument and then waits for input. Likewise, `put` is similar to Joy's `put` and Elixir's `IO.puts/1` that write a string to the standard output.

But instead of writing and reading strings to and from standard IO, we implement `getl` and `getg` to "read" a candidate program that matches a pattern (prompt) from a pool of programs using, respectively, the local (Smith-Waterman) and global (Needleman-Wunsch) alignment algorithms and placing the program on the stack as a quotation. Similarly, `put` takes a program from the top of the stack and inserts it into a pool of programs.

The pool that I'm referring to here is similar enough to the slimy one that we've described in the very first post of this series that I will not discuss it again, except to say that in its latest incarnation it implements the `Slyk.Pool` behaviour so that we can easily test programs that rely on it using a mock pool implementation. More on this in upcoming posts.

Here are the Elixir implementations for `getl` (`getg` is similar) and `put`:

``````def getl([pattern | tail] when is_list(pattern)) do
{:ok, match} = Slyk.Pool.getl(pattern)
[match | tail]
end
end

tail
end
``````

The default `Slyk.Pool` implementation is backed by a `GenServer`. This means that, for instance, `Slyk.Pool.getl` delegates to `Slyk.Pool.Default.getl`, which uses `GenServer.call` to make a synchronous request to a `GenServer` process, which when it deals with the incoming message, executes a callback that looks like this:

``````def handle_call({:getl, prompt}, _from, state) do
# Get a random pivot value
pivot = Slyk.Random.random()

# Find the index of the first element that has a
# binding probability that is larger than the random pivot
state
|> Enum.find_index(fn candidate ->
score = Belign.Similarity.Local.similarity(prompt, candidate)

binding_probability = Float.pow(score * 1.0, 10)
binding_probability > pivot
end)
|> case do
nil ->

index ->
{winner, rest} = List.pop_at(state, index)
end
end
``````

The probability function is a tenth-degree polynomial. We might refine this later, but for now it maps 0 to 0, 1 to 1, and most values in between to something close to 0.

For completeness, here is the callback that handles `put`. `Slyk.Pool.Default.put` uses `GenServer.cast`, which makes it asynchronous. This is appropriate because we are not interested in a reply from the pool process after it inserts the element in the pool.

``````def handle_cast({:put, element}, state) do
end
``````

### Testing the waters

There are various levels of abstraction at play here that require testing. We need to test `Slyk.Pool.Default` to ensure that `Slyk.Pool.Default.put`, `Slyk.Pool.Default.getl` and `Slyk.Pool.Default.getg` behave as expected. In these tests we care about the implementation details and so essentially wish to test the callback functions. The callback for `put` is easy to test because it is pure. The callback for `getl`, on the other hand again calls `Slyk.Random.random()` for which we will use a mock:

``````  describe "handle_call({:getl, [:a, :u, :g]}" do
test "binds [:u, :a, :c], regardless of where it is located in the pool" do
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)

# winning candidate at end of pool
pool = [[:random], [:molecules], [:u, :a, :c]]

assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)

# winning candidate in middle of pool
pool = [[:random], [:u, :a, :c], [:molecules]]

assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)

verify!()
end

test "fails if [:u, :a, :c] or something similar is not present" do
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)

# the pool
state = [[:random], [:molecules]]

assert {:reply, {:error, :no_match}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), state)

verify!()
end
end
``````

At a higher level, we should be testing our newly introduced Joy functions `getl`, `getg`, and `put`. Their functionality is, however, independent of the underlying pool implementation, so instead of mocking `Slyk.Random`, we can opt to mock `Slyk.Pool` instead. At yet a higher level, we need tests for entire programs (proteins) and how they interact with their environment. More on that in another post.

## Conclusion

We are now approaching a chemical system in which molecules can be "activated" one at a time (or concurrently - effortlessly because of Elixir). Molecules are Joy programs and activation means execution. Joy programs are really function compositions (syntactically, whitespace represents the composition operator) that are made up of functions like `pop` and `swap`, and quotations which are lists of functions. A quotation is actually also just a function that places itself on the stack.

Using special `getl`/`getg` and `put` IO-like functions, molecules can "bind" or recruit other molecules from the mix, just like enzymes or proteins would bind their substrates. They can also "release" products back into the pool. It might turn out that only one of `getl` and `getg` is required. That will be a happy day, but for now `getl` will probably be more useful in binding polymers and `getg` will be more useful in binding oligomers/monomers.

In addition, we have explored preliminary implementations of a ribosome (something like a universal constructor) and a "chaperone" (something like a universal assembler). We have also defined a preliminary artificial genetic code. Armed with the notions of affinity, association, and dissociation developed in this post, we can now develop fully-fledged ribosomes, chaperones, and the multitude of other enzymes that we'll need (this is only the beginning). We can also then define the final version of an artificial genetic code, which we'll use to reverse-construct something like DNA, which will contain recipes for all the enzymes and rRNA in our system.