DEV Community: Maarten Demeyer

Game of Life, the DTerminal edition

Maarten Demeyer — Tue, 07 Jan 2020 00:00:00 +0000

Yet another offshoot from my Advent of Code 2019 adventures (first half, second half, all solutions): on day 24, the challenge was to program a variant of Conway’s game of life, and I figured I might as well try my approach on the real thing!

A quick Google search for existing implementations yields three main approaches: nested for loops, shifting of matrix rows and columns, and repeated filtering of a coordinate data.frame. Visualisations tend to rely on R’s plotting capabilities, and more recently {gganimate}.

I used data.table for the computations, because it’s fast and succinct. Here’s the setup of the Game of Life universe, randomly seeding half of the cells as alive, and defining the relevant relative ‘neighbourhood’ of each cell through a small auxiliary table. For those unfamiliar with data.table, CJ() performs a cross-join to obtain all combinations of the vector arguments.

library(data.table)

dims <- c(49, 49)
universe <- CJ(x = seq(dims[1]), y = seq(dims[2]), k = 1, cell = FALSE)
universe[, cell := sample(c(FALSE, TRUE), prod(dims), TRUE)]

neighbours <- CJ(xd = -1:1, yd = -1:1, k = 1)[xd != 0 | yd != 0]

Next, we want to define a function to perform one step (or tick) of the game. The basic approach is to do a full Cartesian join of the neighbourhood and the universe, to determine the neighbouring coordinates of each cell. We clip off at the edges (unlike a proper GoL universe, which is infinite), and aggregate grouped by the original cell coordinate to count the number of live neighbours. data.table allows us to express all of this in a really compact manner:

gol_tick <- function(xy, sz, nb) {

  nb[xy, on = .(k), allow.cartesian = TRUE
    ][, nbx := x + xd][, nby := y + yd
    ][nbx >= 1 & nbx <= sz[1] & nby >= 1 & nby <= sz[2]
    ][xy, on = .(nbx = x, nby = y)
    ][, .(nnb = sum(i.cell)), by = .(x, y, cell, k)
    ][!cell & nnb == 3, cell := TRUE
    ][cell & (nnb < 2 | nnb > 3), cell := FALSE
    ][, nnb := NULL]

}
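
For comparison, the matrix-shifting approach mentioned at the start can be sketched in a few lines of base R. This is not the method used in this post, only an assumed minimal version: gol_tick_matrix() and the blinker test pattern are illustrative names, and the universe is a plain logical matrix that is clipped at the edges, as above.

```r
# Count the live neighbours of every cell by summing eight shifted copies
# of a zero-padded matrix; cells outside the grid count as dead.
gol_tick_matrix <- function(m) {
  nr <- nrow(m)
  nc <- ncol(m)
  padded <- matrix(0L, nr + 2, nc + 2)
  padded[1 + seq_len(nr), 1 + seq_len(nc)] <- m
  nnb <- matrix(0L, nr, nc)
  for (dx in -1:1) {
    for (dy in -1:1) {
      if (dx != 0 || dy != 0) {
        nnb <- nnb + padded[1 + dx + seq_len(nr), 1 + dy + seq_len(nc)]
      }
    }
  }
  # Birth on exactly 3 neighbours, survival on 2 or 3
  nnb == 3 | (m & nnb == 2)
}

# A 'blinker' flips between a horizontal and a vertical bar
blinker <- matrix(FALSE, 5, 5)
blinker[3, 2:4] <- TRUE
next_gen <- gol_tick_matrix(blinker)
which(next_gen, arr.ind = TRUE)
```

A known oscillator such as the blinker is a handy sanity check for any tick function: two ticks should return the original pattern.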

So how about some visuals - and perhaps a bit of interaction? I chose to do this in the terminal, just to make the point that you can easily create these old-school games fully in R! You will need an ANSI-capable terminal emulator though, such as the default Ubuntu one. Do make it large enough (or the font small enough).

First, the interaction part. To collect keypresses without pausing the universe to prompt the user, we need the {keypress} package. Usage is as simple as calling keypress(FALSE) to get the currently pressed key. Second, the visuals. Geometric unicode characters can provide a nice grid layout, but how do we ensure that we update the visuals with each tick, instead of spitting out an endless sequence of universe states into the terminal? The answer is ANSI escape codes, which allow you to colour the output, clear terminal lines, and crucially move the cursor back to a previous position. All of this is achieved simply by outputting strings starting with \033[ (or \u001B[), followed by the ANSI instruction. For a more user-friendly interface to many of these functionalities have a look at the {cli} package - but here is the fully manual approach:

library(keypress)

cat("\033[?25l")

repeat ({

  kp <- keypress(FALSE)

  universe[order(y, x)
    ][, cat(fifelse(.SD$cell, "\033[32;40m◼", "\033[90;40m◌"),
            "\033[K\n"),
      by = y]

  cat("\033[2K\033[33;40m", sum(universe$cell), "\n")

  if (kp == "q") break

  if (kp == "x") {
    new_cells <- sample(c(FALSE, TRUE), prod(dims), TRUE, c(9, 1))
    universe[, cell := cell | new_cells]
  }

  Sys.sleep(0.2)

  cat(paste0("\033[", dims[2] + 1, "A"))

  universe <- gol_tick(universe, dims, neighbours)

})

cat("\033[?25h")

The game speed is throttled using Sys.sleep(), and the number of cells currently alive is displayed at the bottom. Two keys will be interpreted: q exits the game, and x inserts new cells at random locations, to bring some new life to the eventually oscillatory universe!
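
The escape sequences in the loop can be hard to read in raw form, so here is a small sketch that wraps them in named helpers. The helper names (csi, cursor_up, clear_line, colour) are made up for illustration; only the sequences themselves are the same ANSI codes used above.

```r
# Hypothetical helper names; the escape sequences are the ones used above
csi <- function(...) paste0("\033[", ...)
cursor_up  <- function(n) csi(n, "A")                 # move the cursor n lines up
clear_line <- function() csi("2K")                    # erase the current line
colour     <- function(fg, bg) csi(fg, ";", bg, "m")  # set fore/background colour

# Overwrite a single status line three times instead of printing three lines
for (i in 3:1) {
  cat(clear_line(), colour(33, 40), "ticks left: ", i, csi("0m"), "\n", sep = "")
  Sys.sleep(0.2)
  if (i > 1) cat(cursor_up(1))
}
```

The trailing csi("0m") resets the colours, which avoids leaving the terminal in yellow-on-black after the script ends.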

The full code can be found in this gist. Run Rscript game_of_life.R to see it in action in your terminal.

Now go forth and multiply!

Advent of Code, the second half

Maarten Demeyer — Tue, 31 Dec 2019 00:00:00 +0000

So Advent of Code 2019 ended last week, and I got all 50 stars. The puzzles became considerably harder compared to the first half, but base R did allow acceptably efficient solutions in almost all cases. My code is still on GitHub - here’s what I learned about R by writing it! 🎄

Tail recursion grows the stack

I kind of knew this already, but while R’s functional style invites using recursive functions, you will run into trouble if you recurse too deep. The simple reason is that each recursion will add to the call stack even when the recursion is the last expression of the function body (tail position). Many programming languages optimise for this situation to prevent stack growth, but R does not.
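
A quick way to see this for yourself is to ask R how many frames are on the call stack at the bottom of a tail-recursive function. This is only a sketch; frames_at_bottom() is a made-up name.

```r
frames_at_bottom <- function(n) {
  if (n > 0) {
    frames_at_bottom(n - 1)   # tail position, yet R still pushes a new frame
  } else {
    sys.nframe()              # frame number of the current call
  }
}

# Every level of recursion costs exactly one extra frame
frames_at_bottom(100) - frames_at_bottom(0)
```

With tail call optimisation the difference would be zero; in R it equals the recursion depth.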

In particular, naive recursive solutions to the maze puzzles of Day 18 and Day 20 often led me into the dreaded Error: C stack usage XXXXXXX is too close to the limit message. There were hacky ways around this which allowed me to get the 🌟🌟 anyway, but the real solution I eventually found in this excellent blogpost by Alan Dipert. It’s called the trampoline and it implements recursion as iteration. I cannot possibly do a better job than Alan explaining it, so go and read his article!
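
To give a flavour of the idea, here is a minimal trampoline sketch. It is not taken from the linked article: trampoline() and countdown_step() are hypothetical names, and the sketch assumes that the final result is never itself a function.

```r
# A step function returns either a final value or a zero-argument function
# (a 'thunk') that performs the next step when called. The trampoline keeps
# calling thunks in a loop, so the call stack never grows.
trampoline <- function(f, ...) {
  result <- f(...)
  while (is.function(result)) {
    result <- result()
  }
  result
}

countdown_step <- function(n) {
  if (n > 0) {
    function() countdown_step(n - 1)   # describe the next step, don't run it
  } else {
    "done"
  }
}

# A depth that plain recursion would typically not survive with default settings
trampoline(countdown_step, 1e5)
```

The price is that the recursive function has to be rewritten to return thunks, and that every step allocates a small closure.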

Related to this, using deep recursion on several of the Advent of Code days made it apparent that creating unnecessary temporary variables in the function body will grow the size of each execution environment, and thereby the total memory usage of R.

Consider:

countdown <- function(n) {
    if (n > 0) {
        tmp <- runif(10000)
        countdown(n - 1)
    } else {
        gc()
    }
}

countdown(100)

##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  543979 29.1    1233580 65.9   641291 34.3
## Vcells 2017084 15.4    8388608 64.0  2536825 19.4

countdown(1000)

##            used (Mb) gc trigger  (Mb) max used (Mb)
## Ncells   550718 29.5    1233580  65.9   641291 34.3
## Vcells 11018937 84.1   15315970 116.9 11103124 84.8

Comment out the tmp assignment to see that it is indeed tmp which causes the difference.

So why use recursion at all in R, if it requires all these extra steps and considerations? For me, it is simply a better fit to the way my brain works when parsing a problem, and I am less prone to making mistakes as compared to using while/for flow control with constantly updated variables. As a bonus, recursive functions are also easier to unit test!

<<- can be very convenient

In my professional coding life I have rarely used the <<- operator, which assigns to an existing variable in the enclosing environments of a function (falling back to the global environment if none is found). Because of R’s lexical scoping, the enclosing environment is always the one in which the function was defined, not the one in which it was called. But unfortunately <<- causes functions to have side-effects, making code using <<- hard to read and to debug.

Yet in Advent of Code, I used it liberally. Often as a timesaver, so I didn’t have to pass variables around as function arguments. But on Day 14, it became also a lifesaver for keeping a common state between recursive function calls.

Consider this example:

countdown_with_offshoot <- function(n) {
    if (n > 0) {
        if (sum(vals) > 100) {
            vals <<- sample(vals, floor(length(vals) / 2))
            countdown_with_offshoot(sum(vals > 1))
        } else {
            vals <<- c(vals, sample(10, 1))
        }
        countdown_with_offshoot(n - 1)
    }
}

vals <- numeric(0)
countdown_with_offshoot(100)

This will sometimes launch an additional countdown based on numbers generated inside the recursive function calls. Because of lexical scoping, the enclosing environment is the same no matter how deep the recursions run, and <<- will therefore always assign to the same variable.

Can this be solved without <<-? Of course. But you will need to both pass vals along with the countdown itself, and return it when the offshoot countdown ends. And every function execution environment will keep its own state of vals in memory. <<- simplifies the code greatly and reduces memory use. For a one-off coding puzzle, that is a more than acceptable trade-off!
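
For reference, this is roughly what the version without <<- could look like. It is a sketch under the assumptions described above (vals is passed in and returned everywhere); countdown_pure() is a made-up name.

```r
countdown_pure <- function(n, vals) {
  if (n > 0) {
    if (sum(vals) > 100) {
      vals <- sample(vals, floor(length(vals) / 2))
      # The offshoot must hand its final state back to the caller
      vals <- countdown_pure(sum(vals > 1), vals)
    } else {
      vals <- c(vals, sample(10, 1))
    }
    countdown_pure(n - 1, vals)
  } else {
    vals   # the base case returns the state instead of nothing
  }
}

set.seed(42)
final_vals <- countdown_pure(100, numeric(0))
length(final_vals)
```

Every call now carries its own copy of vals as an argument, which is exactly the bookkeeping that <<- lets you skip.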

R does not support large integers

Day 22 was a mathematical puzzle requiring modular exponentiation of large integers. To my surprise, I was unable to solve it in base R by implementing modular exponentiation myself, because there is no support for very large integers! The integer type is 32-bit, and the double type has imperfect precision above 2**53.

## Explicitly casting to integer will yield NA
as.integer(2**53)

## Warning: NAs introduced by coercion to integer range

## [1] NA

## Below 2**53 a double represents integers accurately
(2**52 + 1:10) %% 2

## [1] 1 0 1 0 1 0 1 0 1 0

## Above 2**53 computations become inaccurate
## There is no warning for this behaviour
(2**53 + 1:10) %% 2

## [1] 0 0 0 0 0 0 0 0 0 0

I had to resort to the {gmp} package here, a wrapper around the GNU Multiple Precision Arithmetic Library, which does allow accurate computations with large integers. To install on Debian derivatives, run sudo apt-get install libgmp-dev prior to installing the R package.

(gmp::as.bigz(2**53) + 1:10) %% 2

## Big Integer ('bigz') object of length 10:
## [1] 1 0 1 0 1 0 1 0 1 0

Fortunately, {gmp} also has modular exponentiation built in, as gmp::powm(), so I could skip the manual implementation altogether!
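
To make the limitation concrete, here is a sketch of what a manual implementation in base R might look like, using the usual square-and-multiply scheme. It is not my actual Day 22 attempt: mod_pow() is an illustrative name, and the guard in the first line spells out the assumption that all intermediate products stay below 2**53.

```r
# Square-and-multiply with doubles. Only exact while (mod - 1)^2 < 2^53,
# because every intermediate product must be exactly representable.
mod_pow <- function(base, exp, mod) {
  stopifnot(mod > 0, exp >= 0, (mod - 1)^2 < 2^53)
  result <- 1 %% mod
  base <- base %% mod
  while (exp > 0) {
    if (exp %% 2 == 1) result <- (result * base) %% mod
    base <- (base * base) %% mod
    exp <- exp %/% 2
  }
  result
}

mod_pow(4, 13, 497)

## [1] 445
```

For a modulus larger than roughly 2**26.5 the guard fails, and that is where gmp::powm() takes over.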

More for the wishlist

Last time I complained about R not allowing key-value lookups using keys of any type, like Python’s dictionaries do allow (well, for immutable types). I ran into this problem multiple times again, although in the case of (x,y) coordinates I sometimes used complex numbers instead in a single data frame column, on which I could then filter to retrieve the associated values. But that is a rather limited use case.
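
As an illustration of the complex number trick, here is a minimal sketch. The grid data frame, its tile column and the lookup() helper are invented for this example.

```r
# One complex column holds both coordinates: real part = x, imaginary part = y
grid <- data.frame(
  pos = complex(real = c(0, 1, 1), imaginary = c(0, 0, 1)),
  tile = c("start", "wall", "key"),
  stringsAsFactors = FALSE
)

# Filter on the coordinate to retrieve the associated value
lookup <- function(x, y) grid$tile[grid$pos == complex(real = x, imaginary = y)]
lookup(1, 1)

# As a bonus, moving around the grid is plain addition
here <- complex(real = 0, imaginary = 0)
moves <- c(up = 0+1i, right = 1+0i)
lookup(Re(here + moves[["right"]]), Im(here + moves[["right"]]))
```

This only works because the coordinates are small integers, which doubles represent exactly; the filter is also a linear scan, not a hashed lookup.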

Given that most of the Advent of Code challenges were general computing and not data science puzzles, R’s limitations in this regard were exposed as the problems became more complex. One-based indexing was sometimes slightly annoying when relying on modulus operations to access a vector, but this is unlikely to ever change in R. Immediate unpacking of multiple return values, however, would be a nice feature. The {zeallot} package implements this but it would be great to see it in base R, so we don’t always need to construct and pick apart list objects.

For instance, from the documentation of {zeallot}:

library(zeallot)

c(lat, lon) %<-% list(38.061944, -122.643889)
print(lat)

## [1] 38.06194

Likewise I was often bothered by how heavy-handed it feels to define a class in R, or at least some other type of structural template for pieces of non-tabular data which belong together. When using R for data science this is rarely a problem, but for general computing some syntactic elegance would be welcome - perhaps something similar to defrecord in Clojure. In practice I usually just relied on a loosely defined list instead, but this becomes hard to manage for complex code.

Oh, and give me graphs

Around 11 out of 14 minutes of the total running time for all 25 days of my Advent of Code solutions are spent on Day 18 and Day 20, both maze pathfinding puzzles. They were solved through a combination of random walks where visited paths are tagged as such, and recursive functions. Other people, using other languages, appear to have relied on graphs a lot, where the shortest path in a maze is the shortest path across the nodes of a graph.

I could have tried to implement the main algorithms used, such as Dijkstra’s and BFS, in base R, but I suspected that it would be too inefficient without access to {Rcpp}. For implementing such algorithms natively a language like Julia would shine - not R. Now that Advent of Code and my self-imposed restriction to base are over though, I’d love to try and see how well {igraph} would perform in R on these particular challenges!
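
For the curious, a breadth-first search on a small grid maze can be sketched in base R as below. This is not one of my solutions and says nothing about performance on the real puzzle inputs. The maze, the bfs_steps() name and the S/E markers are invented, and the sketch assumes the maze is fully surrounded by walls so that neighbouring linear indices never wrap around the matrix edge.

```r
maze <- c("#######",
          "#S..#.#",
          "#.#.#.#",
          "#.#...#",
          "#...#E#",
          "#######")
grid <- do.call(rbind, strsplit(maze, ""))

bfs_steps <- function(grid, from = "S", to = "E") {
  start <- which(grid == from)   # linear (column-major) index
  goal <- which(grid == to)
  nr <- nrow(grid)
  dist <- rep(NA_integer_, length(grid))
  dist[start] <- 0L
  frontier <- start
  # Expand the whole frontier at once: all its cells share the same distance
  while (length(frontier) > 0 && is.na(dist[goal])) {
    nxt <- unique(c(frontier - 1L, frontier + 1L, frontier - nr, frontier + nr))
    nxt <- nxt[nxt >= 1 & nxt <= length(grid)]
    nxt <- nxt[grid[nxt] != "#" & is.na(dist[nxt])]
    dist[nxt] <- dist[frontier[1]] + 1L
    frontier <- nxt
  }
  dist[goal]   # NA if the goal cannot be reached
}

bfs_steps(grid)

## [1] 7
```

Because the frontier is handled as a vector, each BFS layer costs only a handful of vectorised operations, which is about as R-friendly as this algorithm gets.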

But that’ll be something for the new decade. HNY! 🎆

Advent of Code, the first half

Maarten Demeyer — Wed, 11 Dec 2019 00:00:00 +0000

’tis the season to be coding! As I’m sure you’ve all noticed (cough) I have been rather quiet on the blogging front lately because my leisure coding time has been consumed by the annual Advent of Code challenge. It would have been fun to use this to become more proficient in Clojure, but I don’t have that much leisure coding time, so base R it is. The main goal, other than plain old fun, is to give myself a good refresh of what base can do.

All solutions can be found on my GitHub.

Base is rich and versatile

I’ve been surprised to remember how much I actually like base R, given how avid a user I’ve been of data.table, purrr and rlang in recent years. The apply family, array operations and functional programming tools like Reduce() will get you very far indeed on moderately sized data. When using data frames base R falls well short in terms of API, compared to the modern champions of data munging, but it does still do the job.

Inevitably I’ve learned a few new things about base R. Here are some of the highlights.

rle() was brought to my attention by the Day 4 solution of Adam Gruer. This was easily my worst day of Advent Of Code in terms of succinctness and speed of the solution, and it would have helped a lot to know that Run Length Encoding of repeated values in a vector is available in base R, just like that.

rle(c("a","a","b"))

## Run Length Encoding
## lengths: int [1:2] 2 1
## values : chr [1:2] "a" "b"

strsplit() was not a new function to me, but I didn’t know it can take split = "" as an argument to just split out every individual character! Very helpful when many puzzle inputs are given as plain text character sequences.

strsplit("abcde", split = "")

## [[1]]
## [1] "a" "b" "c" "d" "e"

which() is another basic function that I’ve used so often before, without realising it can compute n-D array indices through the arr.ind = TRUE argument.

mat <- matrix(sample(c(FALSE, TRUE), size = 16, replace = TRUE), ncol = 4)
which(mat, arr.ind = TRUE)

##      row col
## [1,]   1   1
## [2,]   3   1
## [3,]   2   2
## [4,]   2   3
## [5,]   3   3

merge() is the natural replacement of dplyr’s join functions in base R, but did you know it can also do a full crossing of a data frame with itself? Just set the by argument explicitly to empty.

df <- data.frame(a = letters[1:3], b = 1:3)
merge(df, df, by = character(0), suffixes = c("_one", "_two"))

##   a_one b_one a_two b_two
## 1     a     1     a     1
## 2     b     2     a     1
## 3     c     3     a     1
## 4     a     1     b     2
## 5     b     2     b     2
## 6     c     3     b     2
## 7     a     1     c     3
## 8     b     2     c     3
## 9     c     3     c     3

Not all is great, however. Today (Day 11) I really missed being able to key named lists or envs by anything other than strings. In Python dicts you can use any immutable type, including tuples. So I could have directly used (x,y) coordinates as a key, instead of having to awkwardly paste and parse them to and from strings (or take a cumbersome detour via data frame indexing). A longstanding shortcoming for R as a general-purpose computing language, in my opinion.
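
To show what that pasting and parsing looks like in practice, here is a small sketch. The panels environment and the key() helper are hypothetical names, not code from my solution.

```r
panels <- new.env()

# Coordinates have to be flattened into a string before they can act as a key
key <- function(x, y) paste(x, y, sep = ",")

panels[[key(0, -1)]] <- "white"
panels[[key(2, 3)]] <- "black"
panels[[key(0, -1)]]

# ...and parsed back into numbers whenever the coordinates themselves are needed
coords <- do.call(rbind, lapply(strsplit(ls(panels), ","), as.numeric))
coords
```

It works, but every lookup pays for a string conversion and the numeric meaning of the key is lost until it is parsed again.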

…not another intcode!

I’ve been particularly fond of the ‘intcode’ challenges. With ‘fond of’, I mean I hated them. But in a good way. Not to be too cryptic to non-participants, these are challenges where a sequence of integers is to be processed as if they were read/write/jump/print/… instructions within the sequence, each followed by a variable number of parameters. With every new challenge, more instructions and possible interpretations of the integers are added.

This means these intcode challenges are really challenges about software design and development, where scope is ever creeping and requirements are ever changing. I like this. It forces you to strike a balance between being pragmatic about the challenge of the day, and still making the solution extensible on future days. This to me is far more relevant and realistic as a coding challenge, and far more telling of actual skill and craft as a developer, than coming up with a clever algorithmic one-liner.

A favourite

If I could pick a favourite so far, it would have to be Day 6: Universal Orbit Map. A list of objects is given such that each object orbits exactly one other object. The challenge is to compute how many orbits there are in total, both direct and indirect. Many people were immediately triggered into thinking ‘graph!’, and in truth, so was I. But being restricted to base R, I looked for a less exhausting solution, and quickly realised that there was really no need for a full graph: we only need to trace each individual object back to the common center of gravity and add up all the steps.

Here is the full solution, using this input:

inp <- read.delim("aoc_input6.txt", sep = ")", header = FALSE,
                  col.names = c("orbitee", "orbiter"),
                  stringsAsFactors = FALSE)

memo_env <- new.env()

count_orbits <- function(obj) {
  if (!is.null(memo_env[[obj]])) {
    memo_env[[obj]]
  } else if (obj %in% inp$orbiter) {
    memo_env[[obj]] <- 1 + count_orbits(inp[inp$orbiter == obj,]$orbitee)
  } else {
    0
  }
}

sum(sapply(unique(inp$orbiter), count_orbits))

## [1] 142915

I enjoyed this because there is so much really powerful stuff going on in so few lines of code. First we read the data just like a two-column CSV, but using ) as the delimiter. The data frame structure comes very naturally to R, and often offers an intuitive insight into a problem. This goes in general for many of the Advent of Code challenges - when using a data frame you always have the current state of the data clearly visible in front of you, and can work incrementally towards a solution using all of R’s great data frame manipulation tools. In this case the data frame is mainly useful as an explicit lookup table.

Then there is recursion going on - the count_orbits() function calls itself to find the next orbited object, until it reaches the center. Every recursion adds 1 to the counter. Because many of these paths are partially shared, the recursive function can be sped up considerably by using memoisation - we really don’t need to retrace a path we’ve been down already. Instead, we save the solution in a dedicated environment (a named list or vector would work as well, but environments scale better for larger problems) and retrieve it as needed.

Finally, the sapply call shows that you don’t need for-loops much, even in base R, if you can condense repeated independent operations on vector elements into a unary function. The resulting vector is immediately aggregated into the puzzle solution on the same line. A recipe that is succinct and clear, and oh so widely applicable. However much I love purrr for its explicit consistency, 90%+ of its practical use cases are really already covered in base R - it’s easy to forget when you’re hooked on the %>%.

All right then

Back into Santa’s service I go!

R scripts as command line tools

Maarten Demeyer — Wed, 13 Nov 2019 00:00:00 +0000

Most R users rely heavily on the interactive console for writing and executing code, but sometimes you will want to expose your work to a world outside of that cosy cocoon. One solution is to wrap a web API around your code, for instance using the excellent {plumber}. The other main option is to wrap it in a command-line interface (CLI), so it can be used from the shell like any regular program.

To run R code directly from a Linux shell, we must use the Rscript executable instead of regular R (or alternatively, {littler}). For instance, suppose we want to expose the functionality of the {emo} package - retrieve an emoji by name.

# To install {emo}, do: remotes::install_github("hadley/emo")
# commandArgs() returns the arguments provided on the command line as a character vector
args <- commandArgs(trailingOnly = TRUE)

# Use cat() for output
cat(emo::ji(args[[1]]), "\n")

Save this code to getemo.R, and then this is what Rscript enables us to do:

Rscript getemo.R unicorn

(the character encoding and font used by your terminal must support rendering unicode emojis)

So far so good, and in many cases this will even be good enough! But our little script does not yet behave like a regular command-line tool, where we would hope to just be able to do from any directory:

getemo unicorn

So how?

First let me give a hat tip to Colin Fay for describing a method where you let the Node package manager, npm, do all the work. This is really great when you are already a Node user, but if you’re not it is probably too much overhead simply to expose an R script.

Three simple steps are all we really need.

1. Add a shebang line at the top

Linux systems (and other Unix-likes) have a standard way of specifying in a comment at the top of plain-text scripts which program should be used to run them - the shebang. You could use:

#!/usr/bin/Rscript --vanilla

The --vanilla argument makes sure that user-specific R settings are ignored, and that there is no saving or restoring of workspaces. This makes the script more portable to other systems.

You will often see /usr/bin/env Rscript used, also for reasons of portability. But there are some potential version differences if you want to specify the --vanilla argument, so if you’re unsure just stick to the direct version for now.

2. Make the script executable

Before you can run a script directly from the command line, you must tell the file system that this plain text file is indeed a script which can be run, rather than any old text file, and that this user is allowed to run it. So do:

chmod +x getemo.R

You will be able to check with a simple ls -all getemo.R command that it has received x’s in its permissions. It can now be run with:

./getemo.R unicorn

We’re getting there, but we still needed to specify the path to the script explicitly to be able to run it (in this case, the local directory .).

3. Make it available in your $PATH

Now for the ‘from any directory’ part of the requirements. Running a program from any directory is typically achieved by adding the directory of that program to the $PATH environment variable. But we don’t want to be littering this with custom script directories so let’s see what is already in $PATH, using echo $PATH. Either ~/bin or ~/.local/bin, or both, are often present in desktop Linux installations; these are standard directories for executables belonging to your home directory. If you ls -all them you will notice they mainly contain symbolic links to files in other directories.

This is exactly what we are going to do with our R script as well. In addition, let’s drop the .R extension when creating that symbolic link with ln -s. So, assuming that ~/bin/ is in our $PATH:

ln -s -r getemo.R ~/bin/getemo

The -r option makes the link between the paths relative, since I don’t know in which directory the original script would be placed on your computer. You can omit it but then you should specify both paths in full.

Now we can get the desired result by executing, from any directory:

getemo unicorn

The full script, including the shebang, can be found in this gist.

Do more

We didn’t make the CLI tool available to the entire system here, only to your user profile. Unless you have good reasons to the contrary, I would advise keeping it that way. Many Linux desktops have in practice only one user, and inside Docker containers you can use the root user. If you do want to expose a script system-wide, /usr/local/bin is usually the appropriate directory.

To learn how to write better CLI tools in R, have a look at Mark Sellors’ blog on the topic. For parsing and documenting CLI arguments, I personally prefer {argparse} if the Python dependency is not an issue; and {docopt} when it is. For stylising the visual outputs, {cli} is great.
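
Even without those packages, a little argument checking already makes a script behave more like a regular tool. The sketch below uses only base R; the main() function, the usage text and the exit codes are assumptions for illustration and are not part of the getemo.R gist.

```r
# Returns an exit status: 0 on success, 1 on incorrect usage
main <- function(args) {
  if (length(args) != 1 || args[[1]] %in% c("-h", "--help")) {
    # Usage messages go to stderr so they don't pollute piped output
    cat("Usage: getemo <emoji-name>\n", file = stderr())
    return(1L)
  }
  # The real script would call emo::ji(args[[1]]) here
  cat("Looking up:", args[[1]], "\n")
  0L
}

# In the script itself, the last line would be:
# quit(status = main(commandArgs(trailingOnly = TRUE)))
main("unicorn")
main(character(0))
```

Returning a proper exit status matters as soon as the tool is called from a shell script or a scheduler, which decide on success or failure based on that value.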

Cool, but why?

Well here’s a question people should ask more often! Some of the reasons I can see are:

Make your work available to non-R users in an interface they can understand
Make your work available to production frameworks that run tasks as commands, like Airflow
For your own convenience, expose R-specific functionality to the command line. emo::ji() was arguably not the world’s greatest example of a productivity improvement, but what about wrapping something like {skimr}, to beautifully preview CSVs?
R might not always be the objectively best tool for the job but it could well be your best tool. If R is indeed the language you are most fluent in, it will probably be most productive for you to script even non-data-related tasks in R.
If written properly, an R command line tool can be used together with other, non-R command line tools, for instance to provide data to it or to further process its output. But that’s a topic for another time!

Cool.

I know!