DEV Community: Bruno Rodrigues

Reproducible data science with Nix, part 3 -- frictionless {plumber} api deployments with Nix

Bruno Rodrigues — Wed, 02 Aug 2023 07:52:31 +0000

This is the third post in a series of posts about Nix. Disclaimer: I’m a super beginner with Nix. So this series of blog posts is more akin to notes that I’m taking while learning than a super detailed tutorial. So if you’re a Nix expert and read something stupid in here, that’s normal. This post is going to focus on R (obviously) but the ideas are applicable to any programming language.

This blog post is part tutorial on creating an api using the {plumber} R package, part an illustration of how Nix makes developing and deploying a breeze.

Part 1: getting it to work locally

So in part 1 I explained what Nix was and how you could use it to build reproducible development environments. In part 2 I talked about running a {targets} pipeline in a reproducible environment set up with Nix, and in this blog post I’ll talk about how I made an api using {plumber} and how Nix made going from my development environment to the production environment (on Digital Ocean) the simplest it has ever been. Originally I wanted to focus on interactive work using Nix, but that’ll very likely be for part 4, maybe even part 5 (yes, I really have a lot to write about).

Let me just first explain what {plumber} is before continuing. I already talked about {plumber} here, but in summary, {plumber} allows you to build an api. What is an api? Essentially a service that you can call in different ways and which returns something to you. For example, you could send a Word document to this api and get back the same document converted to PDF. Or you could send some English text and get back a translation. Or you could send some data and get a prediction from a machine learning model. It doesn’t matter: what’s important is that apis completely abstract away the programming language that is being used to compute whatever should be computed. With {plumber}, you can create such services using R. This is pretty awesome, because it means that whatever it is you can make with R, you could build a service around it and make it available to anyone. Of course you need a server that actually has R installed and that gets and processes the requests it receives, and this is where the problems start. And by problems I mean THE single biggest problem that you have to deal with whenever you develop something on your computer, and then have to make it work somewhere else: deployment. If you’ve never had to deal with deployments you might not understand why it’s so hard. I certainly didn’t really get it until I wanted to deploy my first Shiny app, many moons ago. And this is especially true whenever you don’t want to use any “off the shelf” services like shinyapps.io. In the blog post I mentioned above, I used Docker to deploy the api. But Docker, while an amazing tool, is also quite heavy to deal with. Nix offers an alternative to Docker which I think you should know and think about. Let me try to convince you.

So let’s make a little {plumber} api and deploy that in the cloud. For this, I’m using Digital Ocean, but any other service that allows you to spin up a virtual machine (VM) with Ubuntu on it will do. If you don’t have a Digital Ocean account, you can use my referral link to get 200$ in credit for 60 days, more than enough to experiment. A VM serving a {plumber} api needs at least 1 gig of RAM, and the cheapest one with 1 gig of RAM is 6$ a month (if you spend 25$ of that credit, I’ll get 25$ too, so don’t hesitate to experiment, you’ll be doing me a solid as well).

I won’t explain what my api does, this doesn’t really matter for this blog post. But I’ll have to explain it in a future blog post, because it’s related to a package I’m working on, called {rix} which I’m writing to ease the process of building reproducible environments for R using Nix. So for this blog post, let’s make something very simple: let’s take the classic machine learning task of predicting survival of the passengers of the Titanic (which was not that long ago in the news again…) and make a service out of it.

What’s going to happen is this: users will make a request to the api giving some basic info about themselves, a simple ML model (I’ll go with logistic regression and call it “machine learning” just to make the statisticians reading this seethe lmao) is going to use this to compute a prediction, and the result will be returned to the user. Now to answer a question that comes up often when I explain this stuff: why not use Shiny? Users can enter their data and get a prediction and there’s a nice UI and everything?! Well yes, but it depends on what it is you actually want to do. An api is useful mostly in situations where you need that request to be made by another machine and then that machine will do something else with that prediction it got back. It could be as simple as showing it in a nice interface, or maybe the machine that made the request will then use that prediction and insert it somewhere for archiving for example. So think of it this way: use an api when machines need to interact with other machines, a Shiny app for when humans need to interact with a machine.

Ok so first, because I’m using Nix, I’ll create an environment that will contain everything I need to build this api. I’m doing that in the simplest way possible, by specifying an R version and the packages I need inside a file called default.nix. Writing this file if you’re not familiar with Nix can be daunting, so I’ve developed a package, called {rix}, to write these files for you. Calling this:

rix::rix(r_ver = "4.2.2",
         r_pkgs = c("plumber", "tidymodels"),
         other_pkgs = NULL,
         git_pkgs = NULL,
         ide = "other",
         path = "titanic_api/", # you might need to create this folder
         overwrite = TRUE)

generates this file for me:

# This file was generated by the {rix} R package on Sat Jul 29 15:50:41 2023
# It uses nixpkgs' revision 8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8 for reproducibility purposes
# which will install R version 4.2.2
# Report any issues to https://github.com/b-rodrigues/rix
{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8.tar.gz") {} }:

  with pkgs;

  let
  my-r = rWrapper.override {
    packages = with rPackages; [
      plumber tidymodels
    ];
  };
  in
  mkShell {
    buildInputs = [
      my-r
      ];
  }

(for posterity’s sake: this is using this version of {rix}. Also, if you want to learn more about {rix} take a look at its website. It’s still in very early development, comments and PR more than welcome!)

To build my api I’ll have to have {plumber} installed. I also install the {tidymodels} package. I actually don’t need {tidymodels} for what I’m doing (base R can fit logistic regressions just fine), but the reason I’m installing it is to mimic a “real-world example” as closely as possible (a project with some dependencies).
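To make the remark about base R concrete, here is a minimal sketch of a logistic regression fitted with glm(). The toy data frame below is made up purely for illustration (it is not the Titanic data), and its column names are only assumed to mirror the ones used later in this post.

```r
# Made-up toy data, for illustration only (NOT the real Titanic data)
toy <- data.frame(
  survived = factor(c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1)),
  sex = c("female", "male", "female", "male", "male",
          "female", "male", "female", "male", "female"),
  age = c(29, 40, 2, 30, 25, 48, 36, 63, 39, 53)
)

# Logistic regression with base R: a binomial GLM with the default logit link
fit <- glm(survived ~ sex + age, data = toy, family = binomial())

# Predicted probability of survival for a new passenger
new_passenger <- data.frame(sex = "female", age = 45)
prob <- predict(fit, newdata = new_passenger, type = "response")
prob
```

With a factor response, glm() treats the first level ("0") as failure, so the value returned is the estimated probability that survived equals "1".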

When I called rix::rix() to generate the default.nix file, I specified that I wanted R version 4.2.2 (because let’s say that this is the version I need. It’s also possible to get the current version of R by passing “current” to r_ver). You don’t see any reference to this version of R in the default.nix file, but this is the version that will get installed because it’s the version that comes with that particular revision of the nixpkgs repository:

"https://github.com/NixOS/nixpkgs/archive/8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8.tar.gz"

This url downloads that particular revision of nixpkgs containing R version 4.2.2. {rix} finds the right revision for you (using this handy service).
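Put differently, the only thing that pins the R version is the commit hash embedded in that url. A small sketch of how such a url is put together (nixpkgs_url() is a hypothetical helper written for this illustration, not a {rix} function):

```r
# Hypothetical helper: build the nixpkgs tarball url for a given commit hash
nixpkgs_url <- function(rev) {
  sprintf("https://github.com/NixOS/nixpkgs/archive/%s.tar.gz", rev)
}

# The revision used in this post (the one shipping R 4.2.2)
nixpkgs_url("8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8")
```

Swapping in another commit hash is all it takes to point default.nix at a different snapshot of nixpkgs, and therefore at a different version of R and its packages.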

While {rix} doesn’t require your system to have Nix installed, if you want to continue you’ll have to install Nix. To install Nix, I recommend you don’t use the official installer, even if it’s quite simple to use. Instead, the Determinate Systems installer seems better to me. On Windows, you will need to enable WSL2. An alternative is to run all of this inside a Docker container (more on this later; if you’re thinking something along the lines of “isn’t the purpose of Nix to not have to use Docker?”, then see you in the conclusion). Once you have Nix up and running, go inside the titanic_api/ folder (which contains the default.nix file above) and run the following command inside a terminal:

nix-build

This will build the environment according to the instructions in the default.nix file. Depending on what you want/need, this can take some time. Once the environment is done building, you can “enter” into it by typing:

nix-shell

Now this is where you would use this environment to work on your api. As I stated above, I’ll discuss interactive work using a Nix environment in a future blog post. Leave the terminal with this Nix shell open, create an empty text file next to default.nix, call it titanic_api.R and put this in there using any text editor of your choice:

#* Would you have survived the Titanic sinking?
#* @param sex Character. "male" or "female"
#* @param age Integer. Your age.
#* @get /prediction
function(sex, age) {

  trained_logreg <- readRDS("trained_logreg.rds")

  dataset <- data.frame(sex = sex, age = as.numeric(age))

  parsnip::predict.model_fit(trained_logreg,
                             new_data = dataset)

}

This script is a {plumber} api. It’s a simple function that uses an already trained logistic regression (lol) by loading it into its scope using the readRDS() function. It then returns a prediction. The script that I wrote to train the model is this one:

library(parsnip)

titanic_raw <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

titanic <- titanic_raw |>
  subset(select = c(Survived,
                    Sex,
                    Age))

names(titanic) <- c("survived", "sex", "age")

titanic$survived = as.factor(titanic$survived)

logreg_spec <- logistic_reg() |>
  set_engine("glm")

trained_logreg <- logreg_spec |>
  fit(survived ~ ., data = titanic)

saveRDS(trained_logreg, "trained_logreg.rds")

If you’re familiar with this Titanic prediction task, you will have noticed that the script above is completely stupid. I only kept two variables to fit the logistic regression. But the reason I did this is because this blog post is not about fitting models, but about apis. So bear with me. Anyways, once you’ve run the script above to generate the file trained_logreg.rds containing the trained model, you can locally test the api using {plumber}. Go back to the terminal that is running your Nix shell, and now type R to start R in that session. You can then run your api inside that session using:

plumber::pr("titanic_api.R") |>
  plumber::pr_run(port = 8000)

Open your web browser and visit http://localhost:8000/docs/ to see the Swagger interface to your api (Swagger is a nice little tool that makes testing your apis way easier).

Using Swagger you can try out your api: click on (1), then on (2). You can enter some mock data in (3) and (4) and then run the computation by clicking on “Execute” (5). You’ll see the result in (7). (6) gives you a curl command to run exactly this example from a terminal. Congrats, your {plumber} api is running on your computer! Now we need to deploy it online and make it available to the world.
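One thing worth knowing when you test the api: query parameters such as sex and age typically reach the endpoint as character strings, which is why titanic_api.R calls as.numeric(age). If you wanted to reject bad inputs before they reach the model, a check along these lines could be called at the top of the endpoint function. This is only a sketch: validate_inputs() is a hypothetical helper, not part of {plumber}.

```r
# Hypothetical helper: check and convert the raw query parameters
validate_inputs <- function(sex, age) {
  if (!sex %in% c("male", "female")) {
    stop("sex must be \"male\" or \"female\"")
  }
  # as.numeric() returns NA (with a warning) for non-numeric strings
  age <- suppressWarnings(as.numeric(age))
  if (is.na(age) || age < 0) {
    stop("age must be a non-negative number")
  }
  data.frame(sex = sex, age = age)
}

validate_inputs("female", "45")
```

The data frame it returns has the same two columns as the one built inside the endpoint, so it could be passed to the prediction call as is.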

Deploying your api

So if you have a Digital Ocean account log in (and if you don’t, use my referral link to get 200$ to test things out) and click on the top-right corner on the “Create” button, and then select “Droplet” (a fancy name for a VM):

In the next screen, select the region closest to you and then select Ubuntu as the operating system, “Regular” for the CPU options, and then the 4$ (or the 6$, it doesn't matter at this stage) a month Droplet. We will need to upgrade it immediately after having created it in order to actually build the environment. This is because building the environment requires some more RAM than what the 6$ option offers, but starting from the cheapest option ensures that we will then be able to downsize back to it, after the build process is done.

Next comes how you want to authenticate to your VM. There are two options, one using an SSH key, another using a password. If you’re already using Git, you can use the same SSH key. Click on “New SSH Key” and paste the public key in the box (you should find the key under ~/.ssh/id_rsa.pub if you’re using Linux). If you’re not using Git and have no idea what SSH keys are, my first piece of advice is to start using Git and then to generate an SSH key and login using it. This is much more secure than a password. Finally, click on “Create Droplet”. This will start building your VM. Once the Droplet is done building, you can check out its IP address:

Let’s immediately resize the Droplet to a larger size. As I said before, this is only required to build our production environment using Nix. Once the build is done, we can downsize again to the cheapest Droplet:

Choose a Droplet with 2 gigs of RAM to be on the safe side, and also enable the reserved IP option (this is a static IP that will never change):

Finally, turn on your Droplet, it’s time to log in to it using SSH.

Open a terminal on your computer and connect to your Droplet using SSH (starting now, user@local_computer refers to a terminal opened on your computer and root@droplet to an active ssh session inside your Droplet):

user@local_computer > ssh root@IP_ADDRESS_OF_YOUR_DROPLET

and add a folder that will contain the project’s files:

root@droplet > mkdir titanic_api

Great, let’s now copy our files to the Droplet using scp. Open a terminal on your computer, and navigate to where the default.nix file is. If you prefer doing this graphically, you can use Filezilla. Run the following command to copy the default.nix file to the Droplet:

user@local_computer > scp default.nix root@IP_ADDRESS_OF_YOUR_DROPLET:/root/titanic_api/

Now go back to the terminal that is logged into your Droplet. We now need to install Nix. For this, follow the instructions from the Determinate Systems installer, and run this line in the Droplet:

root@droplet > curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh -s -- install

Pay attention to the final message once the installation is done:

Nix was installed successfully!
To get started using Nix, open a new shell or run `. /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh`

So run . /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh to make Nix available in your current shell session (or simply open a new shell). Ok so now comes the magic of Nix. You can now build the exact same environment that you used to build the api on your computer in this Droplet. Simply run nix-build inside the titanic_api folder for the build process to start. I don’t really know how to describe how easy and awesome this is. You may be thinking “well, installing R and a couple of packages is not that hard”, but let me remind you that we are using a Droplet that is running Ubuntu, which is likely NOT the operating system that you are running. Maybe you are on Windows, maybe you are on macOS, or maybe you’re running another Linux distribution. Whatever it is you’re using, it will be different from that Droplet. Even if you’re running Ubuntu on your computer, chances are that you’ve changed the CRAN repositories from the default Ubuntu ones to the Posit ones, or maybe you’re using r2u. Basically, the chances that you will have the exact same environment in that Droplet as the one running on your computer are practically 0. And if you’re already familiar with Docker, I think that you will admit that this is much, much easier than dockerizing your {plumber} api. If you don’t agree, please shoot me an email and tell me why, I’m honestly curious. Also, let me stress again that if you needed to install a package like {xlsx} that requires Java to be installed, Nix would install the right version of Java for you.

Once the environment is done building, you can downsize your Droplet. Go back to your Digital Ocean account, select that Droplet and choose “Resize Droplet”, and go back to the 6$ a month plan.

SSH back into the Droplet and copy the trained model trained_logreg.rds and the api file, titanic_api.R to the Droplet using scp or Filezilla. It’s time to run the api. To do so, the obvious way would be simply to start an R session and to execute the code to run the api. However, if something happens and the R session dies, the api won’t restart. Instead, I’m using a CRON job and a utility called run-one. This utility, pre-installed in Ubuntu, runs one (1) script at a time, and ensures that only one instance of said script is running. So by putting this in a CRON job (CRON is a scheduler, so it executes a script as often as you specify), run-one will try to run the script. If it’s still running, nothing happens; if the script is not running, it runs it.

So go back to your local computer, and create a new text file, call it run_api.sh and write the following text in it:

#!/bin/bash
while true
do
nix-shell /root/titanic_api/default.nix --run "Rscript -e 'plumber::pr_run(plumber::pr(\"/root/titanic_api/titanic_api.R\"), host = \"0.0.0.0\", port=80)'"
 sleep 10
done

then copy this to your VM using scp or Filezilla, to /root/titanic_api/run_api.sh. Then SSH back into your Droplet, go to where the script is using cd:

root@droplet > cd /root/titanic_api/

and make the script executable:

root@droplet > chmod +x run_api.sh

We’re almost done. Now, let’s edit the crontab, to specify that we want this script to be executed every hour using run-one (so if it’s running, nothing happens, if it died, it gets restarted). To edit the crontab, type crontab -e and select the editor you’re most comfortable with. If you have no idea, select the first option, nano. Using your keyboard keys, navigate all the way down and type:

*/60 * * * * run-one /root/titanic_api/run_api.sh

save the file by typing CTRL-X, and then type Y when asked Save modified buffer?, and then type the ENTER key when prompted for File name to write.

We are now ready to start the api. Make sure CRON restarts by running:

root@droplet > service cron reload

and then run the script using run-one:

root@droplet > run-one /root/titanic_api/run_api.sh &

run-one will now run the script and will ensure that only one instance of the script is running (the & character at the end means “run this in the background”). If for any reason the process dies, CRON will restart an instance of the script. We can now call our api using this curl command:

user@local_computer > curl -X GET "http://IP_ADDRESS_OF_YOUR_DROPLET/prediction?sex=female&age=45" -H "accept: */*"

If you don’t have curl installed, you can use this webservice. You should see this answer:

[{
    ".pred_class": "1"
}]

I’ll leave my Droplet running for a few days after I post this, so if you want to try it out, run this:

curl -X GET "http://142.93.164.182/prediction?sex=female&age=45" -H "accept: */*"

The answer is in the JSON format, and can now be ingested by some other script which can now process it further.
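For example, another R script could parse that answer with {jsonlite} (a package that {plumber} itself depends on). The sketch below parses the response shown above from a string, so it runs without a live server; fromJSON() also accepts a url, so in practice you could pass it the url of your own Droplet instead.

```r
library(jsonlite)

# The response returned by the api, as shown above
response <- '[{".pred_class": "1"}]'

# An array of JSON objects is simplified to a data frame by default
prediction <- fromJSON(response)
prediction$.pred_class
```

From there the predicted class is an ordinary character column that the calling script can store, display or pass along.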

Conclusion

This was a long blog post. While it is part of my Nix series of blog posts, I almost didn’t talk about it, and this is actually the neat part. Nix made something that is usually difficult to solve trivially simple. Without Nix, the alternative would be to bundle the api with all its dependencies and an R interpreter using Docker or install everything by hand on the server. But the issue with Docker is that it’s not necessarily much easier than Nix, and you still have to make sure building the image is reproducible. So you have to make sure to use an image that ships with the right version of R and use {renv} to restore your packages. If you have system-level dependencies that are required, you also have to deal with those. Nix takes care of all of this for you, so that you can focus on all the other aspects of deployment, which take the bulk of the effort and time.

In the post I mentioned that you could also run Nix inside a Docker container. If you are already invested in Docker, Nix is still useful because you can use base NixOS images (NixOS is a Linux distribution that uses Nix as its package manager) or you could install Nix inside an Ubuntu image and then benefit from the reproducibility offered by Nix. Simply add RUN nix-build to your Dockerfile, and everything you need gets installed. You can even use Nix to build Docker images instead of writing a Dockerfile. The possibilities are endless!

Now, before you start building apis using R, you may want to read this blog post here as well. I found it quite interesting: it discusses the shortcomings of using R to build apis like I showed you here, which I think you need to know. If you have needs like the author of this blog post, then maybe R and {plumber} are not the right solution for you.

Next time, in part 4, I’ll either finally discuss how to do interactive work using a Nix environment, or I’ll discuss my package, {rix} in more detail. We’ll see!


Reproducible data science with Nix, part 2 -- running {targets} pipelines with Nix

Bruno Rodrigues — Thu, 20 Jul 2023 09:10:00 +0000

This is the second post in a series of posts about Nix. Disclaimer: I’m a super beginner with Nix. So this series of blog posts is more akin to notes that I’m taking while learning than a super detailed tutorial. So if you’re a Nix expert and read something stupid in here, that’s normal. This post is going to focus on R (obviously) but the ideas are applicable to any programming language.

So in part 1 I explained what Nix was and how you could use it to build reproducible development environments. Now, let’s go into more details and actually set up some environments and run a {targets} pipeline using it.

Obviously the first thing you should do is install Nix. A lot of what I’m showing here comes from Nix.dev, so if you want to install Nix, then look at the instructions here. If you’re using Windows, you’ll have to have WSL2 installed. If you don’t want to install Nix just yet, you can also play around with a NixOS Docker image. NixOS is a Linux distribution that uses the concepts of Nix for managing the whole operating system, and obviously comes with the Nix package manager installed. But if you’re using Nix inside Docker you won’t be able to work interactively with graphical applications like RStudio, due to how Docker works (but more on working interactively with IDEs in part 3 of this series, which I’m already drafting).

Assuming you have Nix installed, you should be able to run the following command in a terminal:

nix-shell -p sl

This will launch a Nix shell with the sl package installed. Because sl is not available, it’ll get installed on the fly, and you will get “dropped” into a Nix shell:

[nix-shell:~]$

You can now run sl and marvel at what it does (I won’t spoil it for you). You can quit the Nix shell by typing exit and you’ll go back to your usual terminal. If you try now to run sl it won’t work (unless you installed it on your daily machine). So if you need to go back to that Nix shell and rerun sl, simply rerun:

nix-shell -p sl

This time you’ll be dropped into the Nix shell immediately and can run sl. So if you need to use R, simply run the following:

nix-shell -p R

and you’ll be dropped in a Nix shell with R. This version of R will be different than the one potentially already installed on your system, and it won’t have access to any R packages that you might have installed. This is because Nix environments are isolated from the rest of your system (well, not quite, but again, more on this in part 3). So you’d need to add packages as well (exit the Nix shell and run this command to add packages):

nix-shell -p R rPackages.dplyr rPackages.janitor

You can now start R in that Nix shell and load the {dplyr} and {janitor} packages. You might be wondering how I knew that I needed to type rPackages.dplyr to install {dplyr}. You can look for this information online. By the way, if a package uses the . character in its name, you should replace that . character by _. So to install {data.table}, write rPackages.data_table.
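The naming rule is mechanical enough to script. Here is a small sketch in R (to_nix_attr() is a hypothetical helper written for this illustration, it does not come from any package):

```r
# Hypothetical helper: turn R package names into nixpkgs attribute names
to_nix_attr <- function(pkgs) {
  # fixed = TRUE replaces literal dots instead of treating "." as a regex
  paste0("rPackages.", gsub(".", "_", pkgs, fixed = TRUE))
}

to_nix_attr(c("dplyr", "janitor", "data.table"))
```

The resulting strings are what you would pass after nix-shell -p R.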

So that’s nice and dandy, but not quite what we want. Instead, what we want is to be able to declare what we need in terms of packages, dependencies, etc, inside a file, and have Nix build an environment according to these specifications which we can then use for our daily needs. To do so, we need to write a so-called default.nix file. This is what such a file looks like:

{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/e11142026e2cef35ea52c9205703823df225c947.tar.gz") {} }:

with pkgs;

let
  my-pkgs = rWrapper.override {
    packages = with rPackages; [dplyr ggplot2 R];
  };
in
mkShell {
  buildInputs = [my-pkgs];
}

I won’t discuss the intricate details of writing such a file just yet, because it’ll take too much time and I’ll be repeating what you can find on the Nix.dev website. I’ll give some pointers though. But for now, let’s assume that we already have such a default.nix file that we defined for our project, and see how we can use it to run a {targets} pipeline. I’ll explain later how I write such files.

Running a {targets} pipeline using Nix

Let’s say I have this, more complex, default.nix file:

{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8.tar.gz") {} }:

with pkgs;

let
  my-pkgs = rWrapper.override {
    packages = with rPackages; [
      targets
      tarchetypes
      rmarkdown
      (buildRPackage {
        name = "housing";
        src = fetchgit {
          url = "https://github.com/rap4all/housing/";
          branchName = "fusen";
          rev = "1c860959310b80e67c41f7bbdc3e84cef00df18e";
          sha256 = "sha256-s4KGtfKQ7hL0sfDhGb4BpBpspfefBN6hf+XlslqyEn4=";
        };
        propagatedBuildInputs = [
          dplyr
          ggplot2
          janitor
          purrr
          readxl
          rlang
          rvest
          stringr
          tidyr
        ];
      })
    ];
  };
in
mkShell {
  buildInputs = [my-pkgs];
}

So the file above defines an environment that contains all the required packages to run a pipeline that you can find on this Github repository. What’s interesting is that I need to install a package that’s only been released on Github, the {housing} package that I wrote for the purposes of my book, and I can do so in that file as well, using the fetchgit() function. Nix has many such functions, called fetchers, that simplify the process of downloading files from the internet, see here. This function takes some self-explanatory inputs as arguments, and two other arguments that might not be that self-explanatory: rev and sha256. rev is actually the commit on the Github repository. This commit is the one that I want to use for this particular project. So even if I keep working on this package, building an environment with this default.nix will always pull the source code as it was at that particular commit. sha256 is the hash of the downloaded repository. It makes sure that the files weren’t tampered with. How did I obtain that? Well, the simplest way is to set it to the empty string "" and then try to build the environment. This error message will pop up:

error: hash mismatch in fixed-output derivation '/nix/store/449zx4p6x0yijym14q3jslg55kihzw66-housing-1c86095.drv':
         specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
            got:    sha256-s4KGtfKQ7hL0sfDhGb4BpBpspfefBN6hf+XlslqyEn4=

So simply copy the hash from the last line, and rebuild! Then if in the future something happens to the files, you’ll know. Another interesting input is propagatedBuildInputs. These are simply the dependencies of the {housing} package. To find them, see the Imports: section of the DESCRIPTION file. There’s also the fetchFromGitHub fetcher that I could have used, but unlike with fetchgit, it is not possible to specify the branch name we want to use. Since here I wanted to get the code from the branch called fusen, I had to use fetchgit. The last thing I want to explain is the very first line:

{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8.tar.gz") {} }:

In particular the url. This url points to a specific release of nixpkgs, that ships the required version of R for this project, R version 4.2.2. How did I find this release of nixpkgs? There’s a handy service for that here. So using this service, I get the right commit hash for the release that installs R version 4.2.2.

Ok, but before building the environment defined by this file, let me just say that I know what you’re thinking. Probably something along the lines of: “damn it Bruno, this looks complicated and why should I care? Let me just use {renv}!!” And I’m not going to lie: writing the above file from scratch didn’t take me long in typing, but it took me long in reading. I had to read quite a lot (look at part 1 for some nice references) before being comfortable enough to write it. But I’ll just say this:

continue reading, because I hope to convince you that Nix is really worth the effort
I’m working on a package that will help R users generate default.nix files like the one from above with minimal effort (more on this at the end of the blog post)

If you’re following along, instead of typing this file, you can clone this repository. This repository contains the default.nix file from above, and a {targets} pipeline that I will run in that environment.

Ok, so now let’s build the environment by running nix-build inside a terminal in the folder that contains this file. It should take a bit of time, because many of the packages will need to be built from source. But they will get built. Then, you can drop into a Nix shell using nix-shell and then type R, which will start the R session in that environment. You can then simply run targets::tar_make(), and you’ll see the file analyse.html appear, which is the output of the {targets} pipeline.

Before continuing, let me just make you realize three things:

we just ran a targets pipeline with all the needed dependencies which include not only package dependencies, but the right version of R (version 4.2.2) as well, and all required system dependencies;
we did so WITHOUT using any containerization tool like Docker;
the whole thing is completely reproducible; the exact same packages will forever be installed, regardless of when we build this environment, because I’m using a particular release of nixpkgs (8ad5e8132c5dcf977e308e7bf5517cc6cc0bf7d8) so each piece of software this release of Nix installs is going to stay constant.

And I need to stress completely reproducible. Because using {renv}+Docker, while providing a very nice solution, still has some issues. First of all, with Docker, the underlying operating system (often Ubuntu) evolves and changes through time. So lower level dependencies might change. And at some point in the future, that version of Ubuntu will not be supported anymore. So it won’t be possible to rebuild the image, because it won’t be possible to download any software into it. So either we build our Docker image and really need to make sure to keep it forever, or we need to port our pipeline to newer versions of Ubuntu, without any guarantee that it’s going to work exactly the same. Also, by defining Dockerfiles that build upon Dockerfiles that build upon Dockerfiles, it’s difficult to know what is actually installed in a particular image. This situation can of course be avoided by writing Dockerfiles in such a way that they don’t rely on any other Dockerfile, but that’s also a lot of effort. Now don’t get me wrong: I’m not saying Docker should be canceled. I still think that it has its place and that it’s perfectly fine to use it (I’ll take a project that uses {renv}+Docker any day over one that doesn’t!). But you should be aware of alternative ways of running pipelines in a reproducible way, and Nix is such a way.

Going back to our pipeline, we could also run the pipeline with this command:

nix-shell /path/to/default.nix --run "Rscript -e 'setwd(\"/path/to\");targets::tar_make()'"

but it’s a bit of a mouthful. What you could do instead is run the pipeline each time you drop into the Nix shell by adding a so-called shellHook. For this, we need to change the default.nix file again. Add these lines in the mkShell function:

...
mkShell {
  buildInputs = [my-pkgs];
  shellHook = ''
     Rscript -e "targets::tar_make()"
  '';
}

Now, each time you drop into the Nix shell in the folder containing that default.nix file, targets::tar_make() gets automatically executed. You can then inspect the results.

In the next blog post, I’ll show how we can use that environment with IDEs like RStudio, VS Code and Emacs to work interactively. But first, let me quickly talk about a package I’ve been working on to ease the process of writing default.nix files.

Rix: Reproducible Environments with Nix

I wrote a very early, experimental package called {rix} which will help write these default.nix files for us. {rix} is an R package that hopefully will make R users want to try out Nix for their development purposes. It aims to mimic the workflow of {renv}, or to be more exact, the workflow of what Python users do when starting a new project. Usually what they do is create a completely fresh environment using pyenv (or another similar tool). Using pyenv, Python developers can install a per-project version of Python and Python packages, but pyenv, unlike Nix, won’t install system-level dependencies as well.

If you want to install {rix}, run the following line in an R session:

devtools::install_github("b-rodrigues/rix")

You can then use the rix() function to create a default.nix file like so:

rix::rix(r_ver = "current",
         pkgs = c("dplyr", "janitor"),
         ide = "rstudio",
         path = ".")

This will create a default.nix file that Nix can use to build an environment that includes the current versions of R, {dplyr} and {janitor}, and RStudio as well. Yes you read that right: you need to have a per-project RStudio installation. The reason is that RStudio modifies environment variables and so your “locally” installed RStudio would not find the R version installed with Nix. This is not the case with other IDEs like VS Code or Emacs. If you want to have an environment with another version of R, simply run:

rix::rix(r_ver = "4.2.1",
         pkgs = c("dplyr", "janitor"),
         ide = "rstudio",
         path = ".")

and you’ll get an environment with R version 4.2.1. To see which versions are available, you can run rix::available_r(). Learn more about {rix} on its website. It’s in very early stages, and doesn’t handle packages that have only been released on Github, yet. And the interface might change. I’m thinking of making it possible to list the packages in a yaml file and then have rix() generate the default.nix file from the yaml file. This might be cleaner. There is already something like this called Nixml, so maybe I don’t even need to rewrite anything!

But I’ll discuss this in more detail next time, where I’ll explain how you can use development environments built with Nix using an IDE.

References

The great Nix.dev tutorials.
This blog post: Statistical Rethinking and Nix I referenced in part 1 as well, it helped me install my {housing} package from Github.
Nixml.

Reproducible data science with Nix, part 1 -- what is Nix

Bruno Rodrigues — Thu, 20 Jul 2023 09:08:09 +0000

This is the first of a (hopefully) series of posts about Nix. Disclaimer: I’m a super beginner with Nix. So this series of blog posts is more akin to notes that I’m taking while learning than a super detailed tutorial. So if you’re a Nix expert and read something stupid in here, that’s normal. This post is going to focus on R (obviously) but the ideas are applicable to any programming language.

To ensure that a project is reproducible you need to deal with at least four things:

Make sure that the required/correct version of R (or any other language) is installed;
Make sure that the required versions of packages are installed;
Make sure that system dependencies are installed (for example, you’d need a working Java installation to install the {rJava} R package on Linux);
Make sure that you can install all of this for the hardware you have on hand.
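
To make these four items a bit more concrete, here is a minimal sketch in base R of the information you would want to record for a project. The function name describe_environment() and the default package list are made up for this illustration; this only reports what is installed, it does not pin anything.

```r
# Hypothetical helper (not part of any package): report the pieces of
# information that the list above says must be controlled.
describe_environment <- function(pkgs = c("stats", "utils")) {
  list(
    # 1. the version of R itself
    r_version = R.version.string,
    # 2. the versions of the packages the project relies on
    packages = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    ),
    # 3. and 4. the operating system and hardware R was built for
    os = R.version$os,
    architecture = R.version$arch
  )
}

env <- describe_environment()
str(env)
```

Recording this is the easy part; the hard part is making sure that someone else, on another machine and years later, ends up with the same values.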

For the first three bullet points, the consensus seems to be a mixture of Docker to deal with system dependencies, {renv} for the packages (or {groundhog}, or a fixed CRAN snapshot like those Posit provides) and the R installation manager to install the correct version of R (unless you use a Docker image as base that already ships the required version by default). As for the last point, the only way out is to be able to compile the software for the target architecture. There are a lot of moving pieces, and a lot that you need to know, and I even wrote a whole 522-page book about all of this.

But it turns out that this is not the only solution. Docker + {renv} (or some other way to deal with packages) is likely the most popular way to ensure reproducibility of your projects, but there are other tools to achieve this. One such tool is called Nix.

Nix is a package manager for Linux distributions, macOS and apparently it even works on Windows if you enable WSL2. What’s a package manager? If you’re not a Linux user, you may not be aware. Let me explain it this way: in R, if you want to install a package to provide some functionality not included with a vanilla installation of R, you’d run this:

install.packages("dplyr")

It turns out that Linux distributions, like Ubuntu for example, work in a similar way, but for software that you’d usually install using an installer (at least on Windows). For example you could install Firefox on Ubuntu using:

sudo apt-get install firefox

(there are also graphical interfaces that make this process “more user-friendly”). In Linux jargon, packages are simply what normies call software (or I guess it’s all “apps” these days). These packages get downloaded from so-called repositories (think of CRAN, the repository of R packages) but for any type of software that you might need to make your computer work: web browsers, office suites, multimedia software and so on.

So Nix is just another package manager that you can use to install software.

But what interests us is not using Nix to install Firefox, but instead to install R and the R packages that we require for our analysis (or any other programming language that we need). But why use Nix instead of the usual ways to install software on our operating systems?

The first thing that you should know is that Nix’s repository, nixpkgs, is huge. Humongously huge. As I’m writing these lines, there’s more than 80’000 pieces of software available, and the entirety of CRAN is also available through nixpkgs. So instead of installing R as you usually do and then use install.packages() to install packages, you could use Nix to handle everything. But still, why use Nix at all?

Nix has an interesting feature: using Nix, it is possible to install software in (relatively) isolated environments. So using Nix, you can install as many versions of R and R packages as you need. Suppose that you start working on a new project. As you start the project, with Nix, you would install a project-specific version of R and R packages that you would only use for that particular project. If you switch projects, you’d switch versions of R and R packages. If you are familiar with {renv}, you should see that this is exactly the same thing: the difference is that not only will you have a project-specific library of R packages, you will also have a project-specific R version. So if you start a project now, you’d have R version 4.2.3 installed (the latest version available in nixpkgs but not the latest version available, more on this later), with the accompanying versions of R packages, for as long as the project lives (which can be a long time). If you start a project next year, then that project will have its own R, maybe R version 4.4.2 or something like that, and the set of required R packages that would be current at that time. This is because Nix always installs the software that you need in separate (isolated) environments on your computer. So you can define an environment for one specific project.

But Nix goes even further: not only can you install R and R packages using Nix (in isolated, project-specific environments), Nix even installs the required system dependencies. So for example if I need {rJava}, Nix will make sure to install the correct version of Java as well, always in that project-specific environment (so if you already have some Java version installed on your system, there won’t be any interference).

What’s also pretty awesome, is that you can use a specific version of nixpkgs to always get exactly the same versions of all the software whenever you build that environment to run your project’s code. The environment gets defined in a simple plain-text file, and anyone using that file to build the environment will get exactly, byte by byte, the same environment as you when you initially started the project. And this also regardless of the operating system that is used.

So let me illustrate this. After installing Nix, I can define an environment by writing a file called default.nix that looks like this:

{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/e11142026e2cef35ea52c9205703823df225c947.tar.gz") {} }:

with pkgs;

let
  my-pkgs = rWrapper.override {
    packages = with rPackages; [ dplyr ggplot2 R];
  };
in
mkShell {
  buildInputs = [my-pkgs];
}

Now this certainly looks complicated! And it is. The entry cost to Nix is quite high, because, actually, Nix is more than a package manager. It is also a programming language, and this programming language gets used to configure environments. I won’t go too much into detail, but you’ll see in the first line that I’m using a specific version of nixpkgs that gets downloaded directly from Github. This means that whatever I install from that specific version of nixpkgs will always be the same software, at the same version. This is what ensures that R and R packages are versioned. Basically, by using a specific version of nixpkgs, I pin all the versions of all the software that this particular version of nixpkgs will ever install. I then define a variable called my-pkgs which lists the packages I want to install ({dplyr}, {ggplot2} and R).

By the way, this may look like it would take a lot of time to install because, after all, you need to install R, R packages and underlying system dependencies, but thankfully there is an online cache of binaries that gets automatically used by Nix (cache.nixos.org) for fast installations. If binaries are not available, sources get compiled.

I can now create an environment with these exact specifications using (in the directory where default.nix is):

nix-build

or I could use the R version from this environment to run some arbitrary code:

nix-shell /home/renv/default.nix --run "Rscript -e 'sessionInfo()'" >> /home/renv/sessionInfo.txt

(assuming my default.nix file is available in the /home/renv/ directory). This would build the environment on the fly and run sessionInfo() inside of it. Here are the contents of this sessionInfo.txt file:

R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)

Matrix products: default
BLAS/LAPACK: /nix/store/pbfs53rcnrzgjiaajf7xvwrfqq385ykv-blas-3/lib/libblas.so.3

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.3

This looks like any other output of the sessionInfo() function, but there is something quite unusual: the BLAS/LAPACK line:

BLAS/LAPACK: /nix/store/pbfs53rcnrzgjiaajf7xvwrfqq385ykv-blas-3/lib/libblas.so.3

BLAS is a library that R uses for linear algebra, matrix multiplication and vector operations. R usually ships with its own version of BLAS and LAPACK, but it’s also possible to use external ones. Here, we see that the path to the shared object libblas.so.3 is somewhere in /nix/store/..... /nix/store/ is where all the software gets installed. The long chain of seemingly random characters is a hash, essentially the unique identifier of that particular version of BLAS. This means that unlike Docker, if you’re using Nix you are also certain that these types of dependencies, that may have an impact on your results, also get handled properly, and that the exact same version you used will keep getting installed in the future. Docker images also evolve, and even if you use an LTS release of Ubuntu as a base, the underlying system packages will evolve through time as well. And there will be a point in time where this release will be abandoned (LTS releases receive 5 years of support), so if you need to rebuild a Docker image based on an LTS that doesn’t get supported anymore, you’re out of luck.
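
If you want to check this yourself without reading through the whole sessionInfo() output, base R can report these paths directly. The snippet below is only a sketch: extSoftVersion(), La_library() and La_version() are base R functions (available since R 3.4.0), while from_nix_store() is a small helper I made up for the illustration.

```r
# Where do the linear algebra libraries of this R session come from?
blas_path   <- extSoftVersion()[["BLAS"]]  # file containing the BLAS in use
lapack_path <- La_library()                # file containing the LAPACK in use
lapack_ver  <- La_version()                # LAPACK version string

cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "(version", lapack_ver, ")\n")

# Hypothetical helper: does a path point inside the Nix store?
from_nix_store <- function(path) startsWith(path, "/nix/store/")

# Using the path from the sessionInfo() output shown above:
from_nix_store("/nix/store/pbfs53rcnrzgjiaajf7xvwrfqq385ykv-blas-3/lib/libblas.so.3")
```

In an environment built by Nix you would expect from_nix_store(blas_path) to be TRUE; in a regular installation it would typically point somewhere under your system’s library folders instead.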

If you don’t want to install Nix just yet on your computer, you should know that there’s also a complete operating system called NixOS, that uses Nix as its package manager, and that there are Docker images that use NixOS as a base. So this means that you could use such an image and then build the environment (that is 100% completely reproducible) inside and run a container that will always produce the same output. To see an example of this, check out this Github repo. I’m writing a Dockerfile as I usually do, but actually I could even use Nix to define the Docker image for me, it’s that powerful!

Nix seems like a very powerful tool to me. But there are some “issues”:

As I stated above, the entry cost is quite high, because Nix is not “just a tool”, it’s a complete programming language that can even run pipelines, so you could technically even replace something like {targets} with it;
If you need to install specific versions of R packages, that are not pinned to dates, then Nix is not for you. Nix will always create a coherent environment with R and R packages that go together for a particular release of nixpkgs. If for some reason you need a very old version of {ggplot2} but a much more recent version of {dplyr}, using Nix won’t make this any easier than other methods;
There is no easy way (afaik) to find the version of nixpkgs that you need to download to find the version of R that you may need; UPDATE: turns out that there is such a simple tool, thanks to @shane@hachyderm.io for telling me!
R packages (and I guess others for other programming languages as well) that are available on the stable channel of nixpkgs lag a bit behind their counterparts on CRAN. These usually all get updated whenever there’s a new release of R. Currently however, the stable branch of nixpkgs ships R version 4.2.3, whereas it should be at version 4.3.1. This can sometimes happen due to various reasons (there are actual human beings behind this that volunteer their time and they also have a life). There is however an “unstable” nixpkgs channel that contains bleeding edge versions of R packages (and R itself) if you really need the latest versions of packages (don’t worry about the “unstable” label, from my understanding this simply means that packages have not been thoroughly tested yet, but are still pretty much rock-solid);
If you need something that is not on CRAN (or Bioconductor) then it’s still possible to use Nix to install these packages, but you’ll have to perform some manual configuration.

I will keep exploring Nix, and this is essentially my todo:

using my environment that I installed with Nix to work interactively;
write some tool that lets me specify an R version, a list of packages and it generates a default.nix file automagically (ideally it should also deal with packages only available on Github);
????
Profit!

Resources

Here are some of the resources I’ve been using:

nix.dev tutorials
INRIA’s Nix tutorial
Nix pills
Nix for Data Science
NixOS explained: NixOS is an entire Linux distribution that uses Nix as its package manager.
Blog post: Nix with R and devtools
Blog post: Statistical Rethinking and Nix
Blog post: Searching and installing old versions of Nix packages

Thanks

Many thanks to Justin Bedő, maintainer of the R package for Nix, for answering all my questions on Nix!

Software engineering techniques that non-programmers who write a lot of code can benefit from — the DRY WIT approach

Bruno Rodrigues — Tue, 07 Mar 2023 21:06:50 +0000

Data scientists, statisticians, analysts, researchers, and many other professionals write a lot of code.

Not only do they write a lot of code, but they must also read and review a lot of code as well. They either work in teams and need to review each other’s code, or need to be able to reproduce results from past projects, be it for peer review or auditing purposes. And yet, they never, or very rarely, get taught the tools and techniques that would make the process of writing, collaborating, reviewing and reproducing projects possible.

Which is truly unfortunate because software engineers face the same challenges and solved them decades ago. Software engineers developed a set of project management techniques and tools that non-programmers who write a lot of code could benefit from as well.

These tools and techniques can be used right from the start of a project at a minimal cost, such that the analysis is well-tested, well-documented, trustworthy and reproducible by design. Projects are going to be reproducible simply because they were engineered, from the start, to be reproducible.

But all these tools, frameworks and techniques boil down to two acronyms that I like to keep in my head at all times:

DRY: Don’t Repeat Yourself;
WIT: Write IT down.

DRY WIT: by systematically avoiding repeating yourself and by writing everything down, projects become well-tested, well-documented, trustworthy and reproducible by design. Why is that?

DRY: Don’t Repeat Yourself

Let’s start with DRY: what does it mean to not repeat oneself? It means:

using functions instead of copy-and-pasting bits of code here and there;
using literate programming, to avoid having to copy and paste graphs and tables into word or pdf documents;
treating code as data and making use of templating.

The most widely used programming languages for data science/statistics, Python and R, both have first-class functions. This means that functions can be manipulated like any other object. So something like:

Reduce(`+`, 1:100)

## [1] 5050

where the function +() gets used as an argument of the higher-order Reduce() function is absolutely valid (and so is Python’s equivalent reduce from functools) and avoids having to use a for-loop which can lead to other issues. Generally speaking, the functional programming paradigm lends itself very naturally to data analysis tasks, and in my opinion data scientists and statisticians would benefit a lot from adopting this paradigm.
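
To make the comparison with a for-loop explicit, here is a small sketch. The helper compose2() is something I made up for this example; everything else is base R.

```r
# Sum of 1..100 with an explicit loop: we have to create and
# update an accumulator ourselves.
total <- 0
for (i in 1:100) {
  total <- total + i
}

# Same computation with Reduce(): no accumulator to manage.
total_reduce <- Reduce(`+`, 1:100)

total == total_reduce
## [1] TRUE

# Functions are ordinary objects, so a list of functions can be
# reduced as well. compose2() is a hypothetical helper that chains
# two functions: first f, then g.
compose2 <- function(f, g) function(x) g(f(x))

add_one_then_double <- Reduce(
  compose2,
  list(function(x) x + 1, function(x) x * 2)
)

add_one_then_double(3)
## [1] 8
```

The loop version works, of course, but the accumulator is one more piece of state that can be initialised wrongly or accidentally overwritten elsewhere in a script.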

Literate programming is another tool that needs to be in the toolbox of any person analysing data. This is because at the end of the day, the results of an analysis need to be in some form of document. Without literate programming, this is how you would draft reports:

But with literate programming, this is what this loop would look like:

Quarto is the latest open-source scientific and technical publishing system that leverages Pandoc and supports R, Python, Julia and ObservableJs right out of the box.

Below is a little Quarto Hello World:

---
format: pdf
---

In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a LaTeX document:

```{r}
data(airquality)
kruskal.test(Ozone ~ Month, data = airquality)
```

which shows that the location parameter of the Ozone
distribution varies significantly from month to month.
Finally we include a boxplot of the data:

```{r, echo = FALSE}
boxplot(Ozone ~ Month, data = airquality)
```

Compiling this document results in the following:

Example from Leisch’s 2002 paper.

Of course, you could use Python code chunks instead of R, you could also compile this document to Word, or HTML, or anything else really. By combining code and prose, the process of data analysis gets streamlined and we don’t need to repeat ourselves copy and pasting images and tables into Word documents.

Finally, treating code as data is also quite useful. This means that it is possible to compute on the language itself. This is a more advanced topic, but definitely worth the effort. As an illustration, consider the following R toy example:

show_and_eval <- function(f, ...){
  f <- deparse(substitute(f))
  dots <- list(...)
  message("Evaluating: ", f, "() with arguments: ", deparse(dots))
  do.call(f, dots)
}

Running this function does the following:

show_and_eval(sqrt, 2)

## Evaluating: sqrt() with arguments: list(2)

## [1] 1.414214

show_and_eval(mean, x = c(NA, 1, 2))

## Evaluating: mean() with arguments: list(x = c(NA, 1, 2))

## [1] NA

show_and_eval(mean, x = c(NA, 1, 2), na.rm = TRUE)

## Evaluating: mean() with arguments: list(x = c(NA, 1, 2), na.rm = TRUE)

## [1] 1.5

This is incredibly useful when writing packages (to know more about these techniques in the R programming language, read the chapter Metaprogramming from Advanced R).
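
If the function above feels a bit magical, the underlying idea can be shown with even less code. This is only a sketch of the general principle using base R’s quote() and eval(); the expression and the data are made up for the illustration.

```r
# An expression can be captured without being evaluated...
expr <- quote(mean(x, na.rm = TRUE))

# ...inspected and modified like a list: the first element of a call
# is the function being called, which we swap for another one...
expr[[1]] <- as.name("median")
deparse(expr)
## [1] "median(x, na.rm = TRUE)"

# ...and evaluated later, with the data of our choosing.
eval(expr, list(x = c(NA, 1, 2, 10)))
## [1] 2
```

Here the code itself is the data being manipulated: we built one call, rewrote it, and only then ran it.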

WIT: Write IT down

Now on the WIT bit: write it down. You’ve just written a function. To see if it works correctly, you test it in the interactive console. You execute the test, see that it works, and move on. But wait! What you just did is called a unit test. Instead of writing that in the console and then never using it again, write it down in a script. Now you’ve got a unit test for that function that you can execute each time you update that function’s code, and make sure that it keeps working as expected. There are many unit testing frameworks that can help you write unit tests consistently and run them automatically.
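
Here is what this could look like in its simplest form, using nothing but base R. The function standardise() and its test are made up for this example; in a real project you would probably use a dedicated framework such as {testthat}, but the principle is the same.

```r
# A small function we have just written (hypothetical example).
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Instead of eyeballing the result in the console, write the checks down.
# stopifnot() throws an error as soon as one condition is not TRUE.
test_standardise <- function() {
  z <- standardise(c(1, 2, 3, 4, 5))
  stopifnot(
    length(z) == 5,
    isTRUE(all.equal(mean(z), 0)),  # centred on zero
    isTRUE(all.equal(sd(z), 1))     # unit standard deviation
  )
  invisible(TRUE)
}

# Run this each time standardise() changes.
test_standardise()
```

If someone later “improves” standardise() and breaks it, running the script fails immediately instead of the error surfacing three months later in a report.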

Documentation: write it down! How does the function work? What are its inputs? Its outputs? What else should the user know to make it work? Very often, documentation is but a series of comments in your scripts. That’s already nice, but using literate programming, you could also turn these comments into proper documentation. You could use docstrings in Python or {roxygen2} style comments in R.
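
As an illustration, this is roughly what {roxygen2} style comments look like. The function fahrenheit_to_celsius() is a made-up example; the tags (@param, @return, @examples, @export) are standard {roxygen2} tags, and the comments are plain R comments, so the script runs even if {roxygen2} is not installed.

```r
#' Convert temperatures from Fahrenheit to Celsius
#'
#' @param f A numeric vector of temperatures in degrees Fahrenheit.
#' @return A numeric vector of the same length, in degrees Celsius.
#' @examples
#' fahrenheit_to_celsius(c(32, 212))
#' @export
fahrenheit_to_celsius <- function(f) {
  stopifnot(is.numeric(f))
  (f - 32) * 5 / 9
}

fahrenheit_to_celsius(c(32, 212))
## [1]   0 100
```

When the function lives in a package, these comments get turned into the help page that users see with ?fahrenheit_to_celsius, so the documentation sits right next to the code it describes.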

Another classic: you correct some data manually in the raw dataset (very often a .csv or .xlsx file). For example, when dealing with data on people, sex is sometimes “M” or “F”, sometimes “Male” or “Female”, sometimes “1” or “0”. You spot a couple of inconsistencies and decide to quickly correct them by hand. Maybe only 3 men were coded as “Male” so you simply erase the “ale” and go on with your project. Stop!

Write it down!

Write a couple of lines of code that do the replacement for you. Not only will this leave a trace, it will ensure that when you get an update to that data in the future you don’t have to remember to change it by hand.
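
For the example above, those couple of lines could look like this. The data frame is made up for the illustration, and the lookup table only covers the codings that appear in it; adapt it to whatever shows up in your own data.

```r
# Hypothetical raw data with inconsistent coding of the same variable.
raw <- data.frame(
  id  = 1:6,
  sex = c("M", "F", "Male", "Female", "M", "Male"),
  stringsAsFactors = FALSE
)

# The correction, written down as a lookup table instead of done by hand.
sex_lookup <- c(M = "M", Male = "M", F = "F", Female = "F")

clean <- raw
clean$sex <- unname(sex_lookup[raw$sex])

# Fail loudly if a future update of the data introduces a coding
# that the lookup table does not know about.
stopifnot(!anyNA(clean$sex))

table(clean$sex)
```

The raw file stays untouched, the correction is documented by the code itself, and an unexpected value in next year’s data triggers an error instead of silently slipping through.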

You should aim at completely eliminating any required manual intervention when building your project. A project that can be fully run by a machine is easier to debug, its execution can be scheduled and can be iterated over very quickly.

Something else that you should write down, or rather, let another tool do it for you: how you collaborate with your teammates. For this, you should be using Git. Who changed what part of what function when? If the project’s code is versioned, Git writes it down for you. You want to experiment with a new feature? Write it down by creating a new branch and going nuts. There’s something wrong in the code? Write it down as an issue on your versioning platform (usually Github).

There are many more topics that us disciples of the data could learn from software engineers. I’m currently working on a free ebook that you can read here that teaches these techniques. If this post whetted your appetite, give the book a go!