DEV Community: Aaron Jacobs

Annotating Deployments in Grafana Using the Process Start Time Metric

Aaron Jacobs — Fri, 22 May 2020 00:00:00 +0000

Grafana sports a feature called Annotations that allows you to label a timestamp on a dashboard with meaningful events – most commonly deployments, campaigns, or outages:

(In this case annotating the simulated deployment of a FluentBit container, which I’ve used to forward container logs out of the cluster.)

Annotations can be input manually, but the only recommendations I’ve seen for generating them automatically are to use something like Loki or to teach your CI/CD system to interact with Grafana’s web API. However, if you’re running a simple Prometheus + Grafana stack (say, using the Prometheus Operator on Kubernetes), you might be reluctant to add more complexity to your setup just to get deployment annotations.

Fortunately, there’s a simpler alternative for this narrow case: you can use the process_start_time_seconds metric from Prometheus to get an approximate idea of when apps or pods were started. I haven’t seen this approach recommended elsewhere, hence this post.

It turns out that process_start_time_seconds is exposed by almost all applications because it’s one of the standard metrics recommended by Prometheus itself and is exported by most client libraries automatically (including my own).

You can add annotations like the one in the image above as follows, assuming you have a namespace and pod template variable defined:

It’s important to understand that these annotations will show only when new processes are started, most likely because of a deployment but also during rescheduling, scaling, or a pod failure – but since it is likely that you’d want to know about those as well, perhaps that’s a good thing.

Structured Errors in Plumber APIs

Aaron Jacobs — Sat, 07 Dec 2019 00:00:00 +0000

If you’ve used the Plumber package to make R models or other code accessible to others via an API, sooner or later you will need to decide how to handle and report errors.

By default, Plumber will catch R-level errors (like calls to stop()) and report them to users of your API as a JSON-encoded error message with HTTP status code 500 – also known as Internal Server Error. This might look something like the following from the command line:

$ curl -v localhost:8000/status
> GET /status HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Date: Sun, 24 Mar 2019 22:56:27 GMT
< Content-Type: application/json
< Date: Sun, 24 Mar 2019 10:56:27 PM GMT
< Connection: close
< Content-Length: 97
< 
* Closing connection 0
{"error":["500 - Internal server error"],"message":["Error: Missing required 'id' parameter.\n"]}

There are two problems with this approach: first, it gives you almost zero control over how errors are reported to real users, and second, it’s badly behaved at the protocol level – HTTP status codes provide for much more granular and semantically meaningful error reporting.

In my view, the key to overcoming these problems is treating errors as more than simply a message and adding additional context when they are emitted. This is sometimes called structured error handling, and although it has not been used much historically in R, this may be changing. As you’ll see, we can take advantage of R’s powerful condition system to implement rich error handling and reporting for Plumber APIs with relative ease.

But first, it’s worth asking precisely what we want to get out of such an error handling system – that is, how can we distinguish errors we want our users to see?

Operational vs. Programmer Errors

Part of the issue here is that R (and Plumber) treat all errors as essentially the same, when in practice this is not the case.

The folks at Joyent coined the terms operational and programmer errors in the context of JavaScript, and I think this distinction is apt for R as well. To quote the article at length:

People use the term “errors” to talk about both operational and programmer errors, but they’re really quite different. Operational errors are error conditions that all correct programs must deal with, and as long as they’re dealt with, they don’t necessarily indicate a bug or even a serious problem. “File not found” is an operational error, but it doesn’t necessarily mean anything’s wrong. It might just mean the program has to create the file it’s looking for first.

By contrast, programmer errors are bugs. They’re cases where you made a mistake, maybe by forgetting to validate user input, mistyping a variable name, or something like that. By definition there’s no way to handle those. If there were, you would have just used the error handling code in place of the code that caused the error!

In the context of Plumber APIs, we want to notify users of operational errors, because we require that they address these errors in order to use the API correctly. Programmer errors, on the other hand, might generate bizarre or misleading messages – so it’s not clear we want users to see them at all. At the same time, it is very important that we see them so that we can start to track down the underlying bugs that caused them.

Operational and programmer errors also have a very natural expression in terms of HTTP status codes; for the most part, 4xx codes are for client (operational) errors, and 5xx codes are for server (programmer) errors.

By design, Plumber assumes that any error it encounters while running your code is a programmer error. This is the right default, but it does mean that you need to go out of your way to report operational errors instead. You can see this clearly by attempting to use R-style errors in a Plumber API.

R-Style Error Handling in Plumber

Suppose we have the following simple plumber.R file, which allows users to query the status of some familiar institutional patients:

records <- data.frame(
  id = 1:3,
  name = c("George", "Sally", "Michael"),
  admitted = c("2018-01-03", "2018-04-14", "2018-05-26"),
  released = c("2018-11-27", "2018-12-25", NA)
)

#* @param id:numeric* The patient's ID number.
#* @serializer unboxedJSON
#* @get /status
status <- function(id = NULL) {
  id <- as.integer(id)
  record <- records[records$id == id,]
  record$status <- if (!is.na(record$released)) "Released" else "Admitted"

  unclass(record)
}

You can run this in the usual way with

server <- plumber::plumb("plumber.R")
server$run(port = 8000, debug = TRUE, swagger = FALSE)

And normal queries should look like the following from the command line (with curl and jq):

$ curl -s localhost:8000/status?id=2 | jq
{
  "id": 2,
  "name": "Sally",
  "admitted": "2018-04-14",
  "released": "2018-12-25",
  "status": "Released"
}
$ curl -s localhost:8000/status?id=3 | jq
{
  "id": 3,
  "name": "Michael",
  "admitted": "2018-05-26",
  "released": null,
  "status": "Admitted"
}

Of course, there are a number of ways the endpoint could fail, so let’s add some R-style error handling:

status <- function(id) {
  if (missing(id)) {
    stop("Missing required 'id' parameter.", call. = FALSE)
  }
  id <- suppressWarnings(as.integer(id))
  if (is.na(id)) {
    stop("The 'id' parameter must be a positive integer.", call. = FALSE)
  }

  record <- records[records$id == id,]
  if (nrow(record) == 0) {
    stop("No patient found with id: ", id, ".", call. = FALSE)
  }
  record$status <- if (!is.na(record$released)) "Released" else "Admitted"

  unclass(record)
}

We can then test some error conditions:

$ curl localhost:8000/status | jq
{
  "error": "500 - Internal server error",
  "message": "Error: Missing required 'id' parameter.\n"
}
$ curl localhost:8000/status?id=cats | jq
{
  "error": "500 - Internal server error",
  "message": "Error: The 'id' parameter must be a positive integer.\n"
}
$ curl localhost:8000/status?id=4 | jq
{
  "error": "500 - Internal server error",
  "message": "Error: No patient found with id: 4.\n"
}

You might notice that I passed debug = TRUE to the run() method above; this is because Plumber will only show the message field in the error responses in “debug” mode. This is partly for privacy – error messages could expose internal state you’d prefer users not to see – but it is also in recognition of the point I made above: random R error messages are rarely helpful to users.

Plumber’s default error handler does a few useful things:

Prints the error to the console, so we can see it on the server side. This is absolutely essential for tracking down bugs.
Sets the status code to 500; and
Adds the error message to the response (as the message field you see above), but only when running in debug mode.

Unfortunately, this all means that we can’t use the default handler to send operational error messages back to the user. Instead, we can circumvent it by constructing error responses manually, or override it with smarter code.

Manual Error Reporting

To generate useful operational errors for users, we need to do two things: first, come up with a meaningful payload for errors; and second, ensure that errors set an appropriate HTTP status code. Both of these can be accomplished by manually modifying the response object that Plumber exposes as the magic parameter res.

There are many, many different takes on how to report errors in JSON; I’m going to use a pretty simple one here and include just a status code¹ and a message. For example:

{
  "status": 400,
  "message": "Missing required parameter."
}

Similarly, there is some debate on how to map errors like “invalid parameter” to HTTP status codes, but here I’ll use 400. Both 422 and 409 are common alternatives. For the case when a patient can’t be found, I also think it makes sense to use 404.

status <- function(id, res) {
  if (missing(id)) {
    res$status <- 400
    res$body <- jsonlite::toJSON(auto_unbox = TRUE, list(
      status = 400,
      message = "Missing required 'id' parameter."
    ))
    return(res)
  }
  id <- suppressWarnings(as.integer(id))
  if (is.na(id)) {
    res$status <- 400
    res$body <- jsonlite::toJSON(auto_unbox = TRUE, list(
      status = 400,
      message = "The 'id' parameter must be a positive integer."
    ))
    return(res)
  }

  record <- records[records$id == id,]
  if (nrow(record) == 0) {
    res$status <- 404
    res$body <- jsonlite::toJSON(auto_unbox = TRUE, list(
      status = 404,
      message = paste0("No patient found with id: ", id, ".")
    ))
    return(res)
  }
  record$status <- if (!is.na(record$released)) "Released" else "Admitted"

  unclass(record)
}

This gives us much nicer, more meaningful errors we can safely pass down to users of the API:

$ curl -s localhost:8000/status | jq
{
  "status": 400,
  "message": "Missing required 'id' parameter."
}
$ curl -s localhost:8000/status?id=moose | jq
{
  "status": 400,
  "message": "The 'id' parameter must be a positive integer."
}
$ curl -s localhost:8000/status?id=4 | jq
{
  "status": 404,
  "message": "No patient found with id: 4."
}

The code to manipulate res objects for error handling ends up involving a lot of copy & paste, especially for larger APIs where you want to report certain classes of errors in a standard way. Ideally, we want to provide some helper functions so that API authors do the right thing without needing to copy so much code.
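As an intermediate step, the repeated response-building code could be pulled into a single function. The following is a minimal sketch rather than anything Plumber provides: error_response is a hypothetical name, and a plain environment stands in for Plumber's response object so that the helper can be tried outside a running server.

```r
# Hypothetical helper (not part of Plumber): fill in a response object with
# a JSON error payload of the form {"status": ..., "message": ...}.
error_response <- function(res, status, message) {
  res$status <- status
  res$body <- jsonlite::toJSON(
    list(status = status, message = message),
    auto_unbox = TRUE
  )
  res
}

# A plain environment is used here as a stand-in for Plumber's `res` object,
# which is also modified by reference.
res <- new.env()
out <- error_response(res, 404, "No patient found with id: 4.")
out$status
cat(out$body, "\n")
```

Inside an endpoint, each error branch would then shrink to something like return(error_response(res, 400, "Missing required 'id' parameter.")). It still requires threading res through every check, which is what the condition-based approach avoids.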

Emitting Errors via Custom Conditions

The underlying machinery that powers R’s stop(), warning(), and message() is the concept of a condition. We can construct and “signal” error-like conditions using a simple S3 object that inherits from the "error" class:

api_error <- function(message, status) {
  err <- structure(
    list(message = message, status = status),
    class = c("api_error", "error", "condition")
  )
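  # stop() accepts a condition object and, unlike signalCondition(), halts
  # execution even when no handler has been established.
  stop(err)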
}

# Works like stop():
api_error("Bad request.", 400)
#> Error: Bad request.

Moreover, since these are S3 objects, we can use the class attribute to sort out which errors are purposeful, operational errors that need to be reported to the user, and those that are not:

error_handler <- function(req, res, err) {
  if (!inherits(err, "api_error")) {
    res$status <- 500
    res$body <- "{\"status\":500,\"message\":\"Internal server error.\"}"

    # Print the internal error so we can see it from the server side. A more
    # robust implementation would use proper logging.
    print(err)
  } else {
    # We know that the message is intended to be user-facing.
    res$status <- err$status
    res$body <- sprintf(
      "{\"status\":%d,\"message\":\"%s\"}", err$status, err$message
    )
  }
  res
}

# Add this to the server with
# server$setErrorHandler(error_handler)
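One caveat with building the payload via sprintf() is that a message containing double quotes or backslashes would yield invalid JSON. A possible variant, sketched here under the assumption that jsonlite is available (it is already used above), lets jsonlite do the escaping. The name error_handler_json and the environment standing in for res are illustrative only.

```r
# Variant of the error handler that delegates JSON encoding to jsonlite, so
# that quotes and other special characters in messages are escaped properly.
error_handler_json <- function(req, res, err) {
  if (inherits(err, "api_error")) {
    # Operational error: the status and message are meant for the user.
    status <- err$status
    message <- conditionMessage(err)
  } else {
    # Programmer error: hide the details, but print them on the server side.
    status <- 500
    message <- "Internal server error."
    print(err)
  }
  res$status <- status
  res$body <- jsonlite::toJSON(
    list(status = status, message = message),
    auto_unbox = TRUE
  )
  res
}

# Try it outside a server, with an environment standing in for `res`.
err <- structure(
  list(message = "Unknown field \"colour\".", status = 400),
  class = c("api_error", "error", "condition")
)
out <- error_handler_json(NULL, new.env(), err)
parsed <- jsonlite::fromJSON(as.character(out$body))
parsed$message
```

The handler has the same three-argument shape as the one above, so it could be registered with setErrorHandler() in the same way.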

I’d also advise writing some helper functions, like the following:

not_found <- function(message = "Not found.") {
  api_error(message = message, status = 404)
}

missing_params <- function(message = "Missing required parameters.") {
  api_error(message = message, status = 400)
}

invalid_params <- function(message = "Invalid parameter value(s).") {
  api_error(message = message, status = 400)
}

These helper functions allow us to simplify and clarify the code so that it is as concise and familiar looking as it was when we were using stop():

status <- function(id, res) {
  if (missing(id)) {
    missing_params("Missing required 'id' parameter.")
  }
  id <- suppressWarnings(as.integer(id))
  if (is.na(id)) {
    invalid_params("The 'id' parameter must be a positive integer.")
  }

  record <- records[records$id == id,]
  if (nrow(record) == 0) {
    not_found(paste0("No patient found with id: ", id, "."))
  }
  record$status <- if (!is.na(record$released)) "Released" else "Admitted"

  unclass(record)
}

Using a custom error handler and the structured error support of S3 conditions, we now have a way to emit operational errors with ease and a consistent JSON error reporting format. This is an essential piece of providing a robust, user-friendly Plumber API.
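A side benefit of classed conditions is that endpoint logic can be checked without starting a server, by capturing the condition with tryCatch(). The sketch below is self-contained, so it repeats condensed versions of the definitions above; find_patient is a hypothetical function reduced to the lookup step, and api_error here raises the condition with stop().

```r
# Condensed versions of the definitions from the post.
api_error <- function(message, status) {
  stop(structure(
    list(message = message, status = status, call = NULL),
    class = c("api_error", "error", "condition")
  ))
}
not_found <- function(message = "Not found.") {
  api_error(message = message, status = 404)
}

# Hypothetical endpoint logic, reduced to the lookup step.
records <- data.frame(id = 1:3, name = c("George", "Sally", "Michael"))
find_patient <- function(id) {
  record <- records[records$id == id, ]
  if (nrow(record) == 0) {
    not_found(paste0("No patient found with id: ", id, "."))
  }
  record
}

# Capture the condition object instead of letting the error propagate.
caught <- tryCatch(find_patient(4), api_error = function(e) e)
caught$status
conditionMessage(caught)

# A valid id returns the record as usual.
find_patient(2)$name
```

In a package, the same checks would more naturally be written as testthat expectations on the condition class and its status field.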

¹ I like having the original status code as part of the error payload. That way, even if I don’t have access to the full original request (e.g. someone just copy & pasted the error message to me, or it’s not in the logs, or a proxy along the way did not forward it appropriately), I still have a good idea where to look.

Writing Proprietary R Packages

Aaron Jacobs — Tue, 26 Nov 2019 00:00:00 +0000

Author’s note: this is a lightly modified version of the talk I gave at the GTA R User’s Group in May of this year. You can find the original slides here. Unfortunately, the talk was not recorded.

As I have noted before, most resources for R package authors are pitched at those writing open-source packages — usually hosted on GitHub, and with the goal of ending up on CRAN.

These are valuable resources, and reflect the healthy free and open-source (FOSS) R package ecosystem. But it is not the whole story. Many R users, especially those working as data scientists in industry, can and should be writing packages for internal use within their company or organisation.

Yet there is comparatively little out there about how to actually put together high-quality packages in these environments.

This post is my attempt to address that gap.

At work we have more than 50 internal R packages, and I have been heavily involved in building up the culture and tooling we use to make managing those packages possible over the last two years.

I’ll focus on three major themes: code, tooling, and culture.

Why Should You Write R Packages for Internal Use?

At the outset, it is worth repeating that R packages are the best way to share R code and keep it well-maintained and reliable. This matters even more inside an organisation or when you are part of a team.

The most salient reason why this is the case is that common tools to make your R code robust, portable, and well-documented are only available for use with packages: R CMD check, testthat, and roxygen are all good examples.
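To make this concrete, here is a rough sketch of what a small packaged function might look like once it has roxygen documentation and a test. cents_to_dollars is a made-up example, not one of our internal functions, and the check uses base R so that the snippet runs on its own.

```r
#' Convert an amount in cents to dollars.
#'
#' Hypothetical example of a small business-logic helper.
#'
#' @param cents A numeric vector of amounts in cents.
#' @return A numeric vector of amounts in dollars.
#' @examples
#' cents_to_dollars(c(150, 275))
#' @export
cents_to_dollars <- function(cents) {
  stopifnot(is.numeric(cents))
  cents / 100
}

# In a package this check would live under tests/testthat/ and use
# testthat::expect_equal(); base R is used here to avoid the dependency.
stopifnot(isTRUE(all.equal(cents_to_dollars(c(150, 275)), c(1.5, 2.75))))
```

With the function in a package, roxygen turns the comment block into a help page, and R CMD check runs both the example and the tests.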

I would push this further than most, and suggest you put as much R code as you can get away with inside packages. For instance, we put all production models in R packages, much of our ETL, and a good portion of our Shiny apps – which has recently become a lot easier due to the excellent golem package.

The Code: What Can Internal Packages Contain?

It is an oft-repeated observation that when you find yourself copy/pasting the same function or snippet of R code between projects, it might be time to push that code into a package.

Inside a business or organisation, functions that I’ve seen generally fall into a small number of categories:

Tools for making data easy to access and use correctly. For example, accessing the right database with the right credentials.
Plot themes and other internal conventions ported to R, such as presentation templates.
Business logic: conversions and routines that are highly specific to your organisation or industry; and
Encoding process in code – e.g. automation of team or company-specific tasks.

As an illustration, all of the following are real functions from our internal packages:

# Accessing data.
pull_data(...)
mongo_collection(...)

# Plot themes and templates.
theme_pinnacle(...)
pinnacle_presentation(...)

# Business logic.
vig_cents_to_percent(...)
get_clp_est(...)

# Encoding process.
send_to_slack(...)
rnd_release(...)

There could be other good candidates. As I mentioned, we put models, Shiny applications, and ETL into packages when possible, some of which turns into reusable code.

The Tooling: Limitations and Opportunities

Unlike in the CRAN/FOSS world, in an organisation you’ll have limited power to choose the tools already in use for collaboration and development. And, unless you’ve got a compelling reason, you should adopt your organisation’s existing tools.

For example, my organisation uses TeamCity for continuous integration, which does not support R out of the box. In order to get access to shared CI resources, we had to make this possible (which involved using Docker and some custom scripts).

The upside of this is that we could then hook into integrations used by the rest of the organisation. For instance, Slack alerts for R packages that pass or fail their CI tests:

Because everyone is required to use the same tools, you can sometimes turn this to your advantage and leverage a shared tool for opportunities you might not otherwise have.

For example, since everyone is required to be in the same Slack channel or read their corporate email, you can ensure that everyone is notified of important R package releases – e.g. this is our Slack bot that posts R package releases:

This kind of broadcast is genuinely impossible for CRAN packages, because there is simply no medium by which to contact all R users. FOSS communities are and always will be more decentralised and heterogeneous.

Moreover, tools can create data (e.g. releases, downloads, commit activity, email messages, Slack messages) that can be analysed to help you measure and understand bottlenecks or problems with your internal processes, since you know they represent the whole picture. I’ve used these data to decide what would help my team be more productive on several occasions.

The Culture: Authorship vs. Maintainership

Ultimately, I think the biggest difference between FOSS and proprietary R packages is a cultural one.

Most CRAN packages are written by someone trying to scratch an itch. Maybe that’s a new statistical method or data transformation, or a new approach to older ones; maybe it’s a new data format that you need access to from R; or maybe it’s just a bundle of cool stuff you’ve done that you want to make more widely accessible.

There are some consequences of this model. Even if the package is released to CRAN, the expectation is that the original author will (1) design the APIs; (2) write the code; and (3) maintain the package. The author will make all the major decisions about where the package goes and be the authority on why code works the way it does.

The author unconsciously acts as though they will maintain the code forever (or more likely: until the package is abandoned).

I call this an Authorship paradigm.

These packages have a bus factor of exactly 1 – that is, if the author gets hit by a bus, that’s probably the end of the project. It is not an unusual state of affairs for FOSS software, by any means.

Things are dramatically different for proprietary code:

You likely inherited a codebase you did not design, and it is your job to maintain it, to understand it, and to be the person to ask about bugs and new features.
You will not maintain new code you have designed/written forever. At some point it will likely be someone else’s job.

I call this a Maintainership paradigm.

What does this imply about writing R packages? Well, you should write the kind of R code you’d like to maintain.

What Kind of R Package Would You Like to Maintain?

This is not a trick question. You want to see

Clear R source code with helpful comments.
Good documentation, with clear explanations and examples, and an overview of the main package features in a README file.
A test suite, both as additional examples and also to help prevent you from introducing regressions.
A documented history: an up-to-date NEWS file and clear git commit history.

In other words, pretty much all of the usual advice. The difference is that you are primarily motivated by collaboration with current and future coworkers.

In sum, I think that there is quite a lot of overlap between “best practice” package development in FOSS and proprietary environments. The main differences arise from the motivations of package authors, the tools and ecosystems they might operate in, and the nature of the code they might write.