I've never thought of myself as a developer and honestly, neither have most other people. My job for the last several years has been anything but, despite the juice I try to squeeze from that orange. But there was a glorious time back in 2018 when I was part of an awesome team and we would challenge each other to solve problems in interesting ways (not coincidentally, that's when I last posted here).
One of these problems seemed pretty minor at first but turned into a fun solve: how do you simultaneously substitute multiple strings safely? Say you have 'hey, ho, let's go!' and you want to replace 'hey' with 'ho' and 'ho' with 'hey' (it's a dumb example, I know). If you do your substitutions one at a time, the second pass also rewrites the output of the first, so you end up with either 'hey, hey' or 'ho, ho' depending on the order, never the 'ho, hey' you wanted.
In the R language (I grew up as a statistician and it will always be my first love) I solved it and I felt pretty proud. I had a blog at the time and I gloated about how other attempts to solve the problem in R invariably ran into trouble because they took shortcuts. I humbly acknowledged there was a performance difference between the solutions but crowed that nobody would care! Safety was all that mattered! The author of one of those packages (much bigger than mine) noted that the performance difference became significant on larger bodies of text and that for most practical use cases, the edge cases weren't a real risk. Humbled (and honored that he still included my package as a dependency in case safety was critical), I started thinking about how to solve the performance problem.
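To make the 'hey'/'ho' problem concrete, here is a minimal C++ sketch of both approaches. It is not the mgsub implementation: the function names `replace_sequential` and `replace_simultaneous` are made up for illustration, it only handles fixed strings, and the "longest match wins" rule is an assumption I picked for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive approach (hypothetical helper): apply each substitution in turn.
// Later substitutions can rewrite text produced by earlier ones.
std::string replace_sequential(std::string text,
                               const std::vector<std::string>& patterns,
                               const std::vector<std::string>& replacements) {
    assert(patterns.size() == replacements.size());
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (patterns[i].empty()) continue;  // an empty pattern would never advance
        std::size_t pos = 0;
        while ((pos = text.find(patterns[i], pos)) != std::string::npos) {
            text.replace(pos, patterns[i].size(), replacements[i]);
            pos += replacements[i].size();  // resume after the inserted text
        }
    }
    return text;
}

// Simultaneous approach (hypothetical helper): scan the input once, left to
// right. At each position, take the longest pattern that matches there
// (an assumption of this sketch), append its replacement to the output and
// skip past the match. The output is never rescanned.
std::string replace_simultaneous(const std::string& text,
                                 const std::vector<std::string>& patterns,
                                 const std::vector<std::string>& replacements) {
    assert(patterns.size() == replacements.size());
    std::string out;
    out.reserve(text.size());
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t best = patterns.size();  // sentinel meaning "no match here"
        for (std::size_t i = 0; i < patterns.size(); ++i) {
            const std::string& p = patterns[i];
            if (p.empty()) continue;
            const bool matches = text.compare(pos, p.size(), p) == 0;
            if (matches && (best == patterns.size() ||
                            p.size() > patterns[best].size())) {
                best = i;
            }
        }
        if (best == patterns.size()) {
            out.push_back(text[pos]);  // no pattern starts here, copy one char
            ++pos;
        } else {
            out += replacements[best];
            pos += patterns[best].size();
        }
    }
    return out;
}
```

With patterns `{"hey", "ho"}` and replacements `{"ho", "hey"}`, `replace_sequential` turns 'hey, ho, let's go!' into 'hey, hey, let's go!', while `replace_simultaneous` gives 'ho, hey, let's go!'. The single pass is safe because a replacement is written to the output and never looked at again.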
R, like many high-level languages, is built on lower-level languages, specifically C and Fortran, and it provides mechanisms for writing functions in those languages when you need more performance. It also supports C++, which can talk to R through its C API.
I have finally gotten around to converting the key functions to C++ and oh my, is the performance better. I ran two benchmarks. The first was a list of 10,000 strings with 20 matches and 20 replacements. The second was a single string of nearly 200,000 characters (equivalent to a book of more than 100 pages), also with 20 matches and 20 replacements. The average time to complete each task is listed in the table below.
| Task | Base R | C++ | Improvement |
|---|---|---|---|
| 10,000 Strings | 4s | 1s | 4x |
| 200,000 Characters | 48s | 9.5s | 5x |
I thought I must have done something wrong when I first saw the results, so I reran the benchmarks, only to get the same numbers again. So I'm excited to be publishing version 2.0 of mgsub to CRAN. There may be no new functionality, but there's a lot more performance.
And who knows? Maybe this will take me from the 1154th position in lifetime downloads to 1153rd!