Making R Fly with Rcpp

2017-11-22 · 5 min

What should we do when we need to make our script run faster? Usually the first idea we have is optimizing the code: reducing the amount of loops, making data structures smaller, using parallel computing, etc… But when it comes to R, we can speed up our code almost without changing its structure.

In this post I’ll give you a basic overview of the Rcpp package, a tool that allows us to run C++ code from within R.

The basics

C++ is a very famous and versatile programming language. As Wikipedia points out, “it has imperative, object-oriented and generic programming features, while also providing facilities for low-level memory manipulation”.

An interesting characteristic of C++ is that it is extremely fast. Unlike R, it is a compiled language, with static typing and that doesn’t supply the user with too many abstractions, allowing the code to execute with impressive efficiency.

To explore the benefits that C++ can bring to your R code, install and load the Rcpp package with the commands below:

install.packages("Rcpp")
library(Rcpp)

Now let’s take a look at a simple example of how to call C++ code from within R. The easiest way to do this is with the cppFunction(): it takes a string that will be interpreted as a C++ function.

add_r <- function(x, y, z) {
  sum = x + y + z
  return(sum)
}

cppFunction(
  "int add_c(int x, int y, int z) {
    int sum = x + y + z;
    return sum;
  }")

add_r(1, 2, 3)
#> [1] 6

add_c(1, 2, 3)
#> [1] 6

As we can see in the example above, both functions have the same behavior despite some superficial differences. Note how we always have to declare the types of the variables in C++! Using the int keyword we make it clear to the compiler that a variable will be of type integer and even that a function will return a value of type integer. Another important thing to remember is that we need to add a semi-colon to the end of every C++ command.

Vectors

Normally C++ would have enormous differences in relation to R regarding how they deal with vectors, but (lucky us!) Rcpp has a library of structures that abstract R’s behavior. In the next example we have a function that takes a number and a numeric vector, computes the euclidean distance between the value and the vector and returns a numeric vector as output.

dist_r <- function(x, ys) {
  sqrt((x - ys) ^ 2)
}

cppFunction(
  "NumericVector dist_c(double x, NumericVector ys) {
    int n = ys.size();
    NumericVector out(n);
  
    for(int i = 0; i < n; i++) {
      out[i] = sqrt(pow(ys[i] - x, 2.0));
    }
    return out;
  }")

dist_r(10, 20:25)
#> [1] 10 11 12 13 14 15

dist_c(10, 20:25)
#> [1] 10 11 12 13 14 15

The NumericVector structure abstracts a numeric vector from R, allowing us to work with it in a more familiar manner. With the .size() method we get its length and can declare a new one with the NumericVector name(length); constructor. The only fundamentally different thing between C++ and R is that the first doesn’t have properly vectorized operations, meaning that we have to use loops for every iteration.

Maximum speed

Certain aspects of R’s philosophy make it an extremely versatile language, but this has its price. Some areas where R’s performance is a bit lacking are non-vectorizable loops (when an iteration depends on the previous one), recursive functions, and complex data structures.

In these and many other situations, using C++ can be very advantageous. In the following example we’ll see the difference in performance between a loop written in C++ and one in R; note that this isn’t even one of the 3 situations mentioned above and that the C++ code is still 6 times faster.

sum_r <- function(v) {
  total <- 0
  for (e in v) {
    if (e < 0) { total = total - e }
    else if (e > 0.75) { total = total + e/2 }
    else { total = total + e }
  }
  
  return(total)
}

cppFunction(
  "double sum_c(NumericVector v) {
    double total = 0;
    for (int i = 0; i < v.size(); i++) {
      if (v[i] < 0) { total -= v[i]; }
      else if (v[i] > 0.75) { total += v[i]/2; }
      else { total += v[i]; }
    }
    return(total);
  }")

v <- runif(100000, -1, 1)
microbenchmark::microbenchmark(sum_r(v), sum_c(v))
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq       max neval
#>  soma_r(v) 6.105048 6.436608 6.911819 6.718456 7.183266 11.610624   100
#>  soma_c(v) 1.045805 1.063956 1.161585 1.097920 1.210052  1.955702   100

Obs.: The symbols += and -= stand for a = a +/- b, while ++ stands for a = a + 1.

Conclusion

With the Rcpp package we can run C++ code from within R. With this technique we can optimize our code or even have access to complex data structures from C++.

To learn more about this subject take a look at the tutorial written by Hadley Wickham in the book Advanced R. I also recommend Rcpp’s own web page and its extensive gallery of examples.

P.S.: If you want the complete code for this tutorial, I’ve made it available as a Gist.