Chapter 13 Pipelines in R
So far we’ve used R functions in the traditional way. Every function is followed by a pair of parentheses (or brackets if you’re on my side of the pond) and the function acts on arguments that are inserted into these brackets. So if you have a vector of numbers and want to know their mean, you use the name of the vector as an argument for the mean() function.
# Set up vector
V1 <- c(-3, 8, 17, -1, 0.5)
# Calculate mean
mean(V1)
[1] 4.3
If you want to carry out more than one operation on your vector, you have two options. Firstly you can save the output from one operation and then feed it to the next, like this.
# Set up vector
V1 <- c(-3, 8, 17, -1, 0.5)
# Generate new vector of absolute values
V2 <- abs(V1)
# Calculate the mean of the absolute values
mean(V2)
[1] 5.9
Alternatively you can nest your function calls like this.
# Set up vector
V1 <- c(-3, 8, 17, -1, 0.5)
# Calculate the mean of the absolute values
mean(abs(V1))
[1] 5.9
For even greater efficiency you could do this last example in a single line of code without generating the V1 object at all: c() is just another function.
# Calculate the mean of the absolute values
mean(abs(c(-3, 8, 17, -1, 0.5)))
[1] 5.9
These different approaches have different strengths and weaknesses. The slow and steady approach of setting up a vector, calculating the absolute values and saving that as a new object and then calculating the mean has some benefits: it’s simple and the code is easy to read. It also has some drawbacks: it’s inefficient, and generates new objects in your workspace which then have to be deleted or they will just hang around and cause trouble.
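As a small illustration of the housekeeping the step-by-step approach involves, the sketch below repeats the calculation and then removes the intermediate object with rm(). The name result is just one chosen for this sketch, and exists() is used only to confirm that V2 has gone.

```r
# Step-by-step approach: each stage is saved as its own object
V1 <- c(-3, 8, 17, -1, 0.5)
V2 <- abs(V1)
result <- mean(V2)

# V2 is still sitting in the workspace after we've finished with it
exists("V2")

# Remove the intermediate object once it's no longer needed
rm(V2)
exists("V2")
```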
Nesting your function calls is obviously more efficient, but it is harder to read the code. As we add more operations we add extra levels of nested brackets (or parentheses if you wish) and it’s common to see 4, 5, or 6 levels of nested brackets in R code. This is often hard to check and a frequent cause of errors.
# Calculate the mean of the square root of the absolute values.
# NB in case you're thinking "no-one needs to do that",
# I did it just the other day on some data on asymmetry
# in antelope horns.
mean(sqrt(abs(c(-3, 8, 17, -1, 0.5))))
[1] 2.0781
In the last few years some alternative ways of dealing with multiple operations on objects in R have become available that are efficient, don’t generate spurious objects in the workspace but also lead to code that is sensible and easy to understand. These rely on the concept of the pipe — an operator that takes something from the left hand side of itself and passes it to a function call on the right. Because this is R, and so little is simple, there are two rather different options for doing this.
13.1 Magrittr pipes
The most widely used pipe option in R is enabled by loading either the magrittr package or the dplyr package (see next chapter for more on dplyr). It was originally written by Danish data scientist Stefan Milton Bache and first released in 2014. The package is named magrittr because ceci n’est pas une pipe, a reference to the classic surrealist work The Treachery of Images painted in 1929 by René Magritte. The magrittr pipe functionality was then incorporated into dplyr because it integrated so well with the data manipulation aims of the dplyr package.
How does it work? The pipe operator in magrittr is this symbol: %>%. It takes what’s on its left and pipes it to what’s on its right.
library(magrittr)
c(-3, 8, 17, -1, 0.5) %>% mean()
[1] 4.3
This has generated a vector of data and calculated its mean in a single, easy to read line of code without adding any objects to the workspace. To carry out multiple operations we can replace our unpleasant mess of nested brackets with a series of pipes chained together: a pipeline.
# Do it the old way
mean(sqrt(abs(c(-3, 8, 17, -1, 0.5))))
[1] 2.0781
# Using pipes
c(-3, 8, 17, -1, 0.5) %>% abs() %>% sqrt() %>% mean()
[1] 2.0781
The pipeline c(-3, 8, 17, -1, 0.5) %>% abs() %>% sqrt() %>% mean() can be read easily from left to right, avoids nesting and generates the same output as the nested code. To make it even more readable we can write it on multiple lines.
c(-3, 8, 17, -1, 0.5) %>%
  abs() %>%
  sqrt() %>%
  mean()
[1] 2.0781
13.1.1 Saving the output of a pipeline
If we want to save the output of the pipeline to an object we can do that in a couple of ways. To save it to a new object you can just use the assignment operator, <-, like this.
# Generate P1 object
P1 <-
  c(-3, 8, 17, -1, 0.5) %>%
  abs() %>%
  sqrt() %>%
  mean()
# Print P1
P1
[1] 2.0781
Alternatively if you want to feed something into a pipe and then save the pipe output to the original object you can use the %<>% operator. Here’s an example.
# Set up vector
V1 <- c(-3, 8, 17, -1, 0.5)
# Replace vector with the square roots of the unsigned values
V1 %<>% abs() %>% sqrt()
# Print V1
V1
[1] 1.73205 2.82843 4.12311 1.00000 0.70711
In this piece of code we first set up our vector, V1. We then take that vector and feed it into a pipeline which firstly generates the absolute values of the numbers in the pipeline and then calculates the square root of each. Because we started with the %<>% operator, the output of the pipeline is then saved back to the original object, V1.
A quick note here: the %<>% operator is not available if you are using dplyr unless you also load the magrittr package. dplyr only makes the basic %>% pipe operator available.
13.1.2 Magrittr pipes and function arguments
So far we’ve used simple functions in our pipelines. These functions all take a single argument, so sqrt() just takes a vector of numbers and calculates the square root. The way %>% works is that the data that is passed to a function is assumed to be the first argument for that function, unless you tell it otherwise. In fact, if the data going along the pipeline is the only argument needed for a function you can even leave out the brackets after the function name if you’re using magrittr, so V1 %>% mean works as well as V1 %>% mean().
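To make the “first argument” rule concrete, here is a minimal sketch, assuming the magrittr package is installed. The vector x and the choice of round() with digits = 1 are just for illustration: the piped vector fills the first argument, and anything written inside the brackets is passed along as the remaining arguments.

```r
library(magrittr)

x <- c(1.234, 5.678, 9.1011)

# The piped vector becomes the first argument of round();
# digits = 1 is supplied in the call as usual
piped <- x %>% round(digits = 1)

# The same thing written as an ordinary function call
nested <- round(x, digits = 1)

# Check that the two approaches give the same answer
identical(piped, nested)
```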
What if you want to pass your data to an argument that is not the first one for a function? Here’s a simple example. We want to do a series of things. First, generate 1000 random numbers drawn from a standard normal distribution with rnorm(1000), then round them to the nearest whole number with round(), then calculate the maximum of our vector using max(). This can all be done easily with a pipeline because round() and max() both work if we just give them a single argument.
rnorm(1000) %>%
  round() %>%
  max()
[1] 4
Nice and easy. Now we want to generate a series of numbers from 1 to the value generated by our pipeline. We can do this using the seq() function but this takes two arguments, from = and to =, and the to = argument comes second. Fortunately magrittr has a simple way of doing this: to get seq() to use the value that we’re piping into it as the second argument, we just put a full stop (period), often referred to as the placeholder symbol, where we want our piped data to go, so seq(from = 1, to = .).
rnorm(1000) %>%
  round() %>%
  max() %>%
  seq(from = 1, to = .)
[1] 1 2 3 4
As a second example, you might want to edit a data frame and then pass it to a plot() function to draw a graph. Here are the data on fish leucocyte counts we looked at in the last chapter.
fish <- read.csv("Data/Counts.csv", header = TRUE)
fish
  Sex Temp Weight Count
1   M  Hot  73.25   282
2   M  Hot  69.28   170
3   F  Hot  81.38   151
4   M  Hot  66.07   238
5   F Cold  83.32   136
6   F Cold  63.06   203
7   M Cold  78.48   312
8   M Cold  55.38   274
If we wanted to use a pipeline firstly to select only those rows corresponding to fish which weighed 70mg or more and then to plot leucocyte count against weight we could do it this way.
fish %>%
  subset(Weight >= 70) %>%
  plot(Count ~ Weight, data = .)
Let’s go through this in more detail. We pass the fish data frame to the subset() function which we used in the last chapter. This will select only those rows of data where Weight is greater than or equal to 70. In this case the data frame is assumed to be the first argument for subset(), so we don’t need to worry about it. subset() also takes a further argument which tells it the criterion for selecting data, Weight >= 70, which we just add into the function call.
The output of subset() is then piped to our plot() function. There’s a lot (an awful lot) more on plotting data coming up, but for the moment you just need to know that the first argument for plot() tells it what data to draw and on which axes, so Count ~ Weight is telling plot() to draw Count on the y-axis and Weight on the x-axis. plot() can also take an argument data = which tells it which data frame to use for the data it’s plotting: as an example, if we were going to use the whole fish data frame rather than the subsetted one we could just write plot(Count ~ Weight, data = fish). That data = argument is where we want the output from our pipeline to go, so we just use the placeholder symbol by putting a full stop (period) where the piped data should be in the function call, hence plot(Count ~ Weight, data = .).
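The same placeholder trick works for any function whose data argument isn’t first. As a hedged sketch, this uses R’s built-in mtcars data set rather than the fish data (which you may not have to hand) and lm(), which fits a linear model, rather than plot(). Both are chosen only for illustration, and the code assumes magrittr is installed.

```r
library(magrittr)

# Pipe a subset of the built-in mtcars data into lm(). The first
# argument of lm() is the formula, so the data frame has to be
# sent to data = using the . placeholder
fit_piped <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .)

# The equivalent nested call, for comparison
fit_nested <- lm(mpg ~ wt, data = subset(mtcars, cyl == 4))

# Both routes should give the same fitted coefficients
all.equal(coef(fit_piped), coef(fit_nested))
```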
13.2 Base R pipes
Magrittr has been available for a few years now and the ability to pipe data between functions has proven very useful. This has been noticed by the R development community and we now have a pipe operator built into the base installation of R, from version 4.1 onwards. This new pipe is |>. Some things are similar to %>% but some things are different. Let’s have a look.
For basic operations |> works in a way that is superficially very similar to %>%. Here’s our example from the beginning of the chapter written with the magrittr operator, as we used it before.
c(-3, 8, 17, -1, 0.5) %>%
  abs() %>%
  sqrt() %>%
  mean()
[1] 2.0781
Here it is with the new pipe from base R.
c(-3, 8, 17, -1, 0.5) |>
  abs() |>
  sqrt() |>
  mean()
[1] 2.0781
Functionally these are exactly equivalent and give the same output. The |> operator pipes from left to right in just the same way as %>% and we get the same result. Internally the two work in rather different ways but that doesn’t affect the output. If you want to save the output of a pipeline to an object then once again you can use the assignment operator <-.
# Generate vector, calculate the mean of the square roots of the
# unsigned values and save to object P1
P1 <- c(-3, 8, 17, -1, 0.5) |>
  abs() |>
  sqrt() |>
  mean()
# Print P1
P1
[1] 2.0781
13.2.1 Placeholders and base R pipes
The big difference between the two pipes comes when you want to assign the piped material to a function argument other than the first one. Unlike the magrittr pipe, the base R pipe had no built-in placeholder symbol when it was introduced in R 4.1. (R 4.2 added one, the underscore _, which has to be given to a named argument.) There are two other ways to deal with this, both of which work from R 4.1 onwards. The easiest is to use what is called the pipebind operator, =>, which lets you specify any symbol to be a placeholder rather than restricting you to . as magrittr does. This is in a lot of ways better than the magrittr option because full stops (periods) also appear elsewhere in R code, for example in formulae as used to specify statistical models, and this could cause confusion. The only problem with using the pipebind symbol is that it’s not fully implemented or supported in R as yet (hopefully it will be soon). If you try to use it and get this error:
Error: '=>' is disabled; set '_R_USE_PIPEBIND_' envvar to a true value to enable it
then you need to enable it using this piece of code.
Sys.setenv(`_R_USE_PIPEBIND_` = TRUE)
Here’s our example pipeline with a placeholder from before, but this time with the base R pipe and the pipebind operator.
rnorm(1000) |>
  round() |>
  max() |>
  d => seq(from = 1, to = d)
[1] 1 2 3 4
Here we’ve used the letter d as our placeholder and specified it before the function in question using =>.
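If you are running R 4.2 or later there is also a native placeholder, the underscore _, which can stand in for the piped value as long as it is given to a named argument. This is a minimal sketch, assuming R 4.2 or later, with a fixed value of 4 in place of the random maximum so that the result is predictable. The name s is just for illustration.

```r
# Requires R 4.2 or later: _ marks where the piped value should go
# and must be supplied to a named argument (here, to =)
s <- 4 |>
  seq(from = 1, to = _)

# Print the sequence running from 1 to the piped value
s
```

Unlike the pipebind operator, this doesn’t need any environment variable to be set.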
13.2.2 Lambda functions and base R pipes
The alternative to the pipebind symbol if you’re using base R pipes is to use something called a lambda function. This is a bit advanced and requires you to know how to write your own functions, which is not something we’ve covered yet. We’ll look at this briefly but I would recommend sticking with the simpler pipebind option until you’re comfortable with writing your own functions. If the example below doesn’t make sense then ignore it until you’re more familiar with writing functions in R.
The idea is that you can write a function within your code which enables the piped data to be directed to the correct argument. Here’s a very simple example: we want to generate a random number between zero and ten and then generate a set of ten numbers drawn from a normal distribution with sd = 1 and mean equal to our random number. This code would do this without a pipeline.
# Generate random number
X1 <- runif(n = 1, min = 0, max = 10)
# Generate 10 numbers from normal distribution with mean = X1 and sd = 1
rnorm(n = 10, mean = X1, sd = 1)
[1] 8.1665 7.6003 8.5623 8.1483 7.8117 6.9847 7.7760 8.5618 9.0014 7.1214
We can also do this by nesting the two functions, which avoids generating new objects but makes our code a little harder to read.
rnorm(n = 10, mean = runif(n = 1, min = 0, max = 10), sd = 1)
[1] 8.1665 7.6003 8.5623 8.1483 7.8117 6.9847 7.7760 8.5618 9.0014 7.1214
To do this with the base R pipe, we can use the pipebind operator:
runif(n = 1, min = 0, max = 10) |>
  m => rnorm(n = 10, mean = m, sd = 1)
[1] 8.1665 7.6003 8.5623 8.1483 7.8117 6.9847 7.7760 8.5618 9.0014 7.1214
Alternatively we can write a little function within our pipeline.
runif(n = 1, min = 0, max = 10) |>
  (function(x) rnorm(n = 10, mean = x, sd = 1))()
[1] 8.1665 7.6003 8.5623 8.1483 7.8117 6.9847 7.7760 8.5618 9.0014 7.1214
As a final alternative, you can use the shorthand lambda notation, where function(x) is replaced with \(x).
runif(n = 1, min = 0, max = 10) |>
  (\(x) rnorm(n = 10, mean = x, sd = 1))()
[1] 8.1665 7.6003 8.5623 8.1483 7.8117 6.9847 7.7760 8.5618 9.0014 7.1214
I generally find the pipebind operator a lot more intuitive and straightforward than the lambda function option, which seems to needlessly multiply brackets and complexity. It is something you often see used with the base R pipe however, and in some cases it might well allow more efficient code.
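One case where the lambda approach earns its keep is when the piped value is needed more than once in the same step. As an illustrative sketch (the standardising step and the names v and z are chosen just for this example), here the piped vector is used three times inside a single anonymous function.

```r
# Standardise a vector: subtract the mean and divide by the
# standard deviation. The piped vector v is used three times.
z <- c(-3, 8, 17, -1, 0.5) |>
  (\(v) (v - mean(v)) / sd(v))()

# Standardised values should have a mean of (effectively) zero
# and a standard deviation of one
mean(z)
sd(z)
```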
13.2.3 Bootstrap CI function as a pipeline
One of the things we did in the last chapter was to develop a function to calculate the bootstrap confidence intervals for the mean of a vector of numbers. The code for this takes the vector of numbers, repeatedly samples from it to generate 1000 new sets of data sampled from the original, calculates the mean of each one and then generates the confidence intervals from the quantiles of the distribution of new means. Here it is.
boot.conf <- function(X1, conf = 95) {
  bootstraps <-
    replicate(1000, sample(X1,
                           size = length(X1),
                           replace = TRUE))
  bootstrap.means <- apply(bootstraps, 2, mean)
  output <-
    c(quantile(bootstrap.means, (100 - conf) / 200),
      quantile(bootstrap.means, 1 - (100 - conf) / 200))
  output
}
This can be done rather nicely with a pipeline, as follows:
# Set up function
boot_conf <- function(X1, conf = 95) {
  # Generate 1000 bootstrap samples
  replicate(1000,
            sample(X1,
                   size = length(X1),
                   replace = TRUE)) |>
    # Calculate means
    apply(MARGIN = 2, FUN = mean) |>
    # Generate CIs from the quantiles of the means
    (\(x) c(quantile(x, (100 - conf) / 200),
            quantile(x, 1 - (100 - conf) / 200)))()
}
Using the new base R pipe syntax we firstly generate the 1000 replicated samples using the replicate() and sample() functions. This generates a matrix of numbers, with each column being a separate bootstrap replicate. We then pipe this matrix to the apply() function, which carries out operations on each row or column of a matrix. Because the first argument to apply() is our matrix we can just pipe it straight in. We set the MARGIN argument to 2, so our operation is done on each column, and we set FUN to mean, so apply() knows to calculate the mean of each column. This generates a vector of means, and we pipe that vector into a short lambda function which calculates the upper and lower bootstrap confidence intervals for us. Does it work?
# Generate 100 numbers drawn from a normal distribution
# with mean 32 and standard deviation 3
test <- rnorm(n = 100, mean = 32, sd = 3)
# Calculate 66 and 95% bootstrap CIs
boot_conf(test, conf = 66)
17% 83%
31.396 31.955
boot_conf(test, conf = 95)
2.5% 97.5%
31.160 32.274
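Because the test data are random numbers, your values will differ from the ones printed here. One way to check that the pipeline version really does the same job as the original is to run both after setting the same random seed, as in this sketch. The seed values are arbitrary, the names ci_original and ci_pipeline are just for illustration, and both functions are repeated (in slightly condensed form) so the block runs on its own.

```r
# Original version with intermediate objects
boot.conf <- function(X1, conf = 95) {
  bootstraps <- replicate(1000, sample(X1, size = length(X1), replace = TRUE))
  bootstrap.means <- apply(bootstraps, 2, mean)
  c(quantile(bootstrap.means, (100 - conf) / 200),
    quantile(bootstrap.means, 1 - (100 - conf) / 200))
}

# Pipeline version
boot_conf <- function(X1, conf = 95) {
  replicate(1000, sample(X1, size = length(X1), replace = TRUE)) |>
    apply(MARGIN = 2, FUN = mean) |>
    (\(x) c(quantile(x, (100 - conf) / 200),
            quantile(x, 1 - (100 - conf) / 200)))()
}

# Generate some test data
set.seed(42)
test <- rnorm(n = 100, mean = 32, sd = 3)

# Reset the seed before each call so that both functions
# draw exactly the same bootstrap samples
set.seed(1)
ci_original <- boot.conf(test, conf = 95)
set.seed(1)
ci_pipeline <- boot_conf(test, conf = 95)

# With matching seeds the two sets of intervals should be identical
identical(ci_original, ci_pipeline)
```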
13.3 Which one to use?
Should you use the magrittr pipe or the base R one? This is a good question. They both give you the same functionality and let you write clearer, easier to understand code and avoid the horrors of six-level nested brackets. At the moment you’ll see the magrittr pipe being used a lot more simply because it’s been around a lot longer and all the dplyr etc. literature uses it. That may well change in future however. While the easy placeholder functionality of magrittr seems more straightforward than the base R options, the base R options are a bit more flexible and avoid some potential issues with the choice of a full stop . as the placeholder in magrittr. Moreover, using the base R pipe means that you are much less likely to have compatibility issues in future. The fewer packages you load, the lower the likelihood that when you try to run your code in two years’ time it will all go pear shaped because one of the packages has changed since you wrote it. Base R operators are going to stick around for a long time.
Finally, the base R pipes run a bit faster than magrittr pipes. This is because magrittr is doing quite a bit in the background, whereas the base R pipe is converted into the unpiped code when R reads it, so mean(X1) and X1 |> mean() cause R to do the exact same things. This is unlikely to make a difference to most people, but if you’re (for example) running big simulations or other very resource heavy code it’s worth bearing in mind.
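You can see this for yourself with quote(), which returns R code as it has been parsed rather than running it. In this sketch X1 is just a symbol and doesn’t need to exist, and the names piped and plain are only for illustration. By the time R has read the piped expression it has already been rewritten as an ordinary function call.

```r
# quote() captures the expression after parsing, without evaluating it
piped <- quote(X1 |> mean())
plain <- quote(mean(X1))

# Print the piped expression as R sees it
piped

# Compare the two parsed expressions
identical(piped, plain)
```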
13.4 RStudio shortcuts
RStudio has a built in keyboard shortcut for the pipe symbol. On a Mac this is command-shift-m, and on a PC it is control-shift-m. The default is for this to give the magrittr pipe operator %>% but in the latest RStudio releases it can be changed to the base R pipe |> by opening preferences, selecting the code tab and checking the box next to “Use native pipe operator, |>”.