Chapter 14 Using dplyr to manipulate data

You can get a lot done using the options for selecting data with subscripts etc. which we looked at earlier. For more sophisticated data handling, however, you are likely to want to use the dplyr package. This is one of the most popular “add on” packages for R, written by Hadley Wickham and part of the “tidyverse” group of packages. Together these give a great deal of extra functionality to the language and make many processes a lot more consistent and straightforward. dplyr is a complex and powerful package, but we will run through the basics of how it works here and hopefully you’ll be able to get an idea of the basics.

To install dplyr simply run

install.packages("dplyr")

Alternatively you can use

install.packages("tidyverse")

if you would like to install the main packages that together make up the tidyverse. Once the package is installed you need to load it with library()

library(dplyr)

dplyr gives you the use of a series of functions (often referred to as verbs which can be used to manipulate data, but it goes beyond that in two important ways. Firstly it provides the pipe symbol %>% which we looked at in the last chapter, and secondly it allows you to store data in an extended version of a data frame called a tibble.

We’ll start by looking at some of these new functions, and then we’ll see how we can use the pipe symbol to link them together. As our example we’ll use a dataset of income, population size, CO_2_ production and fine particulate pollution (PM_2.5_ particles) for 186 countries across the world, as published by the World Bank in 2014, which we’ll load into a data frame called pollution.

pollution <- read.csv("https://raw.githubusercontent.com/rjknell/Introductory_R_3rd_Edition/master/Data/CO2_PM25.csv")

str(pollution)
'data.frame':   186 obs. of  9 variables:
 $ Country_Name     : chr  "Afghanistan" "Angola" "Albania" "Andorra" ...
 $ Country_Code     : chr  "AFG" "AGO" "ALB" "AND" ...
 $ Region           : chr  "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" "Europe & Central Asia" ...
 $ Income_group     : chr  "Low income" "Lower middle income" "Upper middle income" "High income" ...
 $ Population       : int  33370794 26941779 2889104 79213 9214175 42669500 2912403 92562 23475686 8546356 ...
 $ GNP_per_Cap      : int  630 5010 4540 NA 44330 12350 4140 12730 65150 50370 ...
 $ Population_growth: num  3.356 3.497 -0.207 -1.951 0.177 ...
 $ CO2              : num  0.294 1.29 1.979 5.833 22.94 ...
 $ PM25             : num  59 33 18.9 10.8 38 ...

Using str() to give us the structure of the data frame, you can see that there’s a lot of information here. We have 186 countries, and for each we have the population (Population), the population growth rate (Population_growth), the Gross National Product per capita (GNP_per_Cap). We have two measures of how polluting the country is: CO2 is CO_2_ production per capita, in tonnes, and PM25 is a measure of the the average exposure to PM_2.5_ particulate pollution for a person in the country. We also have two grouping variables which divide our 186 countries up into groups. The first is geographical — Region and the second, Income_group divides countries up by how rich or poor they are.

There are lots of questions we might want that would want us to manipulate this data set: what’s the average CO_2_ production for high income countries, or how does population growth differ between geographical regions. We might want to generate new variables based on the existing ones, for example we might want a grouping variable that divides nations into ones with population growth rates above and below a certain value. We might want to use more sophisticated comparisons, such as comparing high-income European and Central Asian nations with low income ones. All of these can be done easily using dplyr.

14.1 Common dplyr verbs

14.1.1 filter()

filter() is a dplyr verb which allows you to select certain rows only of a data frame (or tibble). It works a lot like the subset() function we used in the previous chapter but it is rather simpler and more straightforward. Let’s see an example. To show how this works, let’s use filter() to select only the countries in our pollution data set from the South Asian geographical area.

p1 <- filter(pollution, Region == "South Asia")

# What are the names of the countries in our new data frame?
p1$Country_Name
[1] "Afghanistan" "Bangladesh"  "Bhutan"      "India"       "Sri Lanka"  
[6] "Maldives"    "Nepal"       "Pakistan"   

Just as with subset(), the function works on a data frame and returns a data frame. We can use any Boolean logic expression so if we want to generate a data frame with only the countries with populations less than or equal to one million people we would use this:

p1 <- filter(pollution, Population <= 1000000)

str(p1)
'data.frame':   33 obs. of  9 variables:
 $ Country_Name     : chr  "Andorra" "Antigua and Barbuda" "Bahamas, The" "Belize" ...
 $ Country_Code     : chr  "AND" "ATG" "BHS" "BLZ" ...
 $ Region           : chr  "Europe & Central Asia" "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" ...
 $ Income_group     : chr  "High income" "High income" "High income" "Upper middle income" ...
 $ Population       : int  79213 92562 370633 353366 65139 284825 409769 719056 759390 518269 ...
 $ GNP_per_Cap      : int  NA 12730 27720 4230 NA 15320 43140 2530 1480 3360 ...
 $ Population_growth: num  -1.951 1.136 0.939 2.189 0.212 ...
 $ CO2              : num  5.83 5.74 6.52 1.4 8.84 ...
 $ PM25             : num  10.8 19.7 18.5 24.4 12.8 ...

This gives us a data frame with the same variables as before but with only 33 countries instead of 186.

If we wanted a data frame containing only the European and Central Asian countries with populations of less than a million

p1 <-
  filter(pollution, Population <= 1000000 &
           Region == "Europe & Central Asia")

str(p1)
'data.frame':   5 obs. of  9 variables:
 $ Country_Name     : chr  "Andorra" "Greenland" "Iceland" "Luxembourg" ...
 $ Country_Code     : chr  "AND" "GRL" "ISL" "LUX" ...
 $ Region           : chr  "Europe & Central Asia" "Europe & Central Asia" "Europe & Central Asia" "Europe & Central Asia" ...
 $ Income_group     : chr  "High income" "High income" "High income" "High income" ...
 $ Population       : int  79213 56295 327386 556319 621810
 $ GNP_per_Cap      : int  NA NA 48230 76470 7320
 $ Population_growth: num  -1.951 -0.333 1.113 2.357 0.097
 $ CO2              : num  5.83 8.99 6.06 17.36 3.56
 $ PM25             : num  10.83 12.51 6.95 10.86 21.95

Overall then filter() can be seen to be very similar to subset(). One difference is that if you give filter() two arguments which are both logical expressions it will interpret this as an &. This can make complex expressions a little more readable, so this:

p1 <-
  filter(pollution, (Population <= 1000000 | Population_growth <=0) &
           Region == "Europe & Central Asia")

gives the exact same result as this:

p1 <-
  filter(pollution, Population <= 1000000 | Population_growth <=0,
           Region == "Europe & Central Asia")

So why use filter() if it’s basically the same as subset()? Good question and indeed, subset() is a perfectly useful and valid way to generate subsets of data. filter() can be a bit faster, but the main reason to use it is that you know it will be stable and play nicely with the other dplyr verbs.

14.1.2 arrange()

This function changes the order of the rows in a data frame, so if we want our countries in ascending order of population size we can generate a new data frame like this:

p1 <- arrange(pollution, Population)

# 1st six entries of our new data frame with country name and population only
head(p1[ , c(1,5)])
      Country_Name Population
1        Greenland      56295
2 Marshall Islands      57179
3          Bermuda      65139
4         Dominica      71085
5          Andorra      79213
6       Seychelles      91359

To do the same thing but in descending order we can use desc(Population).

p1 <- arrange(pollution, desc(Population))

# 1st six entries of our new data frame with country name and population only
head(p1[ , c(1,5)])
   Country_Name Population
1         China 1364270000
2         India 1295604184
3 United States  318386421
4     Indonesia  255129004
5        Brazil  202763735
6      Pakistan  195306825

If we wanted to do something similar in base R we would need to use something like this:

p1 <- pollution[order(pollution$Population), ]

for the descending order and

p1 <- pollution[order(pollution$Population, decreasing = TRUE), ]

for the ascending order. arrange() is more straightforward and easier to understand. arrange() will also take multiple arguments for the order of the new data frame, so this will rearrange our data frame so that it is first ordered by Region and then within each region by increasing values of Population.

p1 <- arrange(pollution, Region, Population)

# 1st six rows of the Country_Name, Region and Population variables
head(p1[ , c(1, 3, 5)])
           Country_Name              Region Population
1      Marshall Islands East Asia & Pacific      57179
2                 Tonga East Asia & Pacific     101028
3 Micronesia, Fed. Sts. East Asia & Pacific     107446
4              Kiribati East Asia & Pacific     109391
5                 Samoa East Asia & Pacific     192221
6               Vanuatu East Asia & Pacific     263888

14.1.3 select()

select() lets you generate a new data frame made up of a selected subset of the variables present in your original one. This might not sound particularly thrilling — you can already do this in base R quite easily — but select() allows you to do this very easily and in lots of different ways.

As an example, let’s say that you wanted to generate a new data frame from our pollution data frame which only had the Population, CO2 and PM25 variables. You can do this in base R using the data.frame() function.

p1 <- data.frame(pollution$Population,
                 pollution$CO2, 
                 pollution$PM25)

head(p1)
  pollution.Population pollution.CO2 pollution.PM25
1             33370794       0.29395         59.010
2             26941779       1.29031         32.974
3              2889104       1.97876         18.884
4                79213       5.83291         10.830
5              9214175      22.93961         37.983
6             42669500       4.78151         14.467

Using select you don’t need to name the data frame for each variable, so it is more concise.

p1 <- select(pollution, Population, CO2, PM25)

head(p1)
  Population      CO2   PM25
1   33370794  0.29395 59.010
2   26941779  1.29031 32.974
3    2889104  1.97876 18.884
4      79213  5.83291 10.830
5    9214175 22.93961 37.983
6   42669500  4.78151 14.467

select() can also take column numbers instead of variable names.

p1 <- select(pollution, c(5,7,8))

names(p1)
[1] "Population"        "Population_growth" "CO2"              

In addition to the basic functionality we’ve already seen select() has a suite of advanced features which let you choose variables in a wide variety of different ways. A lot of these features involve the use of built in “helper functions” which you can use to make your selection. There are a lot of these but here are some of the common ones.

You can use starts_with(), ends_with() or contains() as arguments to choose variables with names that have particular character strings. This chooses all the variables beginning with “Pop.”

p1 <- select(pollution, starts_with("Pop"))
names(p1)
[1] "Population"        "Population_growth"

This example selects all the variables that contain the number “2” as part of their name.

p1 <- select(pollution, contains("2"))

names(p1)
[1] "CO2"  "PM25"

You can select a range of variables using a colon. This example will give us all the variables from Country_name on the left to Population on the right.

p1 <- select(pollution, Country_Name : Population)

names(p1)
[1] "Country_Name" "Country_Code" "Region"       "Income_group" "Population"  

You can use two logical operators: & and |, in your selection. This example selects all the variables that begin with either “Pop” or “Country.”

p1 <- select(pollution, starts_with("Pop") | starts_with("Country"))

names(p1)
[1] "Population"        "Population_growth" "Country_Name"     
[4] "Country_Code"     

You can use !() to choose all the variables apart from the ones specified. In this case we will select all the variables with names that do not begin with ’Country”.

p1 <- select(pollution, !(starts_with("Country")))

names(p1)
[1] "Region"            "Income_group"      "Population"       
[4] "GNP_per_Cap"       "Population_growth" "CO2"              
[7] "PM25"             

If you want to select all the variables that meet a certain criterion, such as being numeric, you can use the where() helper function. This example chooses all the numeric variables in our pollution data frame.

p1 <- select(pollution, where(is.numeric))

names(p1)
[1] "Population"        "GNP_per_Cap"       "Population_growth"
[4] "CO2"               "PM25"             

I’m sure you can see that this allows some extremely powerful and concise code to allow you to choose specific variables. In this era of big data it’s common to generate very large data frames with hundreds or even thousands of variables and select() really comes into its own when reducing these down to more manageable sizes.

14.1.4 mutate()

mutate() adds new variables to your data frame, and deletes variables that you don’t want. As an example, the CO2 variable in our pollution data frame is the annual per-capita CO_2_ production for each contry. If we wanted the total CO_2_ output for each country we could calculate this by multiplying CO2 by Population.

p1 <- mutate(pollution, Total_CO2 = Population * CO2)

names(p1)
 [1] "Country_Name"      "Country_Code"      "Region"           
 [4] "Income_group"      "Population"        "GNP_per_Cap"      
 [7] "Population_growth" "CO2"               "PM25"             
[10] "Total_CO2"        

If you give a new variable the same name as an existing one mutate() will overwrite the original.

p1 <- mutate(pollution, CO2 = Population * CO2)

mean(pollution$CO2)
[1] 4.3225
mean(p1$CO2)
[1] 181038316

To remove a variable you can use the variable = NULL syntax in mutate(). This example generates a new variable for total CO_2_ output and also removes the Region variable from our mutated data frame.

p1 <- mutate(pollution, Total_CO2 = Population * CO2, Region = NULL)

names(p1)
[1] "Country_Name"      "Country_Code"      "Income_group"     
[4] "Population"        "GNP_per_Cap"       "Population_growth"
[7] "CO2"               "PM25"              "Total_CO2"        

14.2 Pipelines in dplyr

we introduced pipes and the use of pipelines to make code clearer in the last chapter. One instabce when this approach really comes into its own is when we use our dplyr verbs with pipes to produce chained commands. As an example, let’s say you want to extract the data for countries from the South Asian region only, calculate the total CO_2_ production, order them in descending order of total CO_2_ production and then display only the Country names and total CO_2_ production figure. You could do this with a series of individual function calls, like this.

# Choose region
p1 <- filter(pollution, Region == "South Asia")

# Calculate total CO2 production
p2 <- mutate(p1, Total_CO2 = Population * CO2)

# Order data
p3 <- arrange(p2, desc(Total_CO2))

# Choose country name and total CO2 variables
p4 <- select(p3, Country_Name, Total_CO2)

# print data to screen
p4
  Country_Name  Total_CO2
1        India 2238377137
2     Pakistan  166298450
3   Bangladesh   73189653
4    Sri Lanka   18393672
5  Afghanistan    9809225
6        Nepal    8030730
7     Maldives    1334788
8       Bhutan    1001091

Alternatively, you could nest the functions so as to do this with a single, very complicated piece of code.


p1 <-select(arrange(mutate(filter(pollution, 
                           Region == "South Asia"), 
                           Total_CO2 = Population * CO2),
                           desc(Total_CO2)), 
                           Country_Name, Total_CO2)

This is a rather extreme example, but you can see that trying to work out what’s going on here would be rather difficult. Let’s not do that sort of thing.

Using pipes gives us a completely different looking piece of code. Here we’re using the magrittr pipe which is included in dplyr. Writing our code this way with pipes produces something almost like a sentence which you can read, hence the use of “verbs” to describe dplyr functions.

pollution %>% filter(Region == "South Asia") %>% 
  mutate(Total_CO2 = Population * CO2) %>% 
  arrange(desc(Total_CO2)) %>% 
  select(Country_Name, Total_CO2)
  Country_Name  Total_CO2
1        India 2238377137
2     Pakistan  166298450
3   Bangladesh   73189653
4    Sri Lanka   18393672
5  Afghanistan    9809225
6        Nepal    8030730
7     Maldives    1334788
8       Bhutan    1001091

Using the new base R pipe gives the exact same output.

pollution |>  filter(Region == "South Asia") |>  
  mutate(Total_CO2 = Population * CO2) |>  
  arrange(desc(Total_CO2)) |> 
  select(Country_Name, Total_CO2)
  Country_Name  Total_CO2
1        India 2238377137
2     Pakistan  166298450
3   Bangladesh   73189653
4    Sri Lanka   18393672
5  Afghanistan    9809225
6        Nepal    8030730
7     Maldives    1334788
8       Bhutan    1001091

We’ll use the base R pipes for the rest of these examples but if you prefer the magrittr ones they will generate identical outputs.

14.3 group_by() and summarise().

Once you’ve got your dplyr pipelines going, there are two more dplyr functions which multiply the power of the package again. Firstly group_by(), which does what it says on the tin: it allows you to group observations in your data frame according to the values of one or more variables in the data frame. Secondly summarise() allows you to calculate summary statistics for your data frame. If used with group_by() summarise allows you to calculate summary statistics for each group in the data frame.

As an example, let’s use our pollution data frame to calculate the mean PM_2.5_ exposure for each region.

pollution |>   
  group_by(Region) |>  
  summarise(meanPM25 = mean(PM25))
# A tibble: 7 × 2
  Region                     meanPM25
  <chr>                         <dbl>
1 East Asia & Pacific           20.6 
2 Europe & Central Asia         17.5 
3 Latin America & Caribbean     20.3 
4 Middle East & North Africa    43.2 
5 North America                  9.49
6 South Asia                    56.5 
7 Sub-Saharan Africa            32.9 

More sophisticated analyses will often work well in summarise(). Here we’re using it to give us the number of observations for each region, then calculating the correlation coefficient for the relationship between PM_2.5_ exposure and CO_2_ output per capita for each region and finally producing a p-value for each correlation coefficient.

pollution |>   group_by(Region) |> 
   summarise(
    n = n(),
    cor = cor(PM25, CO2,),
    p.value = cor.test(PM25, CO2)$p.value
)
# A tibble: 7 × 4
  Region                         n    cor p.value
  <chr>                      <int>  <dbl>   <dbl>
1 East Asia & Pacific           27 -0.122 0.545  
2 Europe & Central Asia         50 -0.420 0.00242
3 Latin America & Caribbean     32  0.157 0.391  
4 Middle East & North Africa    19  0.352 0.140  
5 North America                  3 -0.955 0.192  
6 South Asia                     8 -0.574 0.137  
7 Sub-Saharan Africa            47 -0.182 0.221  

There are a couple of things here that are worth pointing out. Firstly we’ve used the n() dplyr helper function to get the sample size for each group. We could just as easily have used length() with one of the variable names as an argument (e.g. n = length(PM25)) but n() is concise and clear. Secondly the code to extract the p-value might be a bit confusing. If you’re not familiar with statistical testing and correlations maybe just leave it for the moment (the chapter on Correlation Analysis later in the book covers the important points). If you are, we’ve used cor.test(), which carries out a correlation analysis and does a statistical test to determine whether the value of r is significantly different from zero. To get just the p-value we’ve used cor.test(PM25, CO2)$p.value. This is because cor.test() produces an output which is a list of items which are generated during the analysis, including statistic which is the t-statistic used in the hypothesis test and estimate which is the estimated correlation coefficient. Using the $p.value notation allows us to get just the p-value from this list.