Chapter 14 Using dplyr to manipulate data
You can get a lot done using the options for selecting data with subscripts etc. which we looked at earlier. For more sophisticated data handling, however, you are likely to want to use the dplyr package. This is one of the most popular “add on” packages for R, written by Hadley Wickham and part of the “tidyverse” group of packages. Together these give a great deal of extra functionality to the language and make many processes a lot more consistent and straightforward. dplyr
is a complex and powerful package, but we will run through the basics of how it works here and hopefully you’ll be able to get an idea of the basics.
To install dplyr simply run
install.packages("dplyr")
Alternatively you can use
install.packages("tidyverse")
if you would like to install the main packages that together make up the tidyverse. Once the package is installed you need to load it with library()
library(dplyr)
dplyr
gives you the use of a series of functions (often referred to as verbs which can be used to manipulate data, but it goes beyond that in two important ways. Firstly it provides the pipe symbol %>%
which we looked at in the last chapter, and secondly it allows you to store data in an extended version of a data frame called a tibble.
We’ll start by looking at some of these new functions, and then we’ll see how we can use the pipe symbol to link them together. As our example we’ll use a dataset of income, population size, CO_2_ production and fine particulate pollution (PM_2.5_ particles) for 186 countries across the world, as published by the World Bank in 2014, which we’ll load into a data frame called pollution
.
<- read.csv("https://raw.githubusercontent.com/rjknell/Introductory_R_3rd_Edition/master/Data/CO2_PM25.csv")
pollution
str(pollution)
'data.frame': 186 obs. of 9 variables:
$ Country_Name : chr "Afghanistan" "Angola" "Albania" "Andorra" ...
$ Country_Code : chr "AFG" "AGO" "ALB" "AND" ...
$ Region : chr "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" "Europe & Central Asia" ...
$ Income_group : chr "Low income" "Lower middle income" "Upper middle income" "High income" ...
$ Population : int 33370794 26941779 2889104 79213 9214175 42669500 2912403 92562 23475686 8546356 ...
$ GNP_per_Cap : int 630 5010 4540 NA 44330 12350 4140 12730 65150 50370 ...
$ Population_growth: num 3.356 3.497 -0.207 -1.951 0.177 ...
$ CO2 : num 0.294 1.29 1.979 5.833 22.94 ...
$ PM25 : num 59 33 18.9 10.8 38 ...
Using str()
to give us the structure of the data frame, you can see that there’s a lot of information here. We have 186 countries, and for each we have the population (Population
), the population growth rate (Population_growth
), the Gross National Product per capita (GNP_per_Cap
). We have two measures of how polluting the country is: CO2
is CO_2_ production per capita, in tonnes, and PM25 is a measure of the the average exposure to PM_2.5_ particulate pollution for a person in the country. We also have two grouping variables which divide our 186 countries up into groups. The first is geographical — Region
and the second, Income_group
divides countries up by how rich or poor they are.
There are lots of questions we might want that would want us to manipulate this data set: what’s the average CO_2_ production for high income countries, or how does population growth differ between geographical regions. We might want to generate new variables based on the existing ones, for example we might want a grouping variable that divides nations into ones with population growth rates above and below a certain value. We might want to use more sophisticated comparisons, such as comparing high-income European and Central Asian nations with low income ones. All of these can be done easily using dplyr.
14.1 Common dplyr verbs
14.1.1 filter()
filter()
is a dplyr verb which allows you to select certain rows only of a data frame (or tibble). It works a lot like the subset()
function we used in the previous chapter but it is rather simpler and more straightforward. Let’s see an example. To show how this works, let’s use filter()
to select only the countries in our pollution
data set from the South Asian geographical area.
<- filter(pollution, Region == "South Asia")
p1
# What are the names of the countries in our new data frame?
$Country_Name
p11] "Afghanistan" "Bangladesh" "Bhutan" "India" "Sri Lanka"
[6] "Maldives" "Nepal" "Pakistan" [
Just as with subset()
, the function works on a data frame and returns a data frame. We can use any Boolean logic expression so if we want to generate a data frame with only the countries with populations less than or equal to one million people we would use this:
<- filter(pollution, Population <= 1000000)
p1
str(p1)
'data.frame': 33 obs. of 9 variables:
$ Country_Name : chr "Andorra" "Antigua and Barbuda" "Bahamas, The" "Belize" ...
$ Country_Code : chr "AND" "ATG" "BHS" "BLZ" ...
$ Region : chr "Europe & Central Asia" "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" ...
$ Income_group : chr "High income" "High income" "High income" "Upper middle income" ...
$ Population : int 79213 92562 370633 353366 65139 284825 409769 719056 759390 518269 ...
$ GNP_per_Cap : int NA 12730 27720 4230 NA 15320 43140 2530 1480 3360 ...
$ Population_growth: num -1.951 1.136 0.939 2.189 0.212 ...
$ CO2 : num 5.83 5.74 6.52 1.4 8.84 ...
$ PM25 : num 10.8 19.7 18.5 24.4 12.8 ...
This gives us a data frame with the same variables as before but with only 33 countries instead of 186.
If we wanted a data frame containing only the European and Central Asian countries with populations of less than a million
<-
p1 filter(pollution, Population <= 1000000 &
== "Europe & Central Asia")
Region
str(p1)
'data.frame': 5 obs. of 9 variables:
$ Country_Name : chr "Andorra" "Greenland" "Iceland" "Luxembourg" ...
$ Country_Code : chr "AND" "GRL" "ISL" "LUX" ...
$ Region : chr "Europe & Central Asia" "Europe & Central Asia" "Europe & Central Asia" "Europe & Central Asia" ...
$ Income_group : chr "High income" "High income" "High income" "High income" ...
$ Population : int 79213 56295 327386 556319 621810
$ GNP_per_Cap : int NA NA 48230 76470 7320
$ Population_growth: num -1.951 -0.333 1.113 2.357 0.097
$ CO2 : num 5.83 8.99 6.06 17.36 3.56
$ PM25 : num 10.83 12.51 6.95 10.86 21.95
Overall then filter()
can be seen to be very similar to subset()
. One difference is that if you give filter()
two arguments which are both logical expressions it will interpret this as an &
. This can make complex expressions a little more readable, so this:
<-
p1 filter(pollution, (Population <= 1000000 | Population_growth <=0) &
== "Europe & Central Asia") Region
gives the exact same result as this:
<-
p1 filter(pollution, Population <= 1000000 | Population_growth <=0,
== "Europe & Central Asia") Region
So why use filter()
if it’s basically the same as subset()
? Good question and indeed, subset()
is a perfectly useful and valid way to generate subsets of data. filter()
can be a bit faster, but the main reason to use it is that you know it will be stable and play nicely with the other dplyr verbs.
14.1.2 arrange()
This function changes the order of the rows in a data frame, so if we want our countries in ascending order of population size we can generate a new data frame like this:
<- arrange(pollution, Population)
p1
# 1st six entries of our new data frame with country name and population only
head(p1[ , c(1,5)])
Country_Name Population1 Greenland 56295
2 Marshall Islands 57179
3 Bermuda 65139
4 Dominica 71085
5 Andorra 79213
6 Seychelles 91359
To do the same thing but in descending order we can use desc(Population)
.
<- arrange(pollution, desc(Population))
p1
# 1st six entries of our new data frame with country name and population only
head(p1[ , c(1,5)])
Country_Name Population1 China 1364270000
2 India 1295604184
3 United States 318386421
4 Indonesia 255129004
5 Brazil 202763735
6 Pakistan 195306825
If we wanted to do something similar in base R we would need to use something like this:
<- pollution[order(pollution$Population), ] p1
for the descending order and
<- pollution[order(pollution$Population, decreasing = TRUE), ] p1
for the ascending order. arrange()
is more straightforward and easier to understand. arrange()
will also take multiple arguments for the order of the new data frame, so this will rearrange our data frame so that it is first ordered by Region
and then within each region by increasing values of Population
.
<- arrange(pollution, Region, Population)
p1
# 1st six rows of the Country_Name, Region and Population variables
head(p1[ , c(1, 3, 5)])
Country_Name Region Population1 Marshall Islands East Asia & Pacific 57179
2 Tonga East Asia & Pacific 101028
3 Micronesia, Fed. Sts. East Asia & Pacific 107446
4 Kiribati East Asia & Pacific 109391
5 Samoa East Asia & Pacific 192221
6 Vanuatu East Asia & Pacific 263888
14.1.3 select()
select()
lets you generate a new data frame made up of a selected subset of the variables present in your original one. This might not sound particularly thrilling — you can already do this in base R quite easily — but select()
allows you to do this very easily and in lots of different ways.
As an example, let’s say that you wanted to generate a new data frame from our pollution
data frame which only had the Population
, CO2
and PM25
variables. You can do this in base R using the data.frame()
function.
<- data.frame(pollution$Population,
p1 $CO2,
pollution$PM25)
pollution
head(p1)
pollution.Population pollution.CO2 pollution.PM251 33370794 0.29395 59.010
2 26941779 1.29031 32.974
3 2889104 1.97876 18.884
4 79213 5.83291 10.830
5 9214175 22.93961 37.983
6 42669500 4.78151 14.467
Using select you don’t need to name the data frame for each variable, so it is more concise.
<- select(pollution, Population, CO2, PM25)
p1
head(p1)
Population CO2 PM251 33370794 0.29395 59.010
2 26941779 1.29031 32.974
3 2889104 1.97876 18.884
4 79213 5.83291 10.830
5 9214175 22.93961 37.983
6 42669500 4.78151 14.467
select()
can also take column numbers instead of variable names.
<- select(pollution, c(5,7,8))
p1
names(p1)
1] "Population" "Population_growth" "CO2" [
In addition to the basic functionality we’ve already seen select()
has a suite of advanced features which let you choose variables in a wide variety of different ways. A lot of these features involve the use of built in “helper functions” which you can use to make your selection. There are a lot of these but here are some of the common ones.
You can use starts_with()
, ends_with()
or contains()
as arguments to choose variables with names that have particular character strings. This chooses all the variables beginning with “Pop.”
<- select(pollution, starts_with("Pop"))
p1 names(p1)
1] "Population" "Population_growth" [
This example selects all the variables that contain the number “2” as part of their name.
<- select(pollution, contains("2"))
p1
names(p1)
1] "CO2" "PM25" [
You can select a range of variables using a colon. This example will give us all the variables from Country_name
on the left to Population
on the right.
<- select(pollution, Country_Name : Population)
p1
names(p1)
1] "Country_Name" "Country_Code" "Region" "Income_group" "Population" [
You can use two logical operators: &
and |
, in your selection. This example selects all the variables that begin with either “Pop” or “Country.”
<- select(pollution, starts_with("Pop") | starts_with("Country"))
p1
names(p1)
1] "Population" "Population_growth" "Country_Name"
[4] "Country_Code" [
You can use !()
to choose all the variables apart from the ones specified. In this case we will select all the variables with names that do not begin with ’Country”.
<- select(pollution, !(starts_with("Country")))
p1
names(p1)
1] "Region" "Income_group" "Population"
[4] "GNP_per_Cap" "Population_growth" "CO2"
[7] "PM25" [
If you want to select all the variables that meet a certain criterion, such as being numeric, you can use the where()
helper function. This example chooses all the numeric variables in our pollution
data frame.
<- select(pollution, where(is.numeric))
p1
names(p1)
1] "Population" "GNP_per_Cap" "Population_growth"
[4] "CO2" "PM25" [
I’m sure you can see that this allows some extremely powerful and concise code to allow you to choose specific variables. In this era of big data it’s common to generate very large data frames with hundreds or even thousands of variables and select()
really comes into its own when reducing these down to more manageable sizes.
14.1.4 mutate()
mutate()
adds new variables to your data frame, and deletes variables that you don’t want. As an example, the CO2
variable in our pollution
data frame is the annual per-capita CO_2_ production for each contry. If we wanted the total CO_2_ output for each country we could calculate this by multiplying CO2
by Population
.
<- mutate(pollution, Total_CO2 = Population * CO2)
p1
names(p1)
1] "Country_Name" "Country_Code" "Region"
[4] "Income_group" "Population" "GNP_per_Cap"
[7] "Population_growth" "CO2" "PM25"
[10] "Total_CO2" [
If you give a new variable the same name as an existing one mutate()
will overwrite the original.
<- mutate(pollution, CO2 = Population * CO2)
p1
mean(pollution$CO2)
1] 4.3225
[mean(p1$CO2)
1] 181038316 [
To remove a variable you can use the variable = NULL
syntax in mutate()
. This example generates a new variable for total CO_2_ output and also removes the Region
variable from our mutated data frame.
<- mutate(pollution, Total_CO2 = Population * CO2, Region = NULL)
p1
names(p1)
1] "Country_Name" "Country_Code" "Income_group"
[4] "Population" "GNP_per_Cap" "Population_growth"
[7] "CO2" "PM25" "Total_CO2" [
14.2 Pipelines in dplyr
we introduced pipes and the use of pipelines to make code clearer in the last chapter. One instabce when this approach really comes into its own is when we use our dplyr verbs with pipes to produce chained commands. As an example, let’s say you want to extract the data for countries from the South Asian region only, calculate the total CO_2_ production, order them in descending order of total CO_2_ production and then display only the Country names and total CO_2_ production figure. You could do this with a series of individual function calls, like this.
# Choose region
<- filter(pollution, Region == "South Asia")
p1
# Calculate total CO2 production
<- mutate(p1, Total_CO2 = Population * CO2)
p2
# Order data
<- arrange(p2, desc(Total_CO2))
p3
# Choose country name and total CO2 variables
<- select(p3, Country_Name, Total_CO2)
p4
# print data to screen
p4
Country_Name Total_CO21 India 2238377137
2 Pakistan 166298450
3 Bangladesh 73189653
4 Sri Lanka 18393672
5 Afghanistan 9809225
6 Nepal 8030730
7 Maldives 1334788
8 Bhutan 1001091
Alternatively, you could nest the functions so as to do this with a single, very complicated piece of code.
<-select(arrange(mutate(filter(pollution,
p1 == "South Asia"),
Region Total_CO2 = Population * CO2),
desc(Total_CO2)),
Country_Name, Total_CO2)
This is a rather extreme example, but you can see that trying to work out what’s going on here would be rather difficult. Let’s not do that sort of thing.
Using pipes gives us a completely different looking piece of code. Here we’re using the magrittr
pipe which is included in dplyr
. Writing our code this way with pipes produces something almost like a sentence which you can read, hence the use of “verbs” to describe dplyr functions.
%>% filter(Region == "South Asia") %>%
pollution mutate(Total_CO2 = Population * CO2) %>%
arrange(desc(Total_CO2)) %>%
select(Country_Name, Total_CO2)
Country_Name Total_CO21 India 2238377137
2 Pakistan 166298450
3 Bangladesh 73189653
4 Sri Lanka 18393672
5 Afghanistan 9809225
6 Nepal 8030730
7 Maldives 1334788
8 Bhutan 1001091
Using the new base R pipe gives the exact same output.
|> filter(Region == "South Asia") |>
pollution mutate(Total_CO2 = Population * CO2) |>
arrange(desc(Total_CO2)) |>
select(Country_Name, Total_CO2)
Country_Name Total_CO21 India 2238377137
2 Pakistan 166298450
3 Bangladesh 73189653
4 Sri Lanka 18393672
5 Afghanistan 9809225
6 Nepal 8030730
7 Maldives 1334788
8 Bhutan 1001091
We’ll use the base R pipes for the rest of these examples but if you prefer the magrittr
ones they will generate identical outputs.
14.3 group_by()
and summarise()
.
Once you’ve got your dplyr pipelines going, there are two more dplyr functions which multiply the power of the package again. Firstly group_by()
, which does what it says on the tin: it allows you to group observations in your data frame according to the values of one or more variables in the data frame. Secondly summarise()
allows you to calculate summary statistics for your data frame. If used with group_by()
summarise allows you to calculate summary statistics for each group in the data frame.
As an example, let’s use our pollution
data frame to calculate the mean PM_2.5_ exposure for each region.
|>
pollution group_by(Region) |>
summarise(meanPM25 = mean(PM25))
# A tibble: 7 × 2
Region meanPM25<chr> <dbl>
1 East Asia & Pacific 20.6
2 Europe & Central Asia 17.5
3 Latin America & Caribbean 20.3
4 Middle East & North Africa 43.2
5 North America 9.49
6 South Asia 56.5
7 Sub-Saharan Africa 32.9
More sophisticated analyses will often work well in summarise()
. Here we’re using it to give us the number of observations for each region, then calculating the correlation coefficient for the relationship between PM_2.5_ exposure and CO_2_ output per capita for each region and finally producing a p-value for each correlation coefficient.
|> group_by(Region) |>
pollution summarise(
n = n(),
cor = cor(PM25, CO2,),
p.value = cor.test(PM25, CO2)$p.value
)# A tibble: 7 × 4
Region n cor p.value<chr> <int> <dbl> <dbl>
1 East Asia & Pacific 27 -0.122 0.545
2 Europe & Central Asia 50 -0.420 0.00242
3 Latin America & Caribbean 32 0.157 0.391
4 Middle East & North Africa 19 0.352 0.140
5 North America 3 -0.955 0.192
6 South Asia 8 -0.574 0.137
7 Sub-Saharan Africa 47 -0.182 0.221
There are a couple of things here that are worth pointing out. Firstly we’ve used the n()
dplyr helper function to get the sample size for each group. We could just as easily have used length()
with one of the variable names as an argument (e.g. n = length(PM25)
) but n()
is concise and clear. Secondly the code to extract the p-value might be a bit confusing. If you’re not familiar with statistical testing and correlations maybe just leave it for the moment (the chapter on Correlation Analysis later in the book covers the important points). If you are, we’ve used cor.test()
, which carries out a correlation analysis and does a statistical test to determine whether the value of r is significantly different from zero. To get just the p-value we’ve used cor.test(PM25, CO2)$p.value
. This is because cor.test()
produces an output which is a list of items which are generated during the analysis, including statistic
which is the t-statistic used in the hypothesis test and estimate
which is the estimated correlation coefficient. Using the $p.value
notation allows us to get just the p-value from this list.