Chapter 8 Choosing data: subscripts and subsetting
As we saw in the last section, R is set up to deal with data in groups (vectors, matrices and dataframes), and most of the time that you’re analysing real data you’ll be dealing with data in one of these forms. For some simpler analyses you might be happy just looking at all the data in a particular set of results, but a lot of the time you’ll end up thinking “but what if we exclude males from the analysis?” or “does the result hold if we leave out that animal that might have been sick?” or “do we still see the trade-off if we only include the plants that flowered?” All of these can be done quite easily in R, and your chief weapons for such purposes are subscripts, which we’ll deal with now, and the subset() command, which we’ll talk about once we’ve dealt with subscripts.
8.1 Subscripts
Every number in a vector can be identified by its place in the sequence, and every number in a matrix can be identified by its row and column numbers. You can use subscripts to find individual numbers or groups within data structures. They’re remarkably flexible and extremely useful.
Z <- rnorm(10, mean = 2, sd = 0.1)
This creates a vector called Z made up of 10 random numbers drawn from a normal distribution with mean 2 and standard deviation 0.1. NB: if you try to do this your numbers won’t be the same as mine, because they’re drawn randomly each time.
Z
[1] 2.092434 1.969250 1.937359 1.937720 2.058071 2.062920 2.106710 2.088391
[9] 2.062410 2.130024
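If you’d like your results to match the ones printed in this chapter, one option is to type the values in directly rather than drawing new random numbers. The sketch below does that with c(), copying the ten values from the output above. It also shows set.seed(), which makes rnorm() give the same draws every time you run it (although not the numbers shown here). The seed value 123 and the names Z.first and Z.second are arbitrary choices for illustration.

```r
# Values copied from the printed output above, so later results match the text
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)
length(Z)   # 10 numbers, as before

# Alternatively, fix the random seed so that your own draws are repeatable.
# The seed (123) is arbitrary; these numbers will differ from the ones above.
set.seed(123)
Z.first <- rnorm(10, mean = 2, sd = 0.1)
set.seed(123)
Z.second <- rnorm(10, mean = 2, sd = 0.1)
identical(Z.first, Z.second)   # TRUE: same seed, same numbers
```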
If we want to find out what the fifth number in Z is we could just count along until we get there, and when R writes out a vector it helpfully puts a number at the start of each row which tells you where you are in the sequence. In this case we have eight numbers in the first row and the first one is numbered [1], then the first number in the second row is numbered [9]. Just counting along a sequence of numbers can obviously get very unwieldy when we have larger datasets, and the potential for error by the counter is high even with each row being numbered. Fortunately we can just ask what the fifth number is by using a subscript, which at its simplest is just a number in square brackets after the name of our vector. R will go and look up whatever’s at the position in the vector that corresponds to the number and tell you what it is.
Z[5]
[1] 2.058071
Subscripts do not have to be single numbers. The subscript can be an object.
p <- c(2, 5, 7)
This sets up a new object (p) which is a vector containing three numbers. We can now use this object as a subscript to find out what the second, fifth and seventh numbers in Z are.
Z[p]
[1] 1.969250 2.058071 2.106710
The subscript can even be a function.
Z[seq(from = 1, to = 5, by = 2)]
[1] 2.092434 1.937359 2.058071
We know that seq(from = 1, to = 5, by = 2) will return the numbers 1, 3 and 5 so here we are asking what the three numbers in Z that occupy those positions are.
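Any expression that produces a vector of positions will do as a subscript. Here is a short sketch of a few more, assuming Z holds the ten values printed earlier (typed in by hand so the block runs on its own).

```r
# Z typed in from the output shown earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

Z[1:3]                                     # the colon operator: positions 1, 2 and 3
Z[length(Z)]                               # the last number, however long Z is
Z[seq(from = 2, to = length(Z), by = 2)]   # every second number, starting at position 2
```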
The subscript can also ask for the numbers in the vector excluding those specified in the subscript. This is particularly useful if you have some sort of dodgy data point that you want to exclude.
Z[-2]
[1] 2.092434 1.937359 1.937720 2.058071 2.062920 2.106710 2.088391 2.062410
[9] 2.130024
This gives us all the numbers in Z except for the second one.
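The same idea extends to leaving out more than one position. The sketch below assumes Z holds the values printed earlier; the object name dodgy is just a made-up example.

```r
# Z typed in from the output shown earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

Z[-c(2, 5)]        # everything except the second and fifth numbers

dodgy <- c(2, 5)   # hypothetical positions of suspect data points
Z[-dodgy]          # the same result, using an object as the subscript

# Positive and negative positions can't be mixed in one subscript,
# so something like Z[c(-2, 5)] gives an error rather than a result.
```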
We can include logical expressions in our subscript.
Z[Z>1.95]
[1] 2.092434 1.969250 2.058071 2.062920 2.106710 2.088391 2.062410 2.130024
This returns all the numbers in Z that are greater than 1.95.
Z[Z<=2]
[1] 1.969250 1.937359 1.937720
This gives us all the numbers in Z that are less than or equal to 2.
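It can help to look at what the logical expression produces on its own. As a rough sketch (again assuming Z holds the values printed earlier, and with keep as a made-up object name), the comparison returns one TRUE or FALSE per number, and the subscript keeps only the positions marked TRUE.

```r
# Z typed in from the output shown earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

Z <= 2           # a logical vector: TRUE at positions 2, 3 and 4, FALSE elsewhere

keep <- Z <= 2   # store the logical vector in an object
Z[keep]          # gives the same numbers as Z[Z<=2]
```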
You can use subscripts to find out useful things about your data. If you want to know how many numbers in Z are less than or equal to 2 you can combine some subscripting with the length() command.
length(Z[Z<=2])
[1] 3
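Another way to get the same count, sketched here with Z typed in from the earlier output: R treats TRUE as 1 and FALSE as 0 in arithmetic, so sum() applied to a comparison counts how many numbers meet the condition.

```r
# Z typed in from the output shown earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

sum(Z <= 2)    # counts the TRUEs: the same answer as length(Z[Z<=2])
mean(Z <= 2)   # the proportion of numbers in Z that are <= 2
```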
You can calculate other statistics as well. If you want to know the arithmetic mean of the numbers in Z that are less than or equal to 2 you can use a subscript.
mean(Z[Z<=2])
[1] 1.948109
This approach will work with just about any function. To find out the standard deviation of the same set of numbers use this:
sd(Z[Z<=2])
[1] 0.01830887
and this gives the sum of the numbers in Z that are less than or equal to 2.
sum(Z[Z<=2])
[1] 5.844328
One thing to notice is that using subscripts gives you the values of the numbers that correspond to the criterion [10] you put in the square brackets but doesn’t tell you where in the sequence they are. To do that we can use the function which(). To find out which numbers in Z are less than or equal to 2:
which(Z<=2)
[1] 2 3 4
If we wanted to, we could then use these numbers in a subscript. Here I’m setting up an object that’s a vector of these three numbers.
less.than.2 <- which(Z<=2)
Now I can use this object as a subscript itself.
Z[less.than.2]
[1] 1.969250 1.937359 1.937720
The circle is complete. There is actually a serious point to this last part. There are often several different ways of doing the same thing in R. It is often the case that there’s an obvious “best way,” but that isn’t always the case: sometimes one way of doing something isn’t noticeably easier or better than another, or sometimes doing something one way is better in one situation and doing it another way is better in a different situation. If someone else is doing something differently to you it doesn’t necessarily mean that you are wrong: just check what they’re doing and have a quick think about which method is better for what you’re trying to do. If the answer is “my method,” or if it’s “I can’t see any benefit to using the other method” then stick with what you’re doing.
8.2 Boolean logic and more complex subscripting
We’ve already seen that we can use logical operators such as > or <= in subscripts. We can also combine these to be selective in multiple ways by using operators like & (and) and | (or). If, for example, we wanted to extract the values in Z which are > 1.95 but < 2.1 we could do it like this:
Z[Z > 1.95 & Z < 2.1]
[1] 2.092434 1.969250 2.058071 2.062920 2.088391 2.062410
Notice that you have to give the name of the variable each time you put in a logical operator. If you just try this:
Z[Z > 1.95 & < 2.1]
Error: unexpected '<' in "Z[Z > 1.95 & <"
it doesn’t work.
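For completeness, here is a small sketch of the | (or) operator, plus ! (not), which reverses a logical expression. It assumes Z holds the values printed earlier in the chapter.

```r
# Z typed in from the output shown earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

Z[Z < 1.95 | Z > 2.1]      # numbers that are below 1.95 OR above 2.1

# ! means "not": this keeps everything that is NOT strictly between 1.95 and 2.1.
# No value in Z is exactly 1.95 or 2.1, so here it matches the line above.
Z[!(Z > 1.95 & Z < 2.1)]
```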
8.3 Subscripts in matrices and data frames
Subscripts can also be used to get individual numbers, rows or columns from matrices and data frames in the same way as for vectors, except two numbers are needed to identify an individual cell in these two dimensional data structures. The first number is the row number and the second is the column number. Here’s another matrix.
mat4 <- matrix(data=seq(101, 112), nrow=3, ncol=4)
mat4
     [,1] [,2] [,3] [,4]
[1,]  101  104  107  110
[2,]  102  105  108  111
[3,]  103  106  109  112
To ask “What’s the number that’s in the third row and second column of mat4?” we put the row number first in the subscript, then a comma, then the column number. NB I always have to stop and think about this because I always think it should be column number then row number to make it like xy coordinates.
mat4[3, 2]
[1] 106
What are the numbers in the third row that are in the second, third and fourth columns?
mat4[3, c(2, 3, 4)]
[1] 106 109 112
To get hold of everything in a particular row, just put the row number followed by a comma and don’t put in a number for the column. For example, if you just want the first row of the matrix use a 1.
mat4[1, ]
[1] 101 104 107 110
Likewise, if you want to get hold of a whole column then leave the row number empty.
mat4[, 3]
[1] 107 108 109
This gives us the third column of the matrix.
mat4[, 1]+mat4[, 3]
[1] 208 210 212
This adds the first column of the matrix to the third column.
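The other kinds of subscript we used with vectors (negative numbers, sequences and logical expressions) also work on the rows and columns of a matrix. A brief sketch, rebuilding mat4 so that the block runs on its own:

```r
mat4 <- matrix(data = seq(101, 112), nrow = 3, ncol = 4)

mat4[-2, ]                # all rows except the second
mat4[2:3, c(1, 4)]        # rows 2 and 3 of columns 1 and 4
mat4[mat4[, 1] > 101, ]   # only the rows where the first column is greater than 101
mat4[mat4 > 110]          # with no comma, a logical subscript returns a plain vector
```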
8.4 Subset
The subset() function is useful when you want to extract part of a matrix or dataframe. It takes three main arguments, the first being the name of whatever you want to make a subset of, the second is a logical expression and the third tells R which columns you want to choose. It’s best to show this with an example. Here’s some data that were collected as part of an experiment looking at the effect of environmental temperature on leucocyte count in fish fry.
fish <- read.csv("Data/Counts.csv", header = TRUE)
Let’s look at the whole dataset to start with.
fish
  Sex Temp Weight Count
1   M  Hot  73.25   282
2   M  Hot  69.28   170
3   F  Hot  81.38   151
4   M  Hot  66.07   238
5   F Cold  83.32   136
6   F Cold  63.06   203
7   M Cold  78.48   312
8   M Cold  55.38   274
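The file Data/Counts.csv isn’t reproduced here, so if you don’t have a copy, one way to follow along is to build the data frame by hand from the values printed above. This is only a sketch: the real file may store the columns slightly differently (as factors, say), but it should behave the same way for the examples in this section.

```r
# Data typed in from the printed table above
fish <- data.frame(
  Sex    = c("M", "M", "F", "M", "F", "F", "M", "M"),
  Temp   = c("Hot", "Hot", "Hot", "Hot", "Cold", "Cold", "Cold", "Cold"),
  Weight = c(73.25, 69.28, 81.38, 66.07, 83.32, 63.06, 78.48, 55.38),
  Count  = c(282, 170, 151, 238, 136, 203, 312, 274)
)

str(fish)   # check the structure: 8 rows (one per fish) and 4 columns
```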
If we wanted to set up a second data frame containing only data from those fish that weighed 70mg or more, we could just specify the first two arguments.
fish2 <- subset(fish, Weight>=70)
fish2
  Sex Temp Weight Count
1   M  Hot  73.25   282
3   F  Hot  81.38   151
5   F Cold  83.32   136
7   M Cold  78.48   312
What if we wanted to extract only the data on weights and leucocyte counts for male fish? For this we use the third argument as well, “select.”
fish3 <- subset(fish, Sex=="M", select=c(Weight, Count))
fish3
  Weight Count
1  73.25   282
2  69.28   170
4  66.07   238
7  78.48   312
8  55.38   274
One thing to notice here is that when we are specifying male fish only in the second argument we use the double equals sign (==). This is what’s used in R when we’re using logical expressions. The “M” is in inverted commas because it’s character data. It’s easy to forget and use a single equals sign, or miss out the inverted commas. If you do the latter you’ll get an error message.
fish4 <- subset(fish, Sex==M, select=c(Weight, Count))
Error in eval(e, x, parent.frame()) : object 'M' not found
If you only put a single equals sign in, however, you won’t get an error message. R will ignore the logical expression but it will select the columns specified and your new object will have data from both male and female fish. This could lead to serious errors in your analysis, so always check.
fish4 <- subset(fish, Sex="M", select=c(Weight, Count))
See? No error message, but when you look at the output from this command you find that it hasn’t been executed in the way you might wish.
fish4
  Weight Count
1  73.25   282
2  69.28   170
3  81.38   151
4  66.07   238
5  83.32   136
6  63.06   203
7  78.48   312
8  55.38   274
subset() can also be used within other functions: if, for example, you only want to analyse part of a dataset but you don’t want to set up a whole new object. We’ll see some examples of this when we look at statistical model fitting in more detail.
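As a rough illustration of both points (that there’s usually more than one way to do something, and that subset() can sit inside another function), the sketch below selects the male fish using ordinary subscripts and then calculates a mean directly from a subset. The fish data frame is typed in from the table shown earlier; choosing mean() and the hot-water fish is just an example of mine.

```r
# Data typed in from the table shown earlier in this section
fish <- data.frame(
  Sex    = c("M", "M", "F", "M", "F", "F", "M", "M"),
  Temp   = c("Hot", "Hot", "Hot", "Hot", "Cold", "Cold", "Cold", "Cold"),
  Weight = c(73.25, 69.28, 81.38, 66.07, 83.32, 63.06, 78.48, 55.38),
  Count  = c(282, 170, 151, 238, 136, 203, 312, 274)
)

# Subscript version of subset(fish, Sex=="M", select=c(Weight, Count)):
# a logical expression picks the rows, a vector of names picks the columns
fish[fish$Sex == "M", c("Weight", "Count")]

# subset() used inside another function, without creating a new object:
# the mean leucocyte count for the fish kept at the hot temperature
mean(subset(fish, Temp == "Hot")$Count)
```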
8.5 Exercises
NB These exercises are now available as interactive learnr tutorials: contact the author for a copy
8.5.1 Subscripts and vectors
- Create a vector called x1 containing the numbers 3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4 and 4.7.
- Use a subscript to find out the value of the 3rd number in vector x1
- Use a subscript to find out the value of the numbers in vector x1 that aren’t in the 5th position
- Add the 1st number in vector x1 to the 6th number in vector x1
- Create a new vector called “In” which consists of the numbers 1 and 4
- Use subscripts and the “In” vector to calculate the sum of the 1st and 4th numbers in x1
- Calculate the sum of all the numbers in x1 that are less than 4.6
- Calculate the mean of all the numbers in x1 that are greater than or equal to 5
8.5.2 Subscripts and matrices
- Generate a matrix called mat1 with 3 rows and 3 columns, using the data from the x1 vector as above. Use the default options for the matrix() function so that the matrix is filled by column.
- Multiply the second value in the first row of mat1 by the third value in the second row of mat1
- Create a new vector called “V2” which consists of the numbers in the first row of mat1 added to the numbers in the second row of mat1
- Create a new vector called “V3” which consists of the numbers in the second column of mat1 multiplied by the mean of the numbers in the second row of mat1
- Create a new matrix called “mat3” which consists of the first row of mat1 as the first column and then the third row of mat1 as the second column. Don’t forget that you have to give matrix() a vector of data to fill the new matrix with so you’ll have to use the c() function to generate a new vector from the first and third rows of mat1. You can either do this first and create a new object or you can do it within the matrix() function call.
8.6 Answers to exercises
8.6.1 Subscripts and vectors
- Create a vector called x1 containing the numbers 3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4 and 4.7.
x1 <- c(3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4, 4.7)
- Use a subscript to find out the value of the 3rd number in vector x1
x1[3]
[1] 5.6
- Use a subscript to find out the value of the numbers in vector x1 that aren’t in the 5th position
x1[-5]
[1] 3.6 3.2 5.6 4.9 3.7 5.5 4.4 4.7
- Add the 1st number in vector x1 to the 6th number in vector x1
#Lots of options for this. Either:
x1[1]+x1[6]
[1] 7.3

#or use the sum() function
sum(x1[1], x1[6])
[1] 7.3

#or even do it with a new vector that you generate within the subscript
sum(x1[c(1,6)])
[1] 7.3
- Create a new vector called “In” which consists of the numbers 1 and 4
In <- c(1,4)
- Use subscripts and the “In” vector to calculate the sum of the 1st and 4th numbers in x1
sum(x1[In])
[1] 8.5
- Calculate the sum of all the numbers in x1 that are less than 4.6
sum(x1[x1<4.6])
[1] 14.9
- Calculate the mean of all the numbers in x1 that are greater than or equal to 5
mean(x1[x1>=5])
[1] 5.7
8.6.2 Subscripts and matrices
- Generate a matrix called mat1 with 3 rows and 3 columns, using the data from the x1 vector as above. Use the default options for the matrix() function so that the matrix is filled by column.
mat1 <- matrix(data = x1, nrow = 3, ncol = 3)
- Multiply the second value in the first row of mat1 by the third value in the second row of mat1
mat1[1,2] * mat1[2,3]
[1] 21.56
- Create a new vector called “V2” which consists of the numbers in the first row of mat1 added to the numbers in the second row of mat1
V2 <- mat1[1, ] + mat1[2, ]
V2
[1]  6.8 10.9  9.9
- Create a new vector called “V3” which consists of the numbers in the second column of mat1 multiplied by the mean of the numbers in the second row of mat1
V3 <- mat1[, 2] * mean(mat1[2, ])
V3
[1] 22.21333 27.20000 16.77333
- Create a new matrix called “mat3” which consists of the first row of mat1 as the first column and then the third row of mat1 as the second column
# Either generate the new vector of data separately:
v3 <- c(mat1[1, ], mat1[3, ])
mat3 <- matrix(v3, ncol = 2)

# or do it all in one go within the matrix function.
# This is cleaner because you don't end up with a
# superfluous object (v3) sitting around in your workspace
mat3 <- matrix(c(mat1[1, ], mat1[3, ]), ncol=2)
mat3
     [,1] [,2]
[1,]  3.6  5.6
[2,]  4.9  3.7
[3,]  5.5  4.7
[10] NB for students: this word is the singular of “criteria.”