Chapter 8 Choosing data: subscripts and subsetting

As we saw in the last section, R is set up to deal with data in groups (vectors, matrices and dataframes), and most of the time that you’re analysing real data you’ll be dealing with data in one of these forms. For some simpler analyses you might be happy just looking at all the data in a particular set of results, but a lot of the time you’ll end up thinking “but what if we exclude males from the analysis?” or “does the result hold if we leave out that animal that might have been sick?” or “do we still see the trade-off if we only include the plants that flowered?” All of these can be done quite easily in R, and your chief weapons for such purposes are subscripts, which we’ll deal with now, and the subset() command, which we’ll talk about once we’ve dealt with subscripts.

8.1 Subscripts

Every number in a vector can be identified by its place in the sequence, and every number in a matrix can be identified by its row and column numbers. You can use subscripts to find individual numbers or groups within data structures. They’re remarkably flexible and extremely useful.

Z <- rnorm(10, mean = 2, sd = 0.1)

This creates a vector called Z made up of 10 random numbers drawn from a normal distribution with mean 2 and standard deviation 0.1. NB: if you try to do this your numbers won’t be the same as mine, because they’re drawn randomly each time.
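If you do want to get the same random numbers every time (so that a colleague can reproduce your results exactly, for example), one option is to set the seed of R's random number generator with set.seed() before drawing the numbers. This is only a sketch: the seed value 42 and the object names Z1 and Z2 are arbitrary choices made for illustration.

```r
# set.seed() fixes the starting point of R's random number generator,
# so the same "random" numbers are drawn each time the code is run.
set.seed(42)                          # 42 is an arbitrary seed
Z1 <- rnorm(10, mean = 2, sd = 0.1)

set.seed(42)                          # resetting to the same seed ...
Z2 <- rnorm(10, mean = 2, sd = 0.1)

identical(Z1, Z2)                     # ... gives exactly the same ten numbers
# [1] TRUE
```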

Z
 [1] 2.092434 1.969250 1.937359 1.937720 2.058071 2.062920 2.106710 2.088391
 [9] 2.062410 2.130024

If we want to find out what the fifth number in Z is we could just count along until we get there, and when R writes out a vector it helpfully puts a number at the start of each row which tells you where you are in the sequence. In this case we have eight numbers in the first row and the first one is numbered [1], then the first number in the second row is numbered [9]. Just counting along a sequence of numbers can obviously get very unwieldy when we have larger datasets, and the potential for error by the counter is high even with each row being numbered. Fortunately we can just ask what the fifth number is by using a subscript, which at its simplest is just a number in square brackets after the name of our vector. R will go and look up whatever’s at the position in the vector that corresponds to the number and tell you what it is.

Z[5]
[1] 2.058071

Subscripts do not have to be single numbers. The subscript can be an object.

p <- c(2, 5, 7)

This sets up a new object (p) which is a vector containing three numbers. We can now use this object as a subscript to find out what the second, fifth and seventh numbers in Z are.

Z[p]
[1] 1.969250 2.058071 2.106710

The subscript can even be a call to a function, so long as the function returns a set of positions.

Z[seq(from = 1, to = 5, by = 2)]
[1] 2.092434 1.937359 2.058071

We know that seq(from = 1, to = 5, by = 2) will return the numbers 1, 3 and 5 so here we are asking what the three numbers in Z that occupy those positions are.
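For a run of consecutive positions the colon operator is a common shorthand. The sketch below types in the values of Z printed earlier so that it can be run on its own; your own Z will hold different random numbers.

```r
# The values of Z printed earlier in the chapter, typed in by hand
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

Z[1:3]    # 1:3 is shorthand for the positions 1, 2 and 3
# [1] 2.092434 1.969250 1.937359

# A vector built with c() picks out the same values as the seq() call
identical(Z[c(1, 3, 5)], Z[seq(from = 1, to = 5, by = 2)])
# [1] TRUE
```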

The subscript can also ask for the numbers in the vector excluding those specified in the subscript. This is particularly useful if you have some sort of dodgy data point that you want to exclude.

Z[-2]
[1] 2.092434 1.937359 1.937720 2.058071 2.062920 2.106710 2.088391 2.062410
[9] 2.130024

This gives us all the numbers in Z except for the second one.
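To leave out more than one value, a minus sign can be put in front of a vector of positions. This is a small sketch using the values of Z printed above, typed in by hand so that the code runs on its own. Note that the subscript only changes what is returned: Z itself is left untouched.

```r
# The values of Z printed earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

# Everything in Z except the second and fifth numbers
Z[-c(2, 5)]

length(Z[-c(2, 5)])   # two of the ten values have been dropped
# [1] 8

length(Z)             # Z itself still has all ten
# [1] 10
```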

We can include logical expressions in our subscript.

Z[Z>1.95]
[1] 2.092434 1.969250 2.058071 2.062920 2.106710 2.088391 2.062410 2.130024

This returns all the numbers in Z that are greater than 1.95.

Z[Z<=2]
[1] 1.969250 1.937359 1.937720

This gives us all the numbers in Z that are less than or equal to 2.
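It can help to see what is going on inside the square brackets here. An expression like Z <= 2 returns a vector of TRUE and FALSE values, one for each number in Z, and the subscript keeps the numbers where the answer is TRUE. The sketch below uses the values of Z printed above; the object name keep is just an illustrative choice.

```r
# The values of Z printed earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

# The comparison on its own: FALSE for position 1, TRUE for
# positions 2 to 4 and FALSE for all the rest
keep <- Z <= 2

class(keep)
# [1] "logical"

length(keep)      # one TRUE or FALSE for every number in Z
# [1] 10

# Using the stored logical vector as a subscript gives the same answer
identical(Z[keep], Z[Z <= 2])
# [1] TRUE
```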

You can use subscripts to find out useful things about your data. If you want to know how many numbers in Z are less than or equal to 2 you can combine some subscripting with the length() command.

length(Z[Z<=2])
[1] 3
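As an aside, there is a commonly used shortcut for counting. When R does arithmetic on TRUE and FALSE values it treats them as 1 and 0, so adding up the result of the comparison gives the same count without a subscript. A sketch, again using the values of Z printed above:

```r
# The values of Z printed earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

sum(Z <= 2)     # TRUE counts as 1, so this counts the TRUEs
# [1] 3

mean(Z <= 2)    # the proportion of numbers in Z that are <= 2
# [1] 0.3
```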

You can calculate other statistics as well. If you want to know the arithmetic mean of the numbers in Z that are less than or equal to 2 you can use a subscript.

mean(Z[Z<=2])
[1] 1.948109

This approach will work with just about any function. To find out the standard deviation of the same set of numbers use this:

sd(Z[Z<=2])
[1] 0.01830887

and this gives the sum of the numbers in Z that are less than or equal to 2.

sum(Z[Z<=2])
[1] 5.844328

One thing to notice is that using subscripts gives you the values of the numbers that correspond to the criterion¹⁰ you put in the square brackets but doesn’t tell you where in the sequence they are. To do that we can use the function which(). To find out which numbers in Z are less than or equal to 2:

which(Z<=2)
[1] 2 3 4

If we wanted to, we could then use these numbers in a subscript. Here I’m setting up an object that’s a vector of these three numbers.

less.than.2 <- which(Z<=2)

Now I can use this object as a subscript itself.

Z[less.than.2]
[1] 1.969250 1.937359 1.937720

The circle is complete. There is actually a serious point to this last part. There are often several different ways of doing the same thing in R. It is often the case that there’s an obvious “best way,” but that isn’t always the case: sometimes one way of doing something isn’t noticeably easier or better than another, or sometimes doing something one way is better in one situation and doing it another way is better in a different situation. If someone else is doing something differently to you it doesn’t necessarily mean that you are wrong: just check what they’re doing and have a quick think about which method is better for what you’re trying to do. If the answer is “my method,” or if it’s “I can’t see any benefit to using the other method” then stick with what you’re doing.

8.2 Boolean logic and more complex subscripting

We’ve already seen that we can use logical operators such as > or <= in subscripts. We can also combine these to be selective in multiple ways by using operators like & (and) and | (or). If, for example, we wanted to extract the values in Z which are > 1.95 but < 2.1 we could do it like this:

Z[Z > 1.95 & Z < 2.1]
[1] 2.092434 1.969250 2.058071 2.062920 2.088391 2.062410

Notice that you have to give the name of the variable each time you put in a logical operator. If you just try this:

Z[Z > 1.95 & < 2.1]

Error: unexpected '<' in "Z[Z > 1.95 & <"

it doesn’t work.
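The | (or) operator works in the same way, and ! (not) reverses a logical expression. Here is a brief sketch using the values of Z printed earlier in the chapter, typed in by hand so that the code runs on its own.

```r
# The values of Z printed earlier in the chapter
Z <- c(2.092434, 1.969250, 1.937359, 1.937720, 2.058071,
       2.062920, 2.106710, 2.088391, 2.062410, 2.130024)

# Values that are EITHER below 1.95 OR above 2.1
Z[Z < 1.95 | Z > 2.1]
# [1] 1.937359 1.937720 2.106710 2.130024

# ! turns TRUE into FALSE and vice versa, so this asks for everything
# that is NOT between 1.95 and 2.1. None of these numbers is exactly
# 1.95 or 2.1, so the answer is the same.
identical(Z[!(Z > 1.95 & Z < 2.1)], Z[Z < 1.95 | Z > 2.1])
# [1] TRUE
```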

8.3 Subscripts in matrices and data frames

Subscripts can also be used to get individual numbers, rows or columns from matrices and data frames in the same way as for vectors, except two numbers are needed to identify an individual cell in these two dimensional data structures. The first number is the row number and the second is the column number. Here’s another matrix.

mat4 <- matrix(data=seq(101, 112), nrow=3, ncol=4)
mat4
     [,1] [,2] [,3] [,4]
[1,]  101  104  107  110
[2,]  102  105  108  111
[3,]  103  106  109  112

To ask “What’s the number that’s in the third row and second column of mat4?” we put the row number first in the subscript, then a comma, then the column number. NB I always have to stop and think about this because I always think it should be column number then row number to make it like xy coordinates.

mat4[3, 2]
[1] 106

What are the numbers in the third row that are in the second, third and fourth columns?

mat4[3, c(2, 3, 4)]
[1] 106 109 112

To get hold of everything in a particular row, just put the row number followed by a comma and don’t put in a number for the column. For example, if you just want the first row of the matrix use a 1.

mat4[1, ]
[1] 101 104 107 110

Likewise, if you want to get hold of a whole column then leave the row number empty.

mat4[, 3]
[1] 107 108 109

This gives us the third column of the matrix.

mat4[, 1]+mat4[, 3]
[1] 208 210 212

This adds the first column of the matrix to the third column.
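The same row-then-column rule applies to data frames, where columns can also be picked out by name. The sketch below uses a small made-up data frame: plants, height and flowered are hypothetical names and values invented purely for illustration.

```r
# A small made-up data frame (hypothetical data, for illustration only)
plants <- data.frame(height   = c(12.1, 15.3, 9.8, 14.2),
                     flowered = c(TRUE, TRUE, FALSE, TRUE))

plants[2, 1]            # second row, first column
# [1] 15.3

plants[, "height"]      # a whole column, chosen by name
plants$height           # the same column using the $ shortcut

# A logical expression in the row position keeps only the rows
# where it is TRUE: here, the plants that flowered
flowering <- plants[plants$flowered, ]
nrow(flowering)
# [1] 3
```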

8.4 Subset

The subset() function is useful when you want to extract part of a matrix or dataframe. It takes three main arguments: the first is the name of whatever you want to make a subset of, the second is a logical expression that says which rows to keep, and the third tells R which columns you want to choose. It’s best to show this with an example. Here are some data that were collected as part of an experiment looking at the effect of environmental temperature on leucocyte count in fish fry.

fish <- read.csv("Data/Counts.csv", header = TRUE)

Let’s look at the whole dataset to start with.

fish
  Sex Temp Weight Count
1   M  Hot  73.25   282
2   M  Hot  69.28   170
3   F  Hot  81.38   151
4   M  Hot  66.07   238
5   F Cold  83.32   136
6   F Cold  63.06   203
7   M Cold  78.48   312
8   M Cold  55.38   274

If we want to set up a second data frame containing only data from those fish that weighed 70mg or more, we can just specify the first two arguments.

fish2 <- subset(fish, Weight>=70)

fish2
  Sex Temp Weight Count
1   M  Hot  73.25   282
3   F  Hot  81.38   151
5   F Cold  83.32   136
7   M Cold  78.48   312

What if we wanted to extract only the data on weights and leucocyte counts for male fish? For this we use the third argument as well, “select.”

fish3 <- subset(fish, Sex=="M", select=c(Weight, Count))

fish3
  Weight Count
1  73.25   282
2  69.28   170
4  66.07   238
7  78.48   312
8  55.38   274
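The same selection can also be written with subscripts, which ties subset() back to the first part of this chapter. In the sketch below the fish data are typed in by hand from the table above so that the code runs without the CSV file; fish3b is just an illustrative name.

```r
# The fish data from the table above, typed in by hand
fish <- data.frame(
  Sex    = c("M", "M", "F", "M", "F", "F", "M", "M"),
  Temp   = c("Hot", "Hot", "Hot", "Hot", "Cold", "Cold", "Cold", "Cold"),
  Weight = c(73.25, 69.28, 81.38, 66.07, 83.32, 63.06, 78.48, 55.38),
  Count  = c(282, 170, 151, 238, 136, 203, 312, 274)
)

# Using subset(), as in the text
fish3 <- subset(fish, Sex == "M", select = c(Weight, Count))

# Using subscripts: rows where Sex is "M", then the two named columns
fish3b <- fish[fish$Sex == "M", c("Weight", "Count")]

all.equal(fish3, fish3b)   # both routes give the same data frame
# [1] TRUE
```

Which of the two you use is largely a matter of taste: subset() is arguably easier to read, while the subscript version makes the row and column choices explicit.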

One thing to notice here is that when we are specifying male fish only in the second argument we use the double equals sign (==). This is the operator that R uses to test whether two things are equal; a single equals sign does something different (it assigns a value or names an argument). The “M” is in inverted commas because it’s character data. It’s easy to forget and use a single equals sign, or miss out the inverted commas. If you do the latter you’ll get an error message.

fish4 <- subset(fish, Sex==M, select=c(Weight, Count))

Error in eval(e, x, parent.frame()) : object 'M' not found

If you only put a single equals sign in, however, you won’t get an error message. R reads Sex="M" as a named argument rather than as a logical expression, so no rows are filtered out. It will still select the columns specified, but your new object will have data from both male and female fish. This could lead to serious errors in your analysis, so always check.

fish4 <- subset(fish, Sex="M", select=c(Weight, Count))

See? No error message, but when you look at the output from this command you find that it hasn’t been executed in the way you might wish.

fish4
  Weight Count
1  73.25   282
2  69.28   170
3  81.38   151
4  66.07   238
5  83.32   136
6  63.06   203
7  78.48   312
8  55.38   274
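One quick way of checking, sketched below, is to compare the number of rows in the new object with the number you expect. The fish data are typed in by hand from the table earlier in this section so that the code runs without the CSV file.

```r
# The fish data from the table earlier in this section
fish <- data.frame(
  Sex    = c("M", "M", "F", "M", "F", "F", "M", "M"),
  Temp   = c("Hot", "Hot", "Hot", "Hot", "Cold", "Cold", "Cold", "Cold"),
  Weight = c(73.25, 69.28, 81.38, 66.07, 83.32, 63.06, 78.48, 55.38),
  Count  = c(282, 170, 151, 238, 136, 203, 312, 274)
)

sum(fish$Sex == "M")    # how many male fish should there be?
# [1] 5

# The mistaken version with a single equals sign keeps every row
fish4 <- subset(fish, Sex = "M", select = c(Weight, Count))
nrow(fish4)
# [1] 8

# The correct version with == keeps only the males
fish3 <- subset(fish, Sex == "M", select = c(Weight, Count))
nrow(fish3)
# [1] 5
```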

subset() can also be used within other functions: if, for example, you only want to analyse part of a dataset but you don’t want to set up a whole new object. We’ll see some examples of this when we look at statistical model fitting in more detail.
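As a small taste of this, here is a sketch that calculates the mean leucocyte count for male fish without creating a new data frame first. The fish data are typed in by hand from the table earlier in this section.

```r
# The fish data from the table earlier in this section
fish <- data.frame(
  Sex    = c("M", "M", "F", "M", "F", "F", "M", "M"),
  Temp   = c("Hot", "Hot", "Hot", "Hot", "Cold", "Cold", "Cold", "Cold"),
  Weight = c(73.25, 69.28, 81.38, 66.07, 83.32, 63.06, 78.48, 55.38),
  Count  = c(282, 170, 151, 238, 136, 203, 312, 274)
)

# subset() is called inside mean(): the $ picks out the Count column
# from the data frame that subset() returns
mean(subset(fish, Sex == "M")$Count)
# [1] 255.2

# The same calculation using a subscript instead
mean(fish$Count[fish$Sex == "M"])
# [1] 255.2
```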

8.5 Exercises

NB These exercises are now available as interactive learnr tutorials: contact the author for a copy

8.5.1 Subscripts and vectors

Create a vector called x1 containing the numbers 3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4 and 4.7.
Use a subscript to find out the value of the 3rd number in vector x1
Use a subscript to find out the value of the numbers in vector x1 that aren’t in the 5th position
Add the 1st number in vector x1 to the 6th number in vector x1
Create a new vector called “In” which consists of the numbers 1 and 4
Use subscripts and the “In” vector to calculate the sum of the 1st and 4th numbers in x1
Calculate the sum of all the numbers in x1 that are less than 4.6
Calculate the mean of all the numbers in x1 that are greater than or equal to 5

8.5.2 Subscripts and matrices

Generate a matrix called mat1 with 3 rows and 3 columns, using the data from the x1 vector as above. Use the default options for the matrix() function so that the matrix is filled by column.
Multiply the second value in the first row of mat1 by the third value in the second row of mat1
Create a new vector called “V2” which consists of the numbers in the first row of mat1 added to the numbers in the second row of mat1
Create a new vector called “V3” which consists of the numbers in the second column of mat1 multiplied by the mean of the numbers in the second row of mat1
Create a new matrix called “mat3” which consists of the first row of mat1 as the first column and then the third row of mat1 as the second column. Don’t forget that you have to give matrix() a vector of data to fill the new matrix with so you’ll have to use the c() function to generate a new vector from the first and third rows of mat1. You can either do this first and create a new object or you can do it within the matrix() function call.

8.6 Answers to exercises

8.6.1 Subscripts and vectors

Create a vector called x1 containing the numbers 3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4 and 4.7.

x1 <- c(3.6, 3.2, 5.6, 4.9, 6.0, 3.7, 5.5, 4.4, 4.7)

Use a subscript to find out the value of the 3rd number in vector x1

x1[3]
[1] 5.6

Use a subscript to find out the value of the numbers in vector x1 that aren’t in the 5th position

x1[-5]
[1] 3.6 3.2 5.6 4.9 3.7 5.5 4.4 4.7

Add the 1st number in vector x1 to the 6th number in vector x1

#Lots of options for this. Either:
x1[1]+x1[6]
[1] 7.3

#or use the sum() function
sum(x1[1], x1[6])
[1] 7.3

#or even do it with a new vector that you generate within the subscript
sum(x1[c(1,6)])
[1] 7.3

Create a new vector called “In” which consists of the numbers 1 and 4

In <- c(1,4)

Use subscripts and the “In” vector to calculate the sum of the 1st and 4th numbers in x1

sum(x1[In])
[1] 8.5

Calculate the sum of all the numbers in x1 that are less than 4.6

sum(x1[x1<4.6])
[1] 14.9

Calculate the mean of all the numbers in x1 that are greater than or equal to 5

mean(x1[x1>=5])
[1] 5.7

8.6.2 Subscripts and matrices

Generate a matrix called mat1 with 3 rows and 3 columns, using the data from the x1 vector as above. Use the default options for the matrix() function so that the matrix is filled by column.

mat1 <- matrix(data = x1, nrow = 3, ncol = 3)

Multiply the second value in the first row of mat1 by the third value in the second row of mat1

mat1[1,2] * mat1[2,3]
[1] 21.56

Create a new vector called “V2” which consists of the numbers in the first row of mat1 added to the numbers in the second row of mat1

V2 <- mat1[1, ] + mat1[2, ]
V2
[1]  6.8 10.9  9.9

Create a new vector called “V3” which consists of the numbers in the second column of mat1 multiplied by the mean of the numbers in the second row of mat1

V3 <- mat1[, 2] * mean(mat1[2, ])
V3
[1] 22.21333 27.20000 16.77333

Create a new matrix called “mat3” which consists of the first row of mat1 as the first column and then the third row of mat1 as the second column

# Either generate the new vector of data separately:
mat3.data <- c(mat1[1, ], mat1[3, ])
mat3 <- matrix(mat3.data, ncol = 2)

# or do it all in one go within the matrix function.
# This is cleaner because you don't end up with a
# superfluous object (mat3.data) sitting around in your workspace
mat3 <- matrix(c(mat1[1, ], mat1[3, ]), ncol = 2)


mat3       
     [,1] [,2]
[1,]  3.6  5.6
[2,]  4.9  3.7
[3,]  5.5  4.7

¹⁰ NB for students: this word is the singular of “criteria.”