5.5 Indexing data.frame objects

Our cats data frame has rows and columns (it has 2 dimensions). To extract specific data from it, we need to specify the “coordinates” of what we want. Row numbers come first, followed by column numbers (i.e. [row, column]).

cats[1, 2]   # first element in the 2nd column of the data frame
cats[1, 6]   # first element in the 6th column
cats[1:3, 7] # first three elements in the 7th column
cats[3, ]    # the 3rd row, with all columns
cats[, 7]    # the entire 7th column
head_meta <- cats[1:6, ] # rows 1 to 6, the same as head(cats)
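As a self-contained sketch of the [row, column] pattern, the block below builds a tiny stand-in data frame. The name cats_demo and its values are made up for illustration and are not the lesson's cats data; only the column names coat, weight and roamer follow the ones used in this section.

```r
# Hypothetical stand-in for the lesson's data (values are invented)
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby", "black"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  roamer = c("yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

dim(cats_demo)                     # number of rows, then number of columns

cats_demo[1, 2]                    # row 1, column 2 (the first weight): 2.1
first_three <- cats_demo[1:3, 2]   # rows 1 to 3 of column 2, as a vector
third_row   <- cats_demo[3, ]      # all columns of row 3, still a data frame
all_weights <- cats_demo[, 2]      # every row of column 2, as a vector
```

Leaving one side of the comma empty means "everything" along that dimension, which is why cats_demo[3, ] keeps all columns and cats_demo[, 2] keeps all rows.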

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? Oh, right… they are in column 6.) The position of a variable can also change if the script you are using adds or removes columns. It’s therefore often better to refer to a variable by its column name, which also makes your code easier to read and your intentions clearer.

You can do operations on a particular column by selecting it with the $ sign. In this case, the entire column is returned as a vector. You can use names(cats) or colnames(cats) to remind yourself of the column names. For instance, to extract all the cats’ weight information from our dataset:

cats$weight
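Because $ returns a plain vector, any vector function can be applied to the result. The sketch below uses a made-up data frame, cats_demo, whose values are invented for illustration:

```r
# Hypothetical stand-in for the lesson's data (values are invented)
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby", "black"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  stringsAsFactors = FALSE
)

names(cats_demo)              # reminder of the available column names

weights <- cats_demo$weight   # a numeric vector, one value per row
mean(weights)                 # average weight across all rows
max(weights)                  # largest weight
weights > 4                   # element-wise comparison gives TRUE/FALSE per row
```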

In some cases, you may want to select more than one column. You can do this using the square brackets, passing in a vector of the columns to select. Suppose we wanted weight and coat information:

cats[ , c("weight", "coat")]

You can even access columns by column name and select specific rows of interest. For example, if we wanted the weight and coat of just rows 4 through 7, we could do:

cats[4:7, c("weight", "coat")]
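To see why column names are more robust than column numbers, the sketch below adds a column to the front of a small made-up data frame. The names cats_demo, cats_wide and the id column are hypothetical and only serve the illustration.

```r
# Hypothetical stand-in for the lesson's data (values are invented)
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby", "black"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  stringsAsFactors = FALSE
)

# Simulate a script that adds a new first column: every position shifts by one
cats_wide <- cbind(id = 1:4, cats_demo)

names(cats_demo)[2]   # "weight"
names(cats_wide)[2]   # "coat": column 2 now means something else

# Selecting by name gives the same data before and after the change
by_name_before <- cats_demo[2:3, c("weight", "coat")]
by_name_after  <- cats_wide[2:3, c("weight", "coat")]
identical(by_name_before$weight, by_name_after$weight)   # TRUE
```

The columns in the result appear in the order they were requested, so c("weight", "coat") returns weight first even though coat comes first in the data frame.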

We can also use logical statements to select and filter items from a data.frame. For example, to select all rows with black cats, we could use the following statement:

cats[cats$coat == "black", ]

Let’s break this down a bit. The logical statement in the brackets returns a vector of TRUE and FALSE values:

cats$coat == "black"

These booleans allow us to select which records we want from our data.frame: rows where the value is TRUE are kept, and rows where it is FALSE are dropped.
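The two steps can be made explicit by storing the logical vector in its own variable before using it as a row index. This is a minimal sketch on a made-up data frame; cats_demo, is_black and the values are assumptions for illustration.

```r
# Hypothetical stand-in for the lesson's data (values are invented)
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby", "black"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  stringsAsFactors = FALSE
)

# Step 1: one TRUE/FALSE value per row
is_black <- cats_demo$coat == "black"
is_black        # FALSE TRUE FALSE TRUE
sum(is_black)   # TRUE counts as 1, so this is the number of black cats: 2

# Step 2: use the logical vector in the row position to keep only TRUE rows
black_cats <- cats_demo[is_black, ]
```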

Another way to do this is with the function which(), which returns the indexes of the records for which a logical statement is TRUE:

which(cats$coat == "black")

So, we could also write:

cats[which(cats$coat == "black"), ]

But that’s getting really long and ugly. R is already considered somewhat of an ugly duckling among programming languages, so there’s no reason to play into the stereotype.
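One situation where the longer which() form can still earn its keep is missing data. A comparison against NA yields NA, and indexing rows with a logical vector that contains NA produces an extra all-NA row, whereas which() only returns the positions that are TRUE. The sketch below shows the difference on a made-up data frame; cats_na and its values are assumptions for illustration.

```r
# Hypothetical data with a missing coat value (values are invented)
cats_na <- data.frame(
  coat   = c("black", NA, "tabby", "black"),
  weight = c(5.0, 3.1, 3.2, 4.4),
  stringsAsFactors = FALSE
)

is_black <- cats_na$coat == "black"   # TRUE NA FALSE TRUE

# Logical indexing: the NA turns into a row filled with NA values
with_logical <- cats_na[is_black, ]
nrow(with_logical)                    # 3

# which() keeps only the positions that are TRUE (here 1 and 4)
with_which <- cats_na[which(is_black), ]
nrow(with_which)                      # 2
```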

We can combine logical statements and index statements:

cats[cats$coat == "black", c("coat", "weight")]

Finally, we can use &, the symbol for “and”, and |, the symbol for “or”, to combine logical statements.

cats[cats$coat == "black" & cats$roamer == "no", ]

This statement selects all records for black cats whose roamer value is "no".
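A short sketch of both operators on a made-up data frame follows. The name cats_demo and its values are invented for illustration; the column names match the ones used above.

```r
# Hypothetical stand-in for the lesson's data (values are invented)
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby", "black"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  roamer = c("yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# "and": both conditions must be TRUE for a row to be kept
black_non_roamers <- cats_demo[cats_demo$coat == "black" & cats_demo$roamer == "no", ]
nrow(black_non_roamers)   # 1 (only row 2 is black with roamer "no")

# "or": a row is kept if at least one condition is TRUE
black_or_tabby <- cats_demo[cats_demo$coat == "black" | cats_demo$coat == "tabby", ]
nrow(black_or_tabby)      # 3 (rows 2, 3 and 4)
```

Note that the single-character forms & and | work element by element on whole vectors, which is what row filtering needs. The doubled forms && and || are meant for single TRUE/FALSE values, such as the condition of an if statement.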