7.2 Subsetting Data

The first two dplyr commands we’ll use help us to subset our data by rows and columns.

7.2.1 select

The first command we’ll use is select, which allows us to choose columns from our dataset. Let’s use our cats dataset and select only the coat column; we did this previously with bracket-indexing:

cats[, "coat"]

With dplyr, we don’t need to enclose our column names in quotes:

select(cats, coat)

Notice how the output differs slightly: the bracket-indexing method returned a simple vector, whereas select returns a data.frame. All of the main dplyr “verbs” we’ll talk about behave consistently in this way, taking a data.frame as input and returning a data.frame as their result.

We can select more columns by giving select additional arguments, and the columns of our output data.frame will appear in the order of our arguments:

select(cats, coat, cat_id)
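To make the difference between the two approaches concrete, here is a minimal sketch. The cats_demo data.frame and its values are invented stand-ins for the cats dataset, so the exact columns are an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats_demo <- data.frame(
  cat_id = c(1, 2, 3),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)

coat_vec <- cats_demo[, "coat"]     # one column via brackets drops to a vector
coat_df <- select(cats_demo, coat)  # select keeps the data.frame structure

is.data.frame(coat_vec)  # FALSE
is.data.frame(coat_df)   # TRUE

# the output columns follow the order of the arguments
reordered <- select(cats_demo, coat, cat_id)
names(reordered)
```

Because select always returns a data.frame, its result can be passed straight into another dplyr verb.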

7.2.2 filter

So where select allows us to choose columns, filter operates on rows. Say we want to see all the cats with black coats; we saw earlier how to do that using bracket-indexing:

cats[cats$coat == "black", ]

In dplyr, this looks like:

filter(cats, coat == "black")

Notice we don’t have to use the $ operator to tell filter where the coat column is; filter evaluates the condition inside the data.frame we passed in, so coat refers to that data.frame’s coat column.
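filter also accepts more than one condition. The sketch below uses an invented cats_demo data.frame (its columns and values are assumptions, not the real cats data) to show that conditions separated by commas must all hold, while | keeps rows matching either condition.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats_demo <- data.frame(
  cat_id = 1:4,
  coat = c("black", "tabby", "black", "calico"),
  age = c(2, 5, 8, 3),
  stringsAsFactors = FALSE
)

# a single condition, as in the example above
black_cats <- filter(cats_demo, coat == "black")

# conditions separated by commas are combined with AND
old_black_cats <- filter(cats_demo, coat == "black", age > 4)

# use | when either condition is enough
black_or_calico <- filter(cats_demo, coat == "black" | coat == "calico")

old_black_cats
```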

7.2.3 arrange

Maybe you have a set of observations in your data that you want to organize by their value. arrange allows us to change the order of rows in our dataset based on the values in one or more columns, sorting in ascending order by default.

arrange(cats, coat)

# additional columns break ties among rows that match on the earlier columns
arrange(cats, coat, sex)
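As a short sketch of how the sort order can be controlled, the example below wraps a column in dplyr’s desc() to sort from largest to smallest. The cats_demo data.frame and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats_demo <- data.frame(
  coat = c("tabby", "black", "calico", "black"),
  sex = c("male", "male", "female", "female"),
  weight = c(3.2, 5.0, 2.1, 4.4),
  stringsAsFactors = FALSE
)

# default: smallest weight first
by_weight <- arrange(cats_demo, weight)

# desc() reverses the order for that column: largest weight first
by_weight_desc <- arrange(cats_demo, desc(weight))

# sort by coat, then by sex within each coat
by_coat_then_sex <- arrange(cats_demo, coat, sex)

by_weight_desc
```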

7.2.4 mutate

One common task in working with data is updating/cleaning some of the values in columns. mutate allows us to do this relatively easily. Let’s say I don’t want a lot of decimal places in one of my measurements. I can use mutate to update my existing variable:

mutate(cats, weight = round(weight, 2))

Another common task is generating a new column based on values that are already in the dataset you are working on. mutate helps us do this, and tacks the new column onto the end of our data.frame.

# let's say you want to add two variables together
mutate(cats, new_variable = age + weight)

# you can include as many new variables as you want, separated by a comma
mutate(cats, new_var_1 = age + weight, new_var_2 = age * weight)
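One detail worth seeing in action: like the other verbs, mutate returns a new data.frame rather than changing the original, so we need to assign the result if we want to keep it. The sketch below uses an invented cats_demo data.frame, and the new column names are hypothetical. It also shows that a later expression can use a column created earlier in the same call.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats_demo <- data.frame(
  cat_id = 1:3,
  age = c(2, 5, 8),
  weight = c(2.123, 5.018, 3.256)
)

# mutate returns a modified copy; cats_demo itself is untouched
rounded <- mutate(cats_demo, weight = round(weight, 2))

# assign the result to keep it; weight_per_year uses weight_rounded,
# which is created earlier in the same call
cats_extra <- mutate(
  cats_demo,
  weight_rounded = round(weight, 1),
  weight_per_year = weight_rounded / age
)

names(cats_extra)
```

Assigning back to the same name, as in cats_demo <- mutate(cats_demo, ...), is a common way to update a dataset step by step.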