7.2 Subsetting Data
The first two dplyr commands we’ll use help us subset our data by rows and columns.
7.2.1 select
The first command we’ll use is select, which allows us to choose columns from our dataset. Let’s use our cats dataset and select only the coat column; we did this previously with
cats[, "coat"]
With dplyr, we don’t need to enclose our column names in quotes:
select(cats, coat)
Notice how the output differs slightly. The bracket-indexing method returned a simple vector, whereas all of the main dplyr “verbs” we’ll talk about behave consistently: their inputs and outputs are both data.frames.
We can select more columns by giving select additional arguments, and the columns of our output data.frame will appear in the order of those arguments:
select(cats, coat, cat_id)
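To make the difference in return types concrete, here is a minimal sketch. It assumes dplyr is installed and uses a small made-up stand-in for cats (the values are invented for illustration, and cats is assumed to be a base data.frame rather than a tibble, since a tibble would not drop to a vector under bracket indexing).

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats <- data.frame(
  cat_id = c(1, 2, 3),
  coat   = c("black", "tabby", "black"),
  age    = c(2.5, 4.0, 1.2),
  stringsAsFactors = FALSE
)

# Bracket indexing with a single column drops down to a plain vector
by_bracket <- cats[, "coat"]
is.data.frame(by_bracket)

# select() keeps the data.frame structure, even for a single column
by_select <- select(cats, coat)
is.data.frame(by_select)

# Columns come back in the order they were requested, not their original order
reordered <- select(cats, coat, cat_id)
names(reordered)
```

The first check should report FALSE and the second TRUE, and the final names() call should list coat before cat_id.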
7.2.2 filter
So where select allowed us to select columns, filter operates on rows. Say we want to see all the cats with black coats; we saw earlier how to do that using bracket-indexing:
cats[cats$coat == "black", ]
In dplyr, this looks like:
filter(cats, coat == "black")
Notice we don’t have to use the $ operator to tell filter where the coat column is; it’s smart enough to assume we want the coat column from the data.frame we passed in.
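As a quick check that the two approaches pick out the same rows, the sketch below runs both on a small made-up version of cats (the values are invented for illustration). It also shows that filter accepts several conditions separated by commas, all of which must hold for a row to be kept.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats <- data.frame(
  cat_id = c(1, 2, 3, 4),
  coat   = c("black", "tabby", "black", "calico"),
  age    = c(2.5, 4.0, 1.2, 3.3),
  stringsAsFactors = FALSE
)

# Bracket indexing needs cats$ to locate the column
black_bracket <- cats[cats$coat == "black", ]

# filter() looks up coat inside the data.frame it was given
black_filter <- filter(cats, coat == "black")

# Both approaches keep the same cats (cat_id 1 and 3)
black_bracket$cat_id
black_filter$cat_id

# Conditions separated by commas are combined with AND
older_black <- filter(cats, coat == "black", age > 2)
older_black$cat_id
```

One small difference to be aware of: bracket indexing keeps the original row names, while filter renumbers the rows of its result.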
7.2.3 arrange
Maybe you have a set of observations in your data that you want to organize by their value. arrange allows us to change the order of rows in our dataset based on the values in one or more columns.
arrange(cats, coat)
# you can include additional columns to help sort the data
arrange(cats, coat, sex)
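To see how the second column breaks ties, here is a small sketch on a made-up version of cats (the values are invented for illustration). Rows are ordered by coat first, and sex only decides the order among cats that share a coat.

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats <- data.frame(
  cat_id = c(1, 2, 3, 4),
  coat   = c("tabby", "black", "tabby", "black"),
  sex    = c("male", "male", "female", "female"),
  stringsAsFactors = FALSE
)

# Sort by coat, then by sex within each coat
by_coat_sex <- arrange(cats, coat, sex)

# black/female, black/male, tabby/female, tabby/male
by_coat_sex$cat_id

# The input is left untouched; arrange() returns a reordered copy
cats$cat_id
```

With this toy data the sorted cat_id values come out as 4, 2, 3, 1, while cats itself keeps its original order.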
7.2.4 mutate
One common task in working with data is updating or cleaning some of the values in columns. mutate allows us to do this relatively easily. Let’s say I don’t want a lot of decimal places in one of my measurements. I can use mutate to update my existing variable:
mutate(cats, weight = round(weight, 2))
Another common task is generating a new column based on values that are already in the dataset you are working on. mutate helps us do this, and tacks the new column onto the end of our data.frame.
# let's say you want to add two variables together
mutate(cats, new_variable = age + weight)
# you can include as many new variables as you want, separated by a comma
mutate(cats, new_var_1 = age + weight, new_var_2 = age * weight)
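One point worth checking for yourself is that mutate returns a modified copy rather than changing cats in place, so the result has to be assigned to a name if we want to keep it. The sketch below illustrates this on a small made-up version of cats (the values are invented for illustration).

```r
library(dplyr)

# Hypothetical stand-in for the cats dataset; the values are invented
cats <- data.frame(
  cat_id = c(1, 2),
  age    = c(2.5, 4.0),
  weight = c(3.14159, 4.56789)
)

# Overwrite an existing column: assign the result to keep it
rounded <- mutate(cats, weight = round(weight, 2))
rounded$weight

# The original data.frame still holds the unrounded values
cats$weight

# A new column is appended after the existing ones
with_sum <- mutate(cats, new_variable = age + weight)
names(with_sum)
```

Here rounded$weight holds 3.14 and 4.57, cats$weight is unchanged, and new_variable appears as the last column name of with_sum.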