7.4 Combining Select & Filter
The pipe is really helpful when combined with the data-manipulation functions of dplyr. Remember how we used filter to select only the black cats? What if we only want to see the IDs of those cats, rather than all the info about them? We’ve already seen we can use select to pick out certain columns. We can use that to select the cat_id column from our filtered dataset like so:
# reading from the inside out
select(filter(cats, coat == "black"), cat_id)
That might not look too bad now, but what if we wanted to do another operation on that output? We’d add another layer of nesting, and having to read that line from the inside-out can quickly become annoying. We can use the pipe operator to clean that up.
# reading from left to right
filter(cats, coat == "black") %>% select(cat_id)
We could even add another pipe to feed cats into filter; it isn’t necessary, but it makes it even easier to see what we’re operating on in this chain of commands. We’ll combine this with some line breaks to really make this easy to read:
cats %>%
  filter(coat == "black") %>%
  select(cat_id)
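To convince ourselves that the nested and piped forms really do the same thing, we can run both on a tiny data.frame and compare the results. This is only a sketch: cats_demo and its values are made up for illustration and stand in for the real cats data, which has more columns.

```r
library(dplyr)

# Hypothetical stand-in for the cats data (values invented for illustration)
cats_demo <- data.frame(
  cat_id = c(101, 102, 103, 104),
  coat   = c("black", "calico", "black", "tabby"),
  weight = c(4.0, 5.0, 6.0, 3.5)
)

# Nested form: read from the inside out
nested <- select(filter(cats_demo, coat == "black"), cat_id)

# Piped form: read from top to bottom
piped <- cats_demo %>%
  filter(coat == "black") %>%
  select(cat_id)

# Compare the two results; we expect them to match
identical(nested, piped)
```

Each pipe simply hands the result on its left to the function on its right as the first argument, so the two versions should differ only in how easy they are to read.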
7.4.1 summarize
While mutate creates new columns, it’s often useful to summarize multiple rows into a single value. Say we want to find the mean weight of all these cats; enter summarize! Like mutate, the arguments to summarize (after the data.frame we want to operate on) are expressions. We can combine summarize with the mean function to get a mean weight for our collection of cats like so:
cats %>% summarize(mean_weight = mean(weight))
Notice how we have only a single value returned, but it’s still in a data.frame format. This is subtle, but important; all these basic dplyr verbs take in data.frames and also return data.frames. This consistency helps make long chains of dplyr operations possible.
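We can check the "data.frame in, data.frame out" behavior directly. The sketch below again uses a made-up cats_demo data.frame (not the real cats data) and shows one way to get the bare number out with pull when we do want a plain value.

```r
library(dplyr)

# Hypothetical stand-in for the cats data (values invented for illustration)
cats_demo <- data.frame(
  cat_id = c(101, 102, 103, 104),
  coat   = c("black", "calico", "black", "tabby"),
  weight = c(4.0, 5.0, 6.0, 3.5)
)

result <- cats_demo %>% summarize(mean_weight = mean(weight))

# The summary is a one-row data.frame, not a bare number
is.data.frame(result)
nrow(result)

# pull() extracts a single column as an ordinary vector
mean_weight_value <- result %>% pull(mean_weight)
mean_weight_value
```

Because result is still a data.frame, we could keep piping it into further dplyr verbs instead of pulling the value out.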
7.4.2 group_by
A very common data analysis task is to do operations like we did above, but to do them on a group-by-group basis. To do this with dplyr, we’ll use the group_by function.
Let’s look at the mean weights of our cats, grouping up by coat. This will give us the mean weight of the black cats, mean weight of the calico cats, etc. We can do this by inserting a group_by function into our earlier expression for computing mean weight:
cats %>%
  group_by(coat) %>%
  summarize(mean_weight = mean(weight))
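To see what the grouped summary produces, here is a small sketch on a made-up cats_demo data.frame (the values are invented and only stand in for the real cats data). With three distinct coats we expect one row per coat.

```r
library(dplyr)

# Hypothetical stand-in for the cats data (values invented for illustration)
cats_demo <- data.frame(
  cat_id = c(101, 102, 103, 104),
  coat   = c("black", "calico", "black", "tabby"),
  weight = c(4.0, 5.0, 6.0, 3.5)
)

by_coat <- cats_demo %>%
  group_by(coat) %>%
  summarize(mean_weight = mean(weight))

# One row per coat; the two black cats (4.0 and 6.0) are averaged together
by_coat
```

Without the group_by step, the same summarize call would collapse all four cats into a single row.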
Ta-da!
We can also use mutate on a per-group basis. Let’s make a new column which centers our weights around zero; this can be done by subtracting the group’s mean weight from each cat’s weight:
cats %>%
  group_by(coat) %>%
  mutate(centered_weight = weight - mean(weight))
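A quick way to check the centering is to confirm that the centered weights average to zero within each coat. The sketch below uses a made-up cats_demo data.frame (invented values, not the real cats data). The ungroup call at the end is optional; we include it on the assumption that later steps should treat the rows as one ungrouped table again.

```r
library(dplyr)

# Hypothetical stand-in for the cats data (values invented for illustration)
cats_demo <- data.frame(
  cat_id = c(101, 102, 103, 104),
  coat   = c("black", "calico", "black", "tabby"),
  weight = c(4.0, 5.0, 6.0, 3.5)
)

centered <- cats_demo %>%
  group_by(coat) %>%
  mutate(centered_weight = weight - mean(weight))

# Unlike summarize, mutate keeps every row: still four cats
nrow(centered)

# Within each coat, the centered weights should average to zero
check <- centered %>%
  summarize(mean_centered = mean(centered_weight))
check

# Drop the grouping so later verbs operate on the whole table
centered <- centered %>% ungroup()
```

Here mean(weight) is computed separately for each coat, so a cat is compared with the other cats sharing its coat, not with the whole collection.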