7.4 Combining Select & Filter

The pipe is really helpful when combined with the data-manipulation functions of dplyr. Remember how we used filter to select only the black cats? What if we only want to see the IDs of those cats, rather than all the info about them? We’ve already seen that we can use select to pick out certain columns. We can use it to select the cat_id column from our filtered dataset like so:

# reading from the inside out
select(filter(cats, coat == "black"), cat_id)

That might not look too bad now, but what if we wanted to do another operation on that output? We’d add another layer of nesting, and having to read that line from the inside out can quickly become annoying. We can use the pipe operator to clean that up:

# reading from left to right
filter(cats, coat == "black") %>% select(cat_id)

We could even add another pipe to feed cats into filter; it isn’t necessary, but it makes it even easier to see what we’re operating on in this chain of commands. We’ll combine this with some line breaks to really make this easy to read:

cats %>%
  filter(coat == "black") %>%
  select(cat_id)
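
To convince ourselves that the nested and piped versions really do the same thing, here is a small sketch we can run. The cats data frame below is a made-up stand-in (five invented rows with the cat_id, coat, and weight columns used in this chapter), not the real dataset, so treat the values as placeholders.

```r
library(dplyr)

# Hypothetical toy data standing in for the real cats dataset
cats <- data.frame(
  cat_id = c(1, 2, 3, 4, 5),
  coat   = c("black", "calico", "black", "tabby", "calico"),
  weight = c(4.0, 3.5, 5.0, 4.5, 2.5)
)

# Nested form: read from the inside out
nested <- select(filter(cats, coat == "black"), cat_id)

# Piped form: read from top to bottom
piped <- cats %>%
  filter(coat == "black") %>%
  select(cat_id)

# Both forms run the same two steps, so the results match
identical(nested, piped)
```

The pipe simply passes whatever is on its left as the first argument of the function on its right, which is why the two forms are interchangeable.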

7.4.1 summarize

While mutate creates new columns, it’s often useful to summarize multiple rows into a single value. Say we want to find the mean weight of all these cats; enter summarize! Like mutate, the arguments to summarize (after the data.frame we want to operate on) are expressions. We can combine summarize with the mean function to get a mean weight for our collection of cats like so:

cats %>% summarize(mean_weight = mean(weight))

Notice how we have only a single value returned, but it’s still in a data.frame format. This is subtle, but important; all these basic dplyr verbs take in data.frames and also return data.frames. This consistency helps make long chains of dplyr operations possible.
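
We can check the "data.frame in, data.frame out" behaviour directly. This is a minimal sketch on invented toy data (the cats values below are placeholders, not the real dataset); pull is a dplyr helper for the case where we want the bare number rather than a one-cell data.frame.

```r
library(dplyr)

# Hypothetical toy data standing in for the real cats dataset
cats <- data.frame(
  cat_id = c(1, 2, 3, 4, 5),
  coat   = c("black", "calico", "black", "tabby", "calico"),
  weight = c(4.0, 3.5, 5.0, 4.5, 2.5)
)

result <- cats %>% summarize(mean_weight = mean(weight))

is.data.frame(result)  # the summary is still a data.frame
dim(result)            # one row, one column

# Extract the single value as a plain numeric vector
mean_value <- result %>% pull(mean_weight)
mean_value             # 3.9 for this toy data
```

If the weight column could contain missing values, mean(weight, na.rm = TRUE) would be needed to avoid getting NA back.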

7.4.2 group_by

A very common data analysis task is to do operations like we did above, but to do them on a group-by-group basis. To do this with dplyr, we’ll use the group_by function.

Let’s look at the mean weights of our cats, grouped by coat. This will give us the mean weight of the black cats, the mean weight of the calico cats, and so on. We can do this by inserting a group_by call into our earlier expression for computing mean weight:

cats %>%
  group_by(coat) %>%
  summarize(mean_weight = mean(weight))

Ta-da!
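
Here is a runnable sketch of the grouped summary, again on invented toy data (the cats values are placeholders, not the real dataset). As a small extension that is not part of the example above, it also counts the cats in each group with dplyr’s n(), which is often worth having next to a mean.

```r
library(dplyr)

# Hypothetical toy data standing in for the real cats dataset
cats <- data.frame(
  cat_id = c(1, 2, 3, 4, 5),
  coat   = c("black", "calico", "black", "tabby", "calico"),
  weight = c(4.0, 3.5, 5.0, 4.5, 2.5)
)

by_coat <- cats %>%
  group_by(coat) %>%
  summarize(
    mean_weight = mean(weight),  # one mean per coat
    n_cats      = n()            # number of rows in each group
  )

# One row per coat: black, calico, tabby
by_coat
```

The result has one row for each distinct value of coat, rather than the single row we got from summarize without group_by.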

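We can also use mutate on a per-group basis. Let’s make a new column which centers the weights within each coat group around zero; this can be done by subtracting the group’s mean weight from each cat’s weight. Unlike summarize, mutate keeps every row, so each cat gets its own centered value: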

cats %>%
  group_by(coat) %>%
  mutate(centered_weight = weight - mean(weight))
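
A quick way to sanity-check the centering is to confirm that the centered weights average to zero within each coat. The sketch below uses invented toy data (the cats values are placeholders, not the real dataset). It also calls ungroup() at the end, which is an addition on our part: a grouped result stays grouped, and dropping the grouping avoids surprises in later steps.

```r
library(dplyr)

# Hypothetical toy data standing in for the real cats dataset
cats <- data.frame(
  cat_id = c(1, 2, 3, 4, 5),
  coat   = c("black", "calico", "black", "tabby", "calico"),
  weight = c(4.0, 3.5, 5.0, 4.5, 2.5)
)

centered <- cats %>%
  group_by(coat) %>%
  mutate(centered_weight = weight - mean(weight)) %>%
  ungroup()  # drop the grouping once we are done with it

# Still five rows: mutate adds a column but keeps every cat
nrow(centered)

# Within each coat, the centered weights should average to zero
check <- centered %>%
  group_by(coat) %>%
  summarize(mean_centered = mean(centered_weight))
check
```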