install.packages("dplyr")
Grouped Summaries with dplyr in R
Introduction
One of the most common tasks in data analysis is to summarize data by groups. For example, you might want to calculate the average horsepower of cars by the number of cylinders, or the median height of Star Wars characters by their species. In this post, I will show you how to use the dplyr package in R to create summaries per group in a simple and elegant way.
dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
Some of the advantages of using dplyr compared to base code are:
- It is fast and efficient
- It is intuitive and expressive, as it follows a clear and consistent syntax and avoids using $ or [] when manipulating data
- dplyr allows you to chain multiple transformations, making for more elegant code.
How to install and load dplyr
If you haven’t used dplyr before, you will need to install it first. Assuming you already have R installed, this can be done by running the following in the console:
Once installed, dplyr can be loaded simply by adding the following to your code:
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The Challenge
Suppose we are dealing with a dataset which contains categorical or ordinal grouping:
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Now suppose we want to know the average horsepower (hp) for each group of cars based on their total cylinders (cyl)
Depending on what the situation calls for, this can be done in one of 2 ways:
- We can group the data by cylinder, and generate a summary
- We can group the data by cylinder, and generate a summary, but maintain the original data structure
How to create summaries per group using dplyr
The main function to create summaries per group using dplyr is summarise() (or summarize()). This function takes a data frame or a grouped data frame as input, and returns a new data frame with one row for each group and one column for each summary statistic.
To group a data frame by one or more variables, you can use the group_by() function. This function creates a special object called a grouped data frame, which preserves the grouping information for subsequent operations.
Let’s see an example using the built-in mtcars data set, which contains information about 32 cars. Suppose we want to calculate the average horsepower (hp) of cars by the number of cylinders (cyl). We can do this using dplyr as follows:
%>%
mtcars group_by(cyl) %>%
summarise(avg_hp = mean(hp))
# A tibble: 3 × 2
cyl avg_hp
<dbl> <dbl>
1 4 82.6
2 6 122.
3 8 209.
We can see that cars with 4 cylinders have an average horsepower of 82.6, cars with 6 cylinders have an average horsepower of 122, and cars with 8 cylinders have an average horsepower of 209.
We can also group by more than one variable, by passing multiple arguments to group_by(). For example, suppose we want to calculate the average horsepower of cars by the number of cylinders and the type of transmission (am, where 0 = automatic and 1 = manual). We can do this using dplyr as follows:
%>%
mtcars group_by(cyl, am) %>%
summarise(avg_hp = mean(hp))
The output is a new data frame with three columns: cyl, am, and avg_hp:
# A tibble: 6 × 3
# Groups: cyl [3]
cyl am avg_hp
<dbl> <dbl> <dbl>
1 4 0 84.7
2 4 1 81.9
3 6 0 115.
4 6 1 132.
5 8 0 194.
6 8 1 300.
How to create summaries per group using dplyr, and retain the original data structure
The mutate() function is another dplyr verb that allows you to create new variables or modify existing ones based on existing variables. Unlike summarise(), which reduces multiple values to a single summary, mutate() preserves the original number of rows and adds new columns to the data frame.
One use case where you might want to use mutate() instead of summarise() is when you want to create a new variable that is a function of existing variables and keep the original data frame intact. For example, suppose we want to create a new variable called hp_per_cyl, which is the ratio of horsepower to cylinders for each car. We can do this using dplyr as follows:
%>%
mtcars mutate(hp_per_cyl = hp / cyl)
The output is a new data frame with 12 columns: the original 11 columns from mtcars plus the new column hp_per_cyl:
mpg cyl disp hp drat wt qsec vs am gear carb hp_per_cyl
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 18.33333
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 18.33333
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 23.25000
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 18.33333
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 21.87500
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 17.50000