You can use the following syntax to calculate summary statistics for all numeric variables in a data frame in R using functions from the dplyr and tidyr packages:
library(dplyr)
library(tidyr)

df %>%
  summarise(across(where(is.numeric),
                   .fns = list(min = min,
                               median = median,
                               mean = mean,
                               stdev = sd,
                               q25 = ~quantile(., 0.25),
                               q75 = ~quantile(., 0.75),
                               max = max))) %>%
  pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))
The summarise() function comes from the dplyr package and is used to calculate summary statistics for variables.
The pivot_longer() function comes from the tidyr package and is used to format the output to make it easier to read.
This particular syntax calculates the following summary statistics for each numeric variable in a data frame:
- Minimum value
- Median value
- Mean value
- Standard deviation
- 25th percentile
- 75th percentile
- Maximum value
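To see why the reshaping step is needed, it can help to look at the intermediate result. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `x` and `y`) and only two statistics. When across() is given a named list of functions, it names each output column `<column>_<function>` by default, and pivot_longer() then splits those names apart at the underscore.

```r
library(dplyr)
library(tidyr)

#hypothetical toy data, used only for illustration
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 60))

#step 1: summarise() alone returns a single wide row
wide <- toy %>%
  summarise(across(where(is.numeric), list(min = min, mean = mean)))

#column names combine the variable and the statistic
names(wide)
# "x_min"  "x_mean" "y_min"  "y_mean"

#step 2: split each name at '_' so there is one row per variable
long <- wide %>%
  pivot_longer(everything(),
               names_sep = '_',
               names_to = c('variable', '.value'))

long
```

The special value '.value' in names_to tells pivot_longer() to use the second piece of each name (the statistic) as a column name in the output instead of storing it in a column of its own.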
The following example shows how to use this syntax in practice.
Example: Calculate Summary Statistics in R Using dplyr
Suppose we have the following data frame in R that contains information about various basketball players:
#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(12, 15, 19, 14, 24, 25, 39, 34),
                 assists=c(6, 8, 8, 9, 12, 6, 8, 10),
                 rebounds=c(9, 9, 8, 10, 8, 4, 3, 3))

#view data frame
df

  team points assists rebounds
1    A     12       6        9
2    A     15       8        9
3    A     19       8        8
4    A     14       9       10
5    B     24      12        8
6    B     25       6        4
7    B     39       8        3
8    B     34      10        3
We can use the following syntax to calculate summary statistics for each numeric variable in the data frame:
library(dplyr)
library(tidyr)

#calculate summary statistics for each numeric variable in data frame
df %>%
  summarise(across(where(is.numeric),
                   .fns = list(min = min,
                               median = median,
                               mean = mean,
                               stdev = sd,
                               q25 = ~quantile(., 0.25),
                               q75 = ~quantile(., 0.75),
                               max = max))) %>%
  pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))

# A tibble: 3 x 8
  variable   min median  mean stdev   q25   q75   max
1 points      12   21.5 22.8   9.74 14.8  27.2     39
2 assists      6    8    8.38  2.00  7.5   9.25    12
3 rebounds     3    8    6.75  2.92  3.75  9       10
From the output we can see:
- The minimum value in the points column is 12.
- The median value in the points column is 21.5.
- The mean value in the points column is 22.8.
And so on.
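As a quick sanity check, the values for the points column can be reproduced with the corresponding base R functions. This is only a sketch for verification; the vector below simply copies the points values from the example data frame.

```r
#points column copied from the example data frame
points <- c(12, 15, 19, 14, 24, 25, 39, 34)

min(points)             # 12
median(points)          # 21.5
mean(points)            # 22.75 (the tibble prints this as 22.8)
sd(points)              # about 9.74
quantile(points, 0.25)  # 14.75
quantile(points, 0.75)  # 27.25
max(points)             # 39
```

The tibble output shows values rounded to three significant digits, which is why 22.75 appears as 22.8 and 14.75 appears as 14.8.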
Note: This example uses the across() function from dplyr. The complete documentation for this function is available in the official dplyr reference.
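One caveat: splitting on names_sep='_' assumes that none of the summarised column names contain an underscore themselves. The sketch below shows one possible workaround, using a hypothetical data frame (`stats`) with a column named `free_throws`. It sets the .names argument of across() so that the variable and the statistic are joined by a double underscore (an arbitrary choice of separator), and then splits on that instead.

```r
library(dplyr)
library(tidyr)

#hypothetical data with an underscore in one column name
stats <- data.frame(points = c(12, 15, 19),
                    free_throws = c(2, 4, 9))

res <- stats %>%
  #join column name and statistic with '__' instead of the default '_'
  summarise(across(where(is.numeric),
                   .fns = list(min = min, mean = mean),
                   .names = '{.col}__{.fn}')) %>%
  #split on the same separator when reshaping
  pivot_longer(everything(),
               names_sep = '__',
               names_to = c('variable', '.value'))

res
```

With this change the free_throws column keeps its full name in the variable column. Any separator that does not occur in the column names would work equally well.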
Additional Resources
The following tutorials explain how to perform other common tasks using dplyr:
How to Summarise Data But Keep All Columns Using dplyr
How to Summarise Multiple Columns Using dplyr
How to Calculate Standard Deviation Using dplyr