Group operations in R
One of the things I really missed when I moved from Stata to R was how easy it was to do group operations; that is, being able to apply summary statistics by levels of a variable or variables in a dataset. Fortunately for me, I just needed a bit of time exploring R’s massive number of functions in order to find that not only does R have as good functionality for producing statistics by groups, it offers far more flexibility and customisation than Stata. In this week’s blog post, I will describe 4 functions that have become my grouping workhorses in R.
The data
For this post, I will use Hosmer and Lemeshow’s birthwt
dataset in the MASS
package, which describes risk factors for low birthweight in 189 infants (see help(birthwt)
for more details about these data). In order to make the categories clearer when performing the group operations, I’ll convert all of the indicator variables into labelled factors.
data(birthwt, package="MASS")
# Turn indicator variables into labelled factors
birthwt$low <- factor(birthwt$low, labels = c("Above 2.5kgs", "Below 2.5kgs"))
birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-smoker", "Smoker"))
birthwt$ht <- factor(birthwt$ht, labels = c("No HT", "HT"))
birthwt$ui <- factor(birthwt$ui, labels = c("No UI", "UI"))
Now we’re ready to get started.
table and prop.table
One of the most basic group operations are frequency tables, which give us the count of the number of observations in a group. This can be achieved using the table
function, which we’ll use to look at the frequency of low
below.
table(birthwt$low)
##
## Above 2.5kgs Below 2.5kgs
## 130 59
table
also allows us to do a contingency table, which is the crosstab between two categorical variables. Let’s have a look at the contingency table of low
by smoke
, or the number of babies with low birthweight by their mother’s smoking status.
table(birthwt$low, birthwt$smoke)
##
## Non-smoker Smoker
## Above 2.5kgs 86 44
## Below 2.5kgs 29 30
Often, the proportion of observations is more useful than the raw counts. We can easily convert the values in table
to proportions by encasing it in the function prop.table
. As a default, prop.table
will give you the percentage of the total. However, by entering a second argument we can specify if we want the proportions calculated along the rows (1
) or the columns (2
) instead. Let’s generate the row proportions for our low
by smoke
contingency table:
prop.table(table(birthwt$low, birthwt$smoke), 1)
##
## Non-smoker Smoker
## Above 2.5kgs 0.6615385 0.3384615
## Below 2.5kgs 0.4915254 0.5084746
We can also join tables with the raw count and the proportion using cbind
. In the second example I have rounded off the results of prop.table
using the round
function to make the table a little neater.
cbind(table(birthwt$low, birthwt$smoke), prop.table(table(birthwt$low, birthwt$smoke), 1))
## Non-smoker Smoker Non-smoker Smoker
## Above 2.5kgs 86 44 0.6615385 0.3384615
## Below 2.5kgs 29 30 0.4915254 0.5084746
cbind(table(birthwt$low, birthwt$smoke), round(prop.table(table(birthwt$low, birthwt$smoke), 1), 2))
## Non-smoker Smoker Non-smoker Smoker
## Above 2.5kgs 86 44 0.66 0.34
## Below 2.5kgs 29 30 0.49 0.51
aggregate
table
is great for counts, but what if you want to do something else? aggregate
is a neat way of extending this functionality, allowing you apply a variety of different statistics by groups. Let’s calculate the mean birthweight by smoking status.
aggregate(bwt ~ smoke, data = birthwt, FUN = mean)
## smoke bwt
## 1 Non-smoker 3055.696
## 2 Smoker 2771.919
We can also group by multiple variables. Here I have added hypertension status (ht
) as an additional grouping variable.
aggregate(bwt ~ smoke + ht, data = birthwt, FUN = mean)
## smoke ht bwt
## 1 Non-smoker No HT 3090.444
## 2 Smoker No HT 2787.203
## 3 Non-smoker HT 2519.571
## 4 Smoker HT 2561.000
By using anonymous functions, you can also generate the exact statistic you apply to your data. Anonymous functions are functions that are executed “on the fly” within R, and don’t need the formal set up of being named and called separately. Here I have used an anonymous function to generate the 25th, 50th and 75th percentiles of birthweight by smoking status.
aggregate(bwt ~ smoke, data = birthwt, FUN = function(x) quantile(x, c(.25, .5, .75)))
## smoke bwt.25% bwt.50% bwt.75%
## 1 Non-smoker 2509.00 3100.00 3621.50
## 2 Smoker 2370.50 2775.50 3245.75
If you want to get fancy, you can also nest aggregate
functions within each other, similar to the way you’d use nested SQL queries. Let’s first try getting the mean birthweight by maternal age (age
) and smoking status, and then calculating the minimum of these birthweights by smoking status.
aggregate(bwt ~ smoke,
data = aggregate(bwt ~ age + smoke, data = birthwt, FUN = mean),
FUN = min)
## smoke bwt
## 1 Non-smoker 1887.5
## 2 Smoker 1135.0
You can see I’ve replaced the data
argument with an aggregate function which calculates the sample I want to run the second aggregate function over.
summaryBy
A couple of limitations of aggregate
are that 1) it cannot do group operations over multiple variables, and 2) it cannot calculate multiple summary statistics unless they can be included in the same anonymous function.
An alternative function that can do both of these things is summaryBy
in the doBy
package. For Stata users, this is pretty much the R equivalent of tabstat
but with a little more flexibility. Let’s load in the package, and then calculate the mean of birthweight and maternal age by smoking status:
library(doBy)
summaryBy(bwt + age ~ smoke, data = birthwt, FUN = mean)
## smoke bwt.mean age.mean
## 1 Non-smoker 3055.696 23.42609
## 2 Smoker 2771.919 22.94595
We can also calculate a range of statistics using summaryBy
. Let’s calculate the minimum, maximum and median of birthweight by smoking status:
summaryBy(bwt ~ smoke, data = birthwt, FUN = c(min, max, median))
## smoke bwt.min bwt.max bwt.median
## 1 Non-smoker 1021 4990 3100.0
## 2 Smoker 709 4238 2775.5
Finally, we can combine anonymous functions with regular functions in summaryBy
, which gives you a lot of power to get the custom summary statistics you want. Here we’ll combine min, max, median and the 25th and 75th percentiles:
summaryBy(bwt ~ smoke, data = birthwt,
FUN = c(min, max, median, function(x) quantile(x, c(.25, .75))))
## smoke bwt.FUN1 bwt.FUN2 bwt.FUN3 bwt.FUN4 bwt.FUN5
## 1 Non-smoker 1021 4990 3100.0 2509.0 3621.50
## 2 Smoker 709 4238 2775.5 2370.5 3245.75
You can see that an unfortunate side-effect of introducing anonymous functions into summaryBy
is that you lose the meaningful column headings you get when using built-in functions. However, you can easily correct this by adding the names of each of the functions to the argument fun.names
.
summaryBy(bwt ~ smoke, data = birthwt,
FUN = c(min, max, median, function(x) quantile(x, c(.25, .75))),
fun.names = c("min", "max", "median", "25%", "75%"))
## smoke bwt.min bwt.max bwt.median bwt.25% bwt.75%
## 1 Non-smoker 1021 4990 3100.0 2509.0 3621.50
## 2 Smoker 709 4238 2775.5 2370.5 3245.75
ddply
The final function for performing group operations is the ddply
command from the plyr
package. plyr
became well known a few years ago as a package that simplified data wrangling in R, and this reputation is definitely well-deserved.
Let’s start by running our usual mean birthweight by smoking status operation in ddply
. The first argument indicates the data to be used, the second is the grouping variable(s), the third tells ddply we are going to run a summary operation, and the final assigns the mean birthweight to a variable.
library(plyr)
ddply(birthwt, "smoke", summarise,
mean.bwt = mean(bwt))
## smoke mean.bwt
## 1 Non-smoker 3055.696
## 2 Smoker 2771.919
You can see we get the exact same results as when using aggregate
and summaryBy
above.
You might have noticed that allowing us to create a specific variable for the mean of birthweight gives us a lot more flexibility when using aggregate
or summaryBy
: instead of running every statistic for every variable we aggregate, we can apply specific statistics to specific variables. Let’s have a further look at this, by calculating the the mean of birthweight and the median of age by smoking status:
ddply(birthwt, "smoke", summarise,
mean.bwt = mean(bwt),
median.age = median(age))
## smoke mean.bwt median.age
## 1 Non-smoker 3055.696 23
## 2 Smoker 2771.919 22
This also gives us the even more flexibility to calculate custom functions that we have when using anonymous functions in aggregate
and summaryBy
. This time we will count the number of observations with a low birthweight using subsetting:
ddply(birthwt, "smoke", summarise,
sum.low = length(low[low == "Below 2.5kgs"]),
mean.bwt = mean(bwt),
median.age = median(age))
## smoke sum.low mean.bwt median.age
## 1 Non-smoker 29 3055.696 23
## 2 Smoker 30 2771.919 22
ddply
can also group by multiple variables. Here we will group by both smoke
and ht
:
ddply(birthwt, c("smoke", "ht"), summarise,
sum.low = length(low[low == "Below 2.5kgs"]),
mean.bwt = mean(bwt),
median.age = median(age))
## smoke ht sum.low mean.bwt median.age
## 1 Non-smoker No HT 25 3090.444 23
## 2 Non-smoker HT 4 2519.571 24
## 3 Smoker No HT 27 2787.203 22
## 4 Smoker HT 3 2561.000 21
Using grouping results with data.frames
In order to extend the utility of group operations even further, you can assign each of these outputs to a new data.frame and merge it with other data.frames. Let’s say we want to merge the mean of birthweight by smoking status into the original birthwt
data. For each of the three functions (aggregate
, summaryBy
and ddply
), we need to assign the output to a new object. Then (as it is already a data.frame), we simply use the merge
function to join them together.
Let’s try this out using ddply
. You can see that we merge the two data.frames on the grouping variable (smoke
) as it is the constant between the two data.frames.
agg.birthwt <- ddply(birthwt, "smoke", summarise,
mean.bwt = mean(bwt))
birthwt <- merge(birthwt, agg.birthwt, by = "smoke")
Let’s have a look to see if it worked:
head(birthwt[birthwt$smoke == "Non-smoker",])
## smoke low age lwt race ptl ht ui ftv bwt mean.bwt
## 1 Non-smoker Above 2.5kgs 19 182 Black 0 No HT UI 0 2523 3055.696
## 2 Non-smoker Above 2.5kgs 33 155 Other 0 No HT No UI 3 2551 3055.696
## 3 Non-smoker Above 2.5kgs 23 130 Black 0 No HT No UI 1 3062 3055.696
## 4 Non-smoker Above 2.5kgs 21 160 White 0 No HT No UI 0 3062 3055.696
## 5 Non-smoker Above 2.5kgs 23 123 Other 0 No HT No UI 0 3544 3055.696
## 6 Non-smoker Above 2.5kgs 21 124 Other 0 No HT No UI 0 2622 3055.696
head(birthwt[birthwt$smoke == "Smoker",])
## smoke low age lwt race ptl ht ui ftv bwt mean.bwt
## 116 Smoker Above 2.5kgs 22 130 White 0 No HT No UI 0 3132 2771.919
## 117 Smoker Above 2.5kgs 23 115 Other 0 No HT No UI 1 3331 2771.919
## 118 Smoker Above 2.5kgs 29 130 White 0 No HT No UI 2 3884 2771.919
## 119 Smoker Above 2.5kgs 26 133 Other 2 No HT No UI 0 3260 2771.919
## 120 Smoker Above 2.5kgs 18 90 White 0 No HT UI 0 3062 2771.919
## 121 Smoker Above 2.5kgs 27 124 White 0 No HT No UI 0 2922 2771.919
And that’s it! You can see that each of these group operations have their own strengths and benefits, ranging from when you need a quick and easy overview of one summary statistic by one variable (table
, aggregate
) to more complex and tailored aggregation of the data (summaryBy
, ddply
). I also hope this has helped demystify data screening and wrangling, which has a reputation for being a bit of a pain in R.