Linear regression tools in R

written in statistics, r, regression, programming tips

Choosing the right linear regression model for your data can be an overwhelming venture, especially when you have a large number of available predictors. Luckily R has a wide array of in-built and user-written tools to make this process easier. In this week’s blog post I will describe some of the tools I commonly use.

For illustration I will use the mtcars dataset. The first step as always is to load in the data.

rm(list = ls())
data(mtcars)

I will also do a little bit of data cleaning by creating labelled factor variables for the categorical variables.

mtcars$am.f <- as.factor(mtcars$am); levels(mtcars$am.f) <- c("Automatic", "Manual")
mtcars$cyl.f <- as.factor(mtcars$cyl); levels(mtcars$cyl.f) <- c("4 cyl", "6 cyl", "8 cyl")
mtcars$vs.f <- as.factor(mtcars$vs); levels(mtcars$vs.f) <- c("V engine", "Straight engine")
mtcars$gear.f <- as.factor(mtcars$gear); levels(mtcars$gear.f) <- c("3 gears", "4 gears", "5 gears")
mtcars$carb.f <- as.factor(mtcars$carb)

We’re now ready to go through our model building tools. Similarly to last week’s blog post, I want to examine how well transmission type (am) predicts a car’s miles per gallon (mpg), taking into account other covariates and their interactions as appropriate.

Normality, linearity and multicollinearity

The first step to building the model is checking whether the data meet the assumptions of linear regression. A really neat way to simultaneously check the normality of the outcome, the linearity of the relationships between the outcome and the predictors and the intercorrelations between the predictors is the ggpairs plot in the very handy GGally R package. Before we run the ggpairs plot, I’ll rearrange the dataframe so the columns are in a more useful order for scanning the plot (with mpg and am as the first columns).

library(ggplot2); library(GGally)
mtcars <- mtcars[ , c(1, 9, 2:8, 10:16)]
g = ggpairs(mtcars[ , 1:11], lower = list(continuous = "smooth", params = c(method = "loess")))
g

plot of chunk ggpairs_plot

Note that this plot doesn’t work with factor variables, and the only categorical variables that should be included are either binary or ordinal. We can see that mpg is roughly normal (albeit a little positively skewed), and that the continuous variables and ordinal variables have linear relationship with mpg. We can also see there are a few very high intercorrelations between the potential predictor variables, but it is a little hard to pick these out in the volume of information.

A quick and easy alternative for finding collinear pairs is using the spec.cor function written by Joshua Wiley. This handy little function allows you to set a cutoff correlation level (in this case, 0.8), and it will return all pairs that correlate at or above that level.

spec.cor <- function (dat, r, ...) { 
    x <- cor(dat, ...) 
    x[upper.tri(x, TRUE)] <- NA 
    i <- which(abs(x) >= r, arr.ind = TRUE) 
    data.frame(matrix(colnames(x)[as.vector(i)], ncol = 2), value = x[i]) 
} 

spec.cor(mtcars[ , 2:11], 0.8)
##     X1   X2      value
## 1 disp  cyl  0.9020329
## 2   hp  cyl  0.8324475
## 3   vs  cyl -0.8108118
## 4   wt disp  0.8879799

In this case we can see we have four collinear pairs of predictors. I’ve decided I will keep wt and discard disp, hp and vs. My choice was based on simple predictor reduction as I don’t know anything about the content area and variables, but I might have made a different decision (e.g., keeping hp and vs) if one of those variables was particularly important or interpretable.

Predictor selection

The next step is selecting the predictors to include in the model alongside am. As I don’t know anything about the predictors, I will select to enter them into the model based purely on their relationship with the outcome (with higher correlations meaning they will be entered sooner). I wrote a function below which correlates each predictor with the outcome and ranks them (as absolute values) in descending order.

cor.list <- c()
outcome.cor <- function(predictor, ...) {
    x <- cor(mtcars$mpg, predictor)
    cor.list <- c(cor.list, x)
}
cor.list <- sapply(mtcars[ , c(3, 6:8, 10:11)], outcome.cor)
sort(abs(cor.list), decreasing = TRUE)
##        wt       cyl      drat      carb      gear      qsec 
## 0.8676594 0.8521620 0.6811719 0.5509251 0.4802848 0.4186840

You can see the predictor that has the strongest bivariate relationship with mpg is wt, then cyl, etc. We will use this order to enter variables into our model on top of am. One way of working out if adding a new variable improves the fit of the model is comparing models using the anova function. This function compares two nested models and returns the F-change and its associated significance level when adding the new variable(s).

When building the nested models, I will add one main effect at a time, following the bivariate relationship strength between each predictor and the outcome. I have written a function below which tests pairs of nested models and stores the two models and the significance of the F-change in a dataframe to make it easy to check whether a change improves the model fit. You can see I have run the models one by one and checked them, and only retained variables that improved model fit.

lmfits <- data.frame()
lmfit.table <- function(model1, model2, ...) {
    models <- sub("Model 1: ", "", attr(anova(model1, model2), "heading")[2])
    x <- c(sub("\\n.*", "", models),
           sub(".*Model 2: ", "", models), 
           round(anova(model1, model2)$"Pr(>F)"[2], 3))
    lmfits <- rbind(lmfits, x)
}

model1 <- lm(mpg ~ am.f, data = mtcars)
model2 <- lm(mpg ~ am.f + wt, data = mtcars)

lmfits <- lmfit.table(model1, model2)
for (i in 1:3) {
    lmfits[ , i] <- as.character(lmfits[ , i])
}
names(lmfits) <- c("Model 1", "Model 2", "p-value of model improvement")

model3 <- lm(mpg ~ am.f + wt + cyl.f, data = mtcars)
lmfits <- lmfit.table(model2, model3)

model4 <- lm(mpg ~ am.f + wt + cyl.f + drat, data = mtcars)
lmfits <- lmfit.table(model3, model4)

model5 <- lm(mpg ~ am.f + wt + cyl.f + carb.f, data = mtcars)
lmfits <- lmfit.table(model3, model5)

model6 <- lm(mpg ~ am.f + wt + cyl.f + gear.f, data = mtcars)
lmfits <- lmfit.table(model3, model6)

model7 <- lm(mpg ~ am.f + wt + cyl.f + qsec, data = mtcars)
lmfits <- lmfit.table(model3, model7)

require(knitr)
lmfits
Model 1 Model 2 p-value of model improvement
mpg ~ am.f mpg ~ am.f + wt 0
mpg ~ am.f + wt mpg ~ am.f + wt + cyl.f 0.003
mpg ~ am.f + wt + cyl.f mpg ~ am.f + wt + cyl.f + drat 0.861
mpg ~ am.f + wt + cyl.f mpg ~ am.f + wt + cyl.f + carb.f 0.764
mpg ~ am.f + wt + cyl.f mpg ~ am.f + wt + cyl.f + gear.f 0.564
mpg ~ am.f + wt + cyl.f mpg ~ am.f + wt + cyl.f + qsec 0.061

There were two variables that improved model fit in addition to am, which were wt and cyl. I will now check whether adding interaction terms between these variables and these variables improves model fit:

model8 <- lm(mpg ~ am.f + wt + cyl.f + am.f * wt, data = mtcars)
lmfits <- lmfit.table(model3, model8)

model9 <- lm(mpg ~ am.f + wt + cyl.f + am.f * wt + am.f * cyl.f, data = mtcars)
lmfits <- lmfit.table(model8, model9)

model10 <- lm(mpg ~ am.f + wt + am.f * wt, data = mtcars)
lmfits <- lmfit.table(model2, model10)

lmfits[7:9, ]
Model 1 Model 2 p-value of model improvement
mpg ~ am.f + wt + cyl.f mpg ~ am.f + wt + cyl.f + am.f * wt 0.007
mpg ~ am.f + wt + cyl.f + am.f * wt mpg ~ am.f + wt + cyl.f + am.f * wt + am.f * cyl.f 0.802
mpg ~ am.f + wt mpg ~ am.f + wt + am.f * wt 0.001

We now have two viable models, model8 and model10. To select between these, I will have a look at the $R^2$ and variance inflation factor (VIF) (in the car package) of each of the models.

require(car)
round(summary(model10)$r.squared, 3)
## [1] 0.833
vif(model10)
##      am.f        wt   am.f:wt 
## 20.901259  2.728248 15.366853
round(summary(model8)$r.squared, 3)
## [1] 0.877
vif(model8)[ , 1]
##      am.f        wt     cyl.f   am.f:wt 
## 24.302147  3.983261  3.060691 18.189413

The difference in $R^2$ between the two models is small, but the inclusion of cyl in the model both increases the variance inflation and decreases the interpretability of the model. Moreover, cyl is highly correlated with wt (0.782), meaning it is likely explaining a lot of the same variance as wt. As such, I dropped cyl and the final model included am, wt, and their interaction term.

Model diagnostics

Having chosen the final model, it is time to check whether it has any issues with how it fits the data. The built-in plot function conveniently displays four diagnostic plots for lm objects:

final.model <- lm(mpg ~ am.f + wt + am.f * wt, data = mtcars)
par(mfrow = c(2,2))
plot(final.model)

plot of chunk diagnostic_plots

The Residuals vs Fitted plot (and its standardised version, the Scale Location plot) show that higher MPG values tend to have higher residuals. In addition, there are three values with unusually high residual error (Merc 240DD, Fiat128 and Toyota Corolla), indicating that the model is a poor fit for both cars with high MPG (past about 28 MPG) and these three car types. The Normal Q-Q plot of residuals indicates that the errors are not normally distributed, again especially for high levels of MPG and the three specific car models that had high residuals. Finally, the Residuals vs Leverage plot demonstrates that are a couple of values with high leverage and low residuals, which may possibly be biasing the trend line.

To more closely examine the effect of variables with high leverage and/or influence on the regression line we can extract the hatvalues and the dfbeta values of the model respectively. First I will show the top 6 hatvalues:

mtcars$name <- row.names(mtcars)
mtcars$hatvalues <- round(hatvalues(final.model), 3)
top.leverage = sort(round(hatvalues(final.model), 3), decreasing = TRUE)
head(top.leverage)
##       Maserati Bora Lincoln Continental   Chrysler Imperial 
##               0.371               0.304               0.281 
##  Cadillac Fleetwood        Lotus Europa         Honda Civic 
##               0.254               0.253               0.216

The Maserati Bora appears to have a substantially higher leverage than the others in the list. Ok, so now let’s look at the dfbetas:

mtcars$dfbetas <- round(dfbetas(final.model)[, 2], 3)
top.influence = sort(round(abs(dfbetas(final.model)[, 2]), 3), decreasing = TRUE)
head(top.influence)
## Chrysler Imperial    Toyota Corolla      Lotus Europa         Merc 240D 
##             0.584             0.469             0.372             0.349 
##          Fiat 128          Merc 230 
##             0.327             0.231

Despite its high leverage, the Maserati Bora does not have high influence. Instead, the Chrysler Imperial and the Toyota Corolla are much higher than the other models. So how do we actually tell if these values are biasing the fit of the regression line? I wrote two ggplots which label the data points by name with the highest leverage and influence values. I chose the three highest, but you can easily change the code to give you whatever number you want.

# First build a plot of the model.
library(ggplot2)
gp <- ggplot(data=mtcars, aes(x=wt, y=mpg, colour=am.f)) + 
        geom_point(alpha = 0.7) +
        geom_abline(intercept = coef(final.model)[1], slope = coef(final.model)[3], 
                    size = 1, color = "#FF3721") +
        geom_abline(intercept = coef(final.model)[1] + coef(final.model)[2], 
                    slope = coef(final.model)[3] + coef(final.model)[4], 
                    size = 1, color = "#4271AE") +
        scale_colour_manual(name="Transmission", values =c("#FF3721", "#4271AE")) +
        ylab("Miles per gallon") +    
        xlab("Weight (`000 lbs)") +
        theme_bw()

# Leverage plot
g1 <- gp + geom_text(data=subset(mtcars, abs(hatvalues) >= top.leverage[3]), 
                     aes(wt,mpg,label=name), size = 4, hjust=1, vjust=0) +
        ggtitle("Three Points with Highest Leverage (hatvalues)")

# Influence plot
g2 <- gp + geom_text(data=subset(mtcars, abs(dfbetas) == top.influence[1]), 
                     aes(wt,mpg,label=name), size = 4, hjust = 1, vjust = 0) +
        geom_text(data=subset(mtcars, abs(dfbetas) == top.influence[2] | 
                                  abs(dfbetas) == top.influence[3]), 
                     aes(wt,mpg,label=name), size = 4, hjust = 0, vjust = 0) +
        ggtitle("Three Points with Highest Influence (dfbetas)")

library(gridExtra)
grid.arrange(g1, g2, nrow = 1, ncol = 2)

plot of chunk outlier_plots

You can see that none of the high leverage nor influence points seem to be biasing the fit of the regression lines.

I hope this post has given you some useful tools to streamline your linear regression model-building process. I hope you can also see that while all of the decisions I made were data-driven, several were ultimately subjective, and there were a few alternative models that may have worked equally well. This highlights the fact that the “right” model is dependent on what you’re going to use it for, the overall interpretability, and what the content expert tells you is important.

plot of chunk interpretation

(Image via Research Wahlberg)

I have taken some of the information and code used in this post from the excellent Regression Models unit on Coursera by Brian Caffo, Jeff Leek and Roger Peng, and the model used was derived from work I did for the class assignment for that course.

Finally, the full code used to create the figures in this post is located in this gist on my Github page.