This is a report of the analysis of modeling the variation in the values of mileage per gallon (mpg) in terms of the variables in the mtcars data set. The following are the findings.
Disregarding the effect of other variables, manual transmission gives better mileage performance by 7.24 mpg over automatic transmission. However, in the presence of other variables, this difference isnot extremely large, suggesting that transmission type is a confounding variable.
A very good and parsimonious multivariate linear model explains mpg in terms of the weight (wt), number of cylinders (cyl), horsepower (hp), and type of transmission (am) of a vehicle.
The source for this project can be found at https://github.com/josephuses/coursera_regression_project.
library(tidyverse)
library(ggfortify)
library(GGally)Let us first conduct some exploratory data analysis to familiarize ourselves with the mtcars data and look at the behavior of some variables, specially am.
# dataset
data(mtcars)
# Mean and five number summaries of mpg variable:
summary(mtcars$mpg)#:> Min. 1st Qu. Median Mean 3rd Qu. Max.
#:> 10.400000 15.425000 19.200000 20.090625 22.800000 33.900000
# Convert am, cyl, gear, and vs as factor variables
mtcarsf <- mutate(mtcars, am = factor(am, labels = c("automatic", "manual")),
cyl = factor(cyl), gear = factor(gear), vs = factor(vs, levels = c(0, 1),
labels = c("V", "S")))The boxplot of mpg according to am clearly shows that manual type transmission outperforms automatic transmissions in mileage. We shall now test this hypothesis by building linear models explaining the variation in the values of mpg.
The boxplot in Appendix 4.1 suggests that manual has better milleage per gallon than automatic. Although we can conduct a t test to determine if this difference did not happen by chance alone, we use linear models and regression to explain this difference.
mod1 <- lm(mpg ~ am, data = mtcarsf)
summary(mod1) %>% broom::tidy()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 17.14736842105 | 1.12460254124 | 15.2474921514 | 0.000000000000 |
| ammanual | 7.24493927126 | 1.76442163164 | 4.1061269831 | 0.000285020744 |
summary(mod1) %>% broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.359798943425 | 0.338458908206 | 4.90202882893 | 16.8602788013 | 0.000285020744 | 2 |
We have quantified this difference to be an increase of 7.25 miles per gallon of manual over automatic transmission vehicles, and this is more extreme to be explained by chance occurrence alone (\(p=2.85020743935e-04\)) at a significance level of 0.05. However, the R-sqaured of 0.359798943425 shows that this univariate model explains only 35.98% of the variation in mpg. We can look at other multivariate linear models in order to see if we can improve the model fit. We can try explaining the variations in mpg with all of the other variables in mtcars but we can speed up our analysis by including only those variables that have a correlation higher than that of mpg and am.
cors <- cor(mtcars$mpg, mtcars)
cors[, order(-abs(cors[1, ]))] %>% as_tibble()| value | |
|---|---|
| mpg | 1.000000000000 |
| wt | -0.867659376517 |
| cyl | -0.852161959427 |
| disp | -0.847551379262 |
| hp | -0.776168371827 |
| drat | 0.681171907807 |
| vs | 0.664038919128 |
| am | 0.599832429455 |
| carb | -0.550925073902 |
| gear | 0.480284757339 |
| qsec | 0.418684033922 |
submtcars <- mtcarsf %>% select(mpg, wt, cyl, disp, hp, drat, vs, am)
mod2 <- lm(mpg ~ wt + cyl + disp + hp + drat + vs + am, data = mtcarsf)
summary(mod2) %>% broom::tidy()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 29.829969134145 | 6.744467882345 | 4.422879559147 | 0.000196207375 |
| wt | -2.594622673563 | 1.201295382097 | -2.159854031099 | 0.041448570735 |
| cyl6 | -2.055523435075 | 1.803107893903 | -1.139989149859 | 0.266023824613 |
| cyl8 | -0.023304442903 | 3.816510169477 | -0.006106217950 | 0.995180628128 |
| disp | 0.004360162924 | 0.013036105045 | 0.334468225684 | 0.741057132785 |
| hp | -0.035794755877 | 0.014634234706 | -2.445960215557 | 0.022513820233 |
| drat | 0.388141032956 | 1.466060243736 | 0.264751080056 | 0.793559398226 |
| vsS | 2.004897599760 | 1.829948485451 | 1.095603300148 | 0.284592667270 |
| ammanual | 2.558988827793 | 1.743021265173 | 1.468134026202 | 0.155611776081 |
summary(mod2) %>% broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.873313066274 | 0.829248045848 | 2.49046412136 | 19.8187373528 | 1.3507e-08 | 9 |
mod2 explains 87.33% of the variance in mpg. In this model, the weight (wt) and horsepower (hp) are the only ones showing a significant effect on the variation in mpg.
Looking at the diagnostic plots (Appendix 4.2), the Normal Q-Q plot and Residuals vs Fitted are somewhat okay for mod2, however the Scale-Location plot is showing some causes for worry of violations of homoscedasticity. Overall, mod2 seems to be a good model.
We now see if a bidirectional stepwise selection will yield a better model.
mod3 <- step(mod2, direction = "both", trace = FALSE)
summary(mod3) %>% broom::tidy()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 33.708323901280 | 2.604886184504 | 12.940421006417 | 0.000000000001 |
| wt | -2.496829420354 | 0.885587793485 | -2.819403608228 | 0.009081407558 |
| cyl6 | -3.031344490503 | 1.407283510716 | -2.154039656843 | 0.040682717936 |
| cyl8 | -2.163675323179 | 2.284251724546 | -0.947214048228 | 0.352250869148 |
| hp | -0.032109429991 | 0.013692574260 | -2.345025075769 | 0.026934605236 |
| ammanual | 1.809211382941 | 1.396304503014 | 1.295714064544 | 0.206459673770 |
summary(mod3) %>% broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.865879872487 | 0.840087540273 | 2.41011962976 | 33.5712127658 | 1.51e-10 | 6 |
mod3 explains about 86.59% of the the variation in mpg but is more parsimonious than mod2. In this model, apart from the type of transmission, horsepower, the type of cylinder and weight of the vehicle are used to explain the variation in the milleage of the vehicles. The Normal Q-Q plot and the Residuals vs Fitted plot (Appendix 4.3) look okay, but the Scale-Location plot still show some signs of violations of equality of variance. We will see if adding disp, drat and vs (mod2) significantly improves the model fit.
anova(mod3, mod2)| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 26 | 151.025592373 | NA | NA | NA | NA |
| 23 | 142.655465415 | 3 | 8.37012695887 | 0.449831860033 | 0.719832917309 |
From the results, we see that mod2 is not a great improvement over mod3. We can therefore choose mod3 to explain the variation in the mileage per gallon of the vehicles.
Using mod3 to explain variation in the mpg, we can say that when all other variables are held constant:
With the foregoing analyses, we therefore conclude that:
Disregarding the effect of other variables, manual transmission gives better mileage performance by 7.24 mpg over automatic transmission. However, in the presence of other variables, this difference is not extremely large, suggesting that transmission type is a confounding variable.
A very good and parsimonious multivariate linear model explains mpg in terms of the weight (wt),number of cylinders (cyl), horsepower (hp), and type of transmission (am) of a vehicle.
The results of linear regression and linear models are easy to explain and for this reason are recommended for use as initial models in many modeling problems. However, in explaining the variation in milliage per gallon of cars, we saw from the diagnostic plots that the variation may be explained by nonlinear models better. Further investigation should be conducted along this line.
This part was not submitted to Coursera.
In hindsight, we might want to be looking at all other variables in the mtcars data set. What will happen if we do stepwise selection using the whole set of variables?
mod4 <- lm(mpg ~ ., data = mtcarsf)
mod5 <- step(mod4, direction = "both", trace = FALSE)
summary(mod5)#:>
#:> Call:
#:> lm(formula = mpg ~ wt + qsec + am, data = mtcarsf)
#:>
#:> Residuals:
#:> Min 1Q Median 3Q Max
#:> -3.481066996 -1.555525403 -0.725730867 1.411012209 4.660998270
#:>
#:> Coefficients:
#:> Estimate Std. Error t value Pr(>|t|)
#:> (Intercept) 9.617780515 6.959592985 1.38195 0.17791517
#:> wt -3.916503725 0.711201635 -5.50688 6.9527e-06
#:> qsec 1.225885972 0.288669554 4.24668 0.00021617
#:> ammanual 2.935837192 1.410904515 2.08082 0.04671551
#:>
#:> Residual standard error: 2.45884649 on 28 degrees of freedom
#:> Multiple R-squared: 0.849663556, Adjusted R-squared: 0.83355608
#:> F-statistic: 52.7496394 on 3 and 28 DF, p-value: 1.21044643e-11
The result is that weight, 1/4 mile time (qsec), and type of transmission are now the chosen predictors, with each being significant predictors of mpg. This model has a very good fit too with R-squared of 0.8497, which means that this model explains about 85% of the variation in mpg. However, if we pull up the diagnostic plots, we see that the Residuals vs Fitted plot shows obvious deviations of the errors from normality, and that the Scale-Location plot is more suspect for violation of the assumption of heteroskedasticity.
Furthermore, if we pull up a dendogram of the hierarchical clustering of the variables of the mtcars data set, we can see that qsec and mpg are clustered together very closely, which means that we may not be getting additional information by modeling the variation in mpg with qsec. In other words, qsec confounds other variables that might explain the variation in mpg better.
mpg by amggplot(mtcarsf, aes(am, mpg)) + geom_boxplot() + xlab("Transmission Type") +
ylab("Miles Per Gallon") + ggtitle("Automatic vs Manual Transmission")mod2autoplot(mod2)mod3autoplot(mod3)mod5autoplot(mod5)mtcarshc <- hclust(dist(t(mtcars)), "ave")
plot(hc)