---
title: "FEEG6025: Week 6 - Linear Regression Part I (mtcars)"
author: "you@soton.ac.uk"
date: 'Last run at: `r Sys.time()`'
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_float: yes
  pdf_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
  word_document:
    fig_caption: yes
    toc: yes
---

# Set up libraries

You may need to install the following libraries first:

 * ggplot2
 * data.table
 * readr
 * stargazer
 * car
 * knitr

```{r setup, include=FALSE}
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)

library(ggplot2) # for graphs
library(readr) # for fast data reads
library(data.table) # for data munching
library(stargazer) # for nice (but slightly complex) tables
library(car) # for regression tools & tests
library(knitr) # for nicely formatted tables using kable

```

# Predicting fuel consumption

In the first part of the practical we will use the R `mtcars` dataset to try to predict fuel consumption based on a number of characteristics of the car.

## Load mtcars data
First copy the `mtcars` data to a data table for easy manipulation:

```{r loadMtcars}
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
mtcarsDT <- as.data.table(mtcars)
```

`mtcarsDT` now contains the following:

 * mpg	Miles/(US) gallon
 * cyl	Number of cylinders
 * disp	Displacement (cu.in.)
 * hp	Gross horsepower
 * drat	Rear axle ratio
 * wt	Weight (1000 lbs)
 * qsec	1/4 mile time
 * vs	Engine (0 = V-shaped, 1 = straight)
 * am	Transmission (0 = automatic, 1 = manual)
 * gear	Number of forward gears
 * carb	Number of carburetors

## Inspect mtcars data

First we need to check some basic assumptions about the relationships between the variables we are interested in. We are going to use mpg, qsec and wt. We need the relationships between the predictor (independent) variables and the outcome (dependent) variable to be close to linear, and we need the outcome variable (mpg) to be (approximately) normally distributed.

```{r inspectMtcars}


# establish normality of mpg (outcome variable of interest)


# Shapiro-Wilk test
# if p > 0.05 => cannot reject normality


```
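
If you get stuck, the sketch below shows one possible approach (it uses `eval=FALSE` so it will not run when the template is knitted; the plotting choices are only suggestions):

```{r exampleInspectMtcars, eval=FALSE}
# scatter plots to check the predictor vs outcome relationships look roughly linear
ggplot(mtcarsDT, aes(x = qsec, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") # linear fit line for reference

ggplot(mtcarsDT, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

# distribution of the outcome variable
ggplot(mtcarsDT, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Shapiro-Wilk test: p > 0.05 => cannot reject normality
shapiro.test(mtcarsDT$mpg)
```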

Questions:

 * Are the relationships linear?
 * Is mpg normally distributed?

We will also test the distributions of the predictor (independent) variables:

```{r testPredictors}

```
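
Again, one possible sketch (`eval=FALSE`):

```{r examplePredictors, eval=FALSE}
# histograms of the two predictors we plan to use
ggplot(mtcarsDT, aes(x = qsec)) + geom_histogram(bins = 10)
ggplot(mtcarsDT, aes(x = wt)) + geom_histogram(bins = 10)

# Shapiro-Wilk tests (normality of the predictors is not a regression
# assumption, but marked skew is still worth knowing about)
shapiro.test(mtcarsDT$qsec)
shapiro.test(mtcarsDT$wt)
```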

Questions:

 * What do these histograms tell us?
 * Is there anything we should watch out for?
 
## Simple model (1 predictor)
Now we will develop a simple model predicting mpg from qsec (time to go 1/4 mile from stationary):

```{r mpgModel1}


# basic results


# ggplot visualisation of the model

```
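
One possible sketch is below; the object name `mpgModel1` is an assumption that the later examples re-use (`eval=FALSE`):

```{r exampleMpgModel1, eval=FALSE}
# fit a simple linear model: mpg predicted by qsec
mpgModel1 <- lm(mpg ~ qsec, data = mtcarsDT)

# basic results: coefficients, standard errors, R^2 etc.
summary(mpgModel1)

# visualise the fitted line over the data
ggplot(mtcarsDT, aes(x = qsec, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
```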

Questions:

 * What does the model tell us?

### Model checks: Normality of residuals

```{r checkResidMpgModel1}
# Do they look normal?


# What does a Shapiro-Wilk test produce?
# if p > 0.05 => normal 

# The 'car' package has a nice graph to help here
#install.packages("car") # if needed

```
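
A possible sketch, assuming the model object is called `mpgModel1` (`eval=FALSE`):

```{r exampleResidMpgModel1, eval=FALSE}
# histogram of the residuals: do they look roughly normal?
hist(residuals(mpgModel1))

# Shapiro-Wilk test on the residuals: p > 0.05 => cannot reject normality
shapiro.test(residuals(mpgModel1))

# qqPlot() from the car package: points should lie close to the line
qqPlot(mpgModel1)
```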
 
Questions:

 * What can we conclude from these tests?

**NB: Don't rely on the Shapiro-Wilk test with large n: with many observations even trivial departures from normality produce small p-values.**

### Model checks: Autocorrelation/independence of errors:

```{r checkAutocMpgModel1}
# Durbin-Watson
# if p < 0.05 then implies autocorrelation
# if Durbin–Watson is less than 1.0, there may be cause for alarm. 
# Small values of d indicate successive error terms are, on average, close in value to one another, or positively correlated. 
# If d > 2, successive error terms are, on average, much different in value from one another, i.e., negatively correlated.
# https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic#Computing_and_interpreting_the_Durbin.E2.80.93Watson_statistic


```
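
One way to run the test is `durbinWatsonTest()` from the `car` package, assuming `mpgModel1` (`eval=FALSE`):

```{r exampleAutocMpgModel1, eval=FALSE}
# Durbin-Watson test for autocorrelation of the residuals
durbinWatsonTest(mpgModel1)
```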

Questions:

 * What should we conclude?
 * Why?
 * Could we have suspected this from anything we checked earlier?

### Model checks: heteroskedasticity

```{r checkHeteroskMpgModel1}
library(car) # in case not already loaded

# formal test
# if p < 0.05 then there is evidence of heteroskedasticity (non-constant variance)

```
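
A possible sketch using the `car` package's `ncvTest()` and `spreadLevelPlot()`, assuming `mpgModel1` (`eval=FALSE`):

```{r exampleHeteroskMpgModel1, eval=FALSE}
# residuals vs fitted values: look for a funnel shape
plot(mpgModel1, which = 1)

# spread-level plot; also suggests a power transformation
spreadLevelPlot(mpgModel1)

# formal test: p < 0.05 => evidence of non-constant variance
ncvTest(mpgModel1)
```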

Questions:

 * Do you think the variance of the residuals is constant? Justify your answer.
 * Did the plot suggest a transformation?
 * If so, why?


### Summary

We can get the `stargazer` package to print a more informative table of the results:

```{r mpgModel1Results, results='asis'}
# NB: we need to force knitr to keep the html (asis)

# Visualise the model results using ggplot

```
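
One possible sketch; the exact `stargazer` options are only a suggestion and assume `mpgModel1` (`eval=FALSE`):

```{r exampleMpgModel1Stargazer, eval=FALSE, results='asis'}
# stargazer table with 95% confidence intervals instead of standard errors
stargazer(mpgModel1,
          type = "html",   # keep the html when knitting (results='asis')
          ci = TRUE,       # report 95% CIs in brackets
          single.row = TRUE)
```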

You can see that we have asked stargazer to produce the confidence intervals for each coefficient in brackets. Using the confidence intervals:

 * What are the lower and upper 95% confidence bounds for the effect of qsec on mpg?
 * What does the size of the 95% CI for the intercept tell you?
 

## Multivariate Model (2 predictors)

We now want to add a second predictor variable (`wt` = weight) to see if we can improve the fit of the model.


```{r mpgModel2}

# basic results


# Visualise the model results using ggplot

```
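
One possible sketch; `mpgModel2` is again an assumed object name (`eval=FALSE`):

```{r exampleMpgModel2, eval=FALSE}
# add weight (wt) as a second predictor
mpgModel2 <- lm(mpg ~ qsec + wt, data = mtcarsDT)

# basic results
summary(mpgModel2)

# quick look at the fit: predicted vs observed mpg
ggplot(mtcarsDT, aes(x = predict(mpgModel2), y = mpg)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) # perfect prediction line
```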

Questions:

 * What does this model tell us?
 * What does the size of the 95% CI for the intercept tell you now?

### Model checks: Normality of residuals

```{r checkResidMpgModel2}
# Do they look normal?


# What does a S-W test produce?
# if p > 0.05 => normal 


# The 'car' package has a nice graph to help here
#install.packages("car") # if needed

```
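
The same checks as before, now assuming `mpgModel2` (`eval=FALSE`):

```{r exampleResidMpgModel2, eval=FALSE}
hist(residuals(mpgModel2))           # visual check
shapiro.test(residuals(mpgModel2))   # p > 0.05 => cannot reject normality
qqPlot(mpgModel2)                    # car package QQ plot
```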
 
Questions:

 * What can we conclude from these tests?

**NB: Don't rely on the Shapiro-Wilk test with large n: with many observations even trivial departures from normality produce small p-values.**

### Model checks: Autocorrelation/independence of errors:

```{r checkAutocMpgModel2}
# Durbin-Watson
# if p < 0.05 then implies autocorrelation
# if Durbin–Watson is less than 1.0, there may be cause for alarm. 
# Small values of d indicate successive error terms are, on average, close in value to one another, or positively correlated. 
# If d > 2, successive error terms are, on average, much different in value from one another, i.e., negatively correlated.
# https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic#Computing_and_interpreting_the_Durbin.E2.80.93Watson_statistic


```
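
As before, assuming `mpgModel2` (`eval=FALSE`):

```{r exampleAutocMpgModel2, eval=FALSE}
durbinWatsonTest(mpgModel2) # p < 0.05 => evidence of autocorrelation
```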

Questions:

 * What should we conclude?
 * Why?

### Model checks: heteroskedasticity

```{r checkHeteroskMpgModel2}



# formal test
# if p < 0.05 then there is evidence of heteroskedasticity (non-constant variance)

```
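
As before, assuming `mpgModel2` (`eval=FALSE`):

```{r exampleHeteroskMpgModel2, eval=FALSE}
plot(mpgModel2, which = 1)  # residuals vs fitted
spreadLevelPlot(mpgModel2)  # also suggests a power transformation
ncvTest(mpgModel2)          # p < 0.05 => evidence of non-constant variance
```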

Questions:

 * Do you think the variance of the residuals is constant? Justify your answer.
 * Did the plot suggest a transformation?
 * If so, why?

### Model checks: collinearity

This check only makes sense when we have two or more independent variables, as we want to check whether they are collinear (highly correlated with each other).

```{r checkCollinMpgModel2}
# collinearity
# if any values > 10 -> problem


# tolerance
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem

```
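
One possible sketch using `vif()` from the `car` package; tolerance is simply 1/VIF (assumes `mpgModel2`; `eval=FALSE`):

```{r exampleCollinMpgModel2, eval=FALSE}
# variance inflation factors: values > 10 suggest a problem
vif(mpgModel2)

# tolerance = 1/VIF: < 0.2 possible problem, < 0.1 definite problem
1 / vif(mpgModel2)
```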

Questions:

 * What can we conclude from the vif test?
 * What can we conclude from the tolerance test?
 
### Summary

If we are reasonably happy with the models we can get the `stargazer` package to print a more informative table of the results of **both** models:

```{r mpgModelAllResults, results='asis'}
# NB: we need to force knitr to keep the html (asis)

```
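
A possible sketch of a combined table, assuming `mpgModel1` and `mpgModel2` (`eval=FALSE`):

```{r exampleAllResultsStargazer, eval=FALSE, results='asis'}
# both models side by side, with 95% CIs in brackets
stargazer(mpgModel1, mpgModel2,
          type = "html",
          ci = TRUE,
          single.row = TRUE)
```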

Questions:

 * What effect has adding wt to the model had on the explanatory power of the model?
 * What effect has it had on the qsec coefficient? Why do you think that is?
 
You can see that we have again asked stargazer to produce the confidence intervals for each coefficient in brackets. Using the confidence intervals:

 * What are the lower and upper 95% confidence bounds for the effect of qsec on mpg in each model?
 * Which model gives the more precise estimates? Explain your answer.

We can also use the AIC & BIC to test if our model has improved:

```{r modelsAICBIC}
# The smaller the AIC/BIC the better the fit

```
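
One possible sketch (`eval=FALSE`):

```{r exampleAICBIC, eval=FALSE}
# lower AIC/BIC => better trade-off between fit and complexity
AIC(mpgModel1, mpgModel2)
BIC(mpgModel1, mpgModel2)
```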

Questions:

 * What should we conclude from this?
 
Finally we can use R's `anova()` function to test the difference between the two (nested) models:

```{r compareMpgModels}

```
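
Because model 1 is nested within model 2, `anova()` can compare them directly; a possible sketch (`eval=FALSE`):

```{r exampleCompareModels, eval=FALSE}
# F test of whether adding wt significantly improves the fit
anova(mpgModel1, mpgModel2)
```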

Questions:

 * What should we conclude from this test?

## Model 3 - Adding a categorical (factor) predictor

So far we have only used continuous variables. But it is quite common to need to use dummy variables (which take the value 0, e.g. 'No', or 1, e.g. 'Yes') or factor variables, which may have several levels but are not continuous variables.

In the following example we add two such variables:

 * a dummy variable: vs (0 = V-shaped engine, 1 = straight engine)
 * a factor variable: gear (number of forward gears)

```{r mpgModel3}
# add labels to factors

# run the model

# test how many levels of the gears variable exist


# basic results


# Visualise the model results using ggplot

```
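
A possible sketch; the factor labels and the object name `mpgModel3` are assumptions (`eval=FALSE`):

```{r exampleMpgModel3, eval=FALSE}
# label the dummy and factor variables so the output is readable
mtcarsDT[, vsLabelled := factor(vs, labels = c("V-shaped", "Straight"))]
mtcarsDT[, gearFactor := factor(gear)]

# how many levels (and how many cars per level) does the gears factor have?
table(mtcarsDT$gearFactor)

# add the dummy and factor variables to the two-predictor model
mpgModel3 <- lm(mpg ~ qsec + wt + vsLabelled + gearFactor, data = mtcarsDT)

# basic results: one coefficient per non-reference level
summary(mpgModel3)
```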

Normally we would run the same tests as above but for now:

 * Explain what the results for the dummy engine type (`vs`) variable are telling you
 * Explain what the results for the factor (gears) variable are telling you

# Conclusions

On the basis of these results, what would you do next?