week6LinearRegressionPart2Task.Rmd 14.8 KB
Newer Older
Ben Anderson's avatar
Ben Anderson committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
---
title: "FEEG6025: Week 6 - Linear Regression Part II (NEED)"
author: "you@soton.ac.uk"
date: 'Last run at: `r Sys.time()`'
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_float: yes
  pdf_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
  word_document:
    fig_caption: yes
    toc: yes
---

# Set up libraries

You may need to install the following libraries first:

 * ggplot2
 * data.table
 * readr
 * stargazer
 * car
 * knitr

```{r setup, include=FALSE}
# DO NOT EDIT ANYTHING IN THE SETUP CHUNK!!!
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = FALSE)

library(ggplot2) # for graphs
library(readr) # for fast data reads
library(data.table) # for data munching
library(stargazer) # for nice (but slightly complex) tables
library(car) # for reression tools & tests
library(knitr) # for nicely formatted tables using kable

# set input file name & location
needfile <- "http://www.soton.ac.uk/~ba1e12/data/feeg6025/need_public_use_file_2014.csv.gz"
#needfile <- "~/OneDrive - University of Southampton/FEE/FEEG6025/Generic data/NEED/need_public_use_file_2014.csv.gz"
```

# Predicting domestic gas consumption in England & Wales

In the second part of the pratical we will use the DECC NEED public use file, a sample of 49,815 records selected to be representative of the housing stock and available to all via the DECC website and data.gov.uk: https://data.gov.uk/dataset/nationalenergyefficiencydataframeworkanonymiseddata.

## Load and process data

```{r loadNeed}
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
print(paste0("Loading: ", needfile))
# this may take a few seconds depending on the location you are loading from - it is quite large!
needPulf2014DT <- as.data.table(read_csv(needfile))
print(paste0("That produced a file of ", uniqueN(needPulf2014DT$HH_ID), " households."))
```


```{r processData}
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
# create/code some factor variables ----
# see https://blackboard.soton.ac.uk/webapps/blackboard/content/listContentEditable.jsp?content_id=_3196230_1&course_id=_170962_1 -> "Excel workbook explaining variable codes"

# Gas readings - too big/too small/missing/no gas/valid (see documentation)
needPulf2014DT$Gcons2012Valid <- factor(needPulf2014DT$Gcons2012Valid, 
                                        labels = c("> 50k", 
                                                   "< 100", 
                                                   "Missing",
                                                   "No gas",
                                                   "Valid 100 - 25k" ))

# Same for electricity - only two
needPulf2014DT$Econs2012Valid <- factor(needPulf2014DT$Econs2012Valid, 
                                        labels = c("> 25k", 
                                                   #"< 100", 
                                                   #"Missing",
                                                   "Valid 100 - 25k" ))

# Main heat fuel
needPulf2014DT$MAIN_HEAT_FUEL <- factor(needPulf2014DT$MAIN_HEAT_FUEL, 
                                        labels = c("Gas", "Electricity"))

# Do they have economy 7 electricity?
needPulf2014DT$E7Flag2012 <- factor(needPulf2014DT$E7Flag2012, 
                                    labels = c("Econ7 = Yes"))

# Age of property
needPulf2014DT$PROP_AGE <- factor(needPulf2014DT$PROP_AGE, 
                                  labels = c("before 1930",
                                             "1930-1949",
                                             "1950-1966",
                                             "1967-1982",
                                             "1983-1995",
                                             "1996 onwards"))

# Type of property
needPulf2014DT$PROP_TYPE <- factor(needPulf2014DT$PROP_TYPE, 
                                   labels = c("Detached",
                                              "Semi-detached",
                                              "End terrace",
                                              "Mid terrace",
                                              "Bungalow",
                                              "Flat (incl. maisonette)"))

# Floor area
needPulf2014DT$FLOOR_AREA_BAND <- factor(needPulf2014DT$FLOOR_AREA_BAND, 
                                         labels = c("1 to 50 m2",
                                                    "51-100 m2",
                                                    "101-150 m2",
                                                    "> 151 m2"))

# Electricity efficiency band
needPulf2014DT$EE_BAND <- factor(needPulf2014DT$EE_BAND, 
                                 labels = c("A or B",
                                            "C",
                                            "D",
                                            "E",
                                            "F",
                                            "G"))

# Wall construction
needPulf2014DT$WALL_CONS <- factor(needPulf2014DT$WALL_CONS, 
                                   labels = c("Cavity",
                                              "Other"))

# Loft depth
needPulf2014DT$LOFT_DEPTH <- factor(needPulf2014DT$LOFT_DEPTH,
                                    labels = c("Under 150 mm", 
                                               "> 150 mm",
                                               "Unknown")
                                    ) # 99 = NA (helpfully!)

# Region
needPulf2014DT$REGION <- factor(needPulf2014DT$REGION,
                                labels = c("North East", 
                                           "North West", 
                                           "Yorkshire & The Humber",
                                           "East Midlands",
                                           "West Midlands",
                                           "East of England",
                                           "London",
                                           "South East",
                                           "South West",
                                           "Wales"))

# leave li_year & boiler_year to later
```

# Inspecting the data

What is in the data and what are the central tendencies?

```{r inspectData}
# Hint: use summary() and wrap the kable function around it to get a pretty table
kable(caption = "Summary of DECC NEED 2014 PUL file",
  summary(needPulf2014DT)
)
```

 * Why are there NAs?
 
Hint: see above and also https://blackboard.soton.ac.uk/webapps/blackboard/content/listContentEditable.jsp?content_id=_3196230_1&course_id=_170962_1 -> "Excel workbook explaining variable codes".

How many properties do not have 'valid' elec/gas readings for 2012?

```{r checkValid}
# Hint: use table(x,y) and the econsvalid variable for the correct year
# Use kable again
kable(caption = "Valid gas/electricity responses in 2014",
      table(needPulf2014DT$Gcons2012Valid, 
            needPulf2014DT$Econs2011Valid,
            useNA = "always")
)
```

 * What can you conclude? (Hint: look at the documentation to see why)
 * Why did we include the `useNA = "always"` option?
 * What would have happened if we had not?

# Visualising Distributions

Have a look at gas consumption in 2012 to show the 'binning' applied by DECC prior to data release.

```{r gasConsumptionHisto2009}
gasConsumptionHisto2012 <- ggplot(needPulf2014DT)

gasConsumptionHisto2012 + geom_histogram(
  aes(
    x = Gcons2012
    ),
  binwidth = 100
  ) + 
  labs(
    title = "NEED 2014 Data: Annual domestic gas consumption 2012",
    x = "Gas consumption (kWh)"
    )
```

It seems that DECC have coded the data in strange ways... Look at the documentation https://blackboard.soton.ac.uk/webapps/blackboard/content/listContentEditable.jsp?content_id=_3196230_1&course_id=_170962_1 -> "Excel workbook explaining variable codes" to find out how.

 * What have they done?
 * What does this mean for the analysis?

# Research: What predicts gas consumption in 2012?

How to approach this:

 * Task: Run a linear regression model to understand the relationship between gas consumption in 2012 and dwelling characteristics

## Test assumptions

### Is the outcome variable (gas consumption in 2012) normally distributed?

```{r testGas2012}
# draw a histogram
hist(needPulf2014DT$Gcons2012)

# use a qqnorm plot
qqnorm(needPulf2014DT$Gcons2012); qqline(needPulf2014DT$Gcons2012, col = 2)

# use a shapiro test
try(
  shapiro.test(needPulf2014DT$Gcons2012)
)
```

 * Why do you think we wrapped the shapiro test in a `try()` function?
 * Why do you think the shapiro test failed to run?
 * What should we conclude from the qq plot?

If you think that gas consumption is not normally distirbuted then you could transform it using a sqrt function to try to fix this.

```{r fixGas2012}
needPulf2014DT$sqrtGcons2012 <- sqrt(needPulf2014DT$Gcons2012)
```

We can now re-check the distribution but we won't bother with the shapiro test.

```{r testSqrtGas2012}
# draw a histogram
hist(needPulf2014DT$sqrtGcons2012)

# use a qqnorm plot
qqnorm(needPulf2014DT$sqrtGcons2012); qqline(needPulf2014DT$sqrtGcons2012, col = 2)
```

 * Has that improved things?

### Are the observations likely to be independent? 

Justify your answer

## Does property age predict gas consumption in 2012?

Develop a linear model to predict gas consumption in 2012 based on `PROP_AGE`.

```{r modelGas2012}
# as an example we will test property age

# first look at the possible values
kable(caption = "Summary of property age by valid gas code (to check)",
      table(needPulf2014DT$PROP_AGE, 
            needPulf2014DT$Gcons2012Valid, 
            useNA = "always")
)

# now create the model
gasModel1 <- lm(sqrtGcons2012 ~  PROP_AGE,
               data = needPulf2014DT)

# look at the results
summary(gasModel1)
```

 * Notice that 1 category within each of your factor variables 'vanishes' in the regression summary output. Why?
 * What can you conclude from the results?

### Test diagnostics - is our model valid?

Normally we would check for collinearity and tolerance but we only have one independent variable so we don't need to.
 
Now check for homoskedasticity:
```{r model1HomoskCheck}
# hint: library(car) and then a spreadLevelPlot()
spreadLevelPlot(gasModel1) 
```

 * Do you think the variance of the residuals is constant?
 * Did the plot suggest a transformation? If so, why?

Now a formal test of homoskedasticity:
```{r testModel1Homosk}
ncvTest(gasModel1)
# if p > 0.05 then there is heteroskedasticity
```

 * What do you conclude?
 
Now test for normality of residuals
```{r model1NormResid}
# Hint: library(car) and then qqPlot
qqPlot(gasModel1) # shows default 95% CI
```

 * What do you conclude?
 
And finally test for autocorrelation/independence of errors:

```{r model1TestAuto}
durbinWatsonTest(gasModel1)
# if p < 0.05 then a problem as implies autocorrelation
```

 * What do you conclude?
 
```{r model1AICBIC}
extractAIC(gasModel1)
BIC(gasModel1)
```

Overall, what do these diagnostics tell you about your model?

### Results

Now produce a more useful output using stargazer. This is slightly more complicated but worth it.

```{r modelGas2012Results, results='asis'}
# NB: we need to force knitr to keep the html (asis)
stargazer(gasModel1, 
          ci = TRUE, 
          single.row = TRUE,
          type = "html")
```

You can see that we have asked stargzer to produce the confidence intervals for each co-efficient. Using the confidence intervals:

 * Which property age has the most precise estimate? (Hint: compare the width of the CI)
 * What does the value for 1930-1949 indicate?
 * What does the value for 1967-1982 indicate?
 
## Does adding a measure of loft insulation improve the model?

Use:

 * gasModel2 <- lm(sqrtGcons2012 ~ PROP_AGE + LOFT_DEPTH, data = needPulf2014DT)

```{r model2Gas2012}
# as an example we will test property age

# first look at the possible values
kable(caption = "Summary of loft insulation by valid gas code (to check)",
      table(needPulf2014DT$LOFT_DEPTH, 
            needPulf2014DT$Gcons2012Valid, 
            useNA = "always")
)

# now run the model
gasModel2 <- lm(sqrtGcons2012 ~ PROP_AGE + LOFT_DEPTH, data = needPulf2014DT)
```

### Check diagnostics

Collinearity:
```{r moodel2ColTest}
# Hint: vif()
vif(gasModel2)
# if any values > 10 -> problem
```

 * What do you conclude?
 
Next test for tolerance:

```{r model2TolTest}
# tolerance
1/vif(gasModel2)
# if any values < 0.2 -> possible problem
```

 * What do you conclude?
 
Check for homoskedasticity:
```{r model2HomoskCheck}
# hint: library(car) and then a spreadLevelPlot()
spreadLevelPlot(gasModel2) 
```

 * What do you conclude?
 
Now test for normality of residuals
```{r model2NormResid}
# Hint: library(car) and then qqPlot
qqPlot(gasModel2) # shows default 95% CI
```

 * What do you conclude?
 
And finally test for autocorrelation/independence of errors:

```{r model2TestAuto}
durbinWatsonTest(gasModel2)
# if p < 0.05 then a problem as implies autocorrelation
```

 * What do you conclude?
 
Overall, what do these diagnostics tell you about your model?

### Results

Now produce a useful output using stargazer.

```{r model2Gas2012Results, results='asis'}
# NB: we need to force knitr to keep the html (asis)
stargazer(gasModel2, 
          ci = TRUE, 
          single.row = TRUE,
          type = "html")
```

```{r model2AICBIC}
extractAIC(gasModel2)
BIC(gasModel2)
```

 * What can you conclude from the results?
 * How do they differ from model 1?
 * How should we interpret `unknown` LOFT_DEPTH?
 
## Comparing the models

We might want to know if model 2 is substantially better than model 1 at predicting gas consumption. We use the `car` package ANOVA function to do this using the two models as input:

```{r testModels}
anova(gasModel1, gasModel2)
```

 * What does this comparison tell us?
 
We can also get stargazer to present the models side by side so we can compare them:

```{r modelCombinedGas2012Results, results='asis'}
# NB: we need to force knitr to keep the html (asis)
stargazer(gasModel1, gasModel2, 
          ci = TRUE, 
          single.row = TRUE,
          type = "html")
```

Test AIC/BICC:

```{r modelsAICBIC}
extractAIC(gasModel1)
extractAIC(gasModel2)
BIC(gasModel1)
BIC(gasModel2)
```

 * What does this analysis tell us?

# Making predictions

Our models are useful for explaining the relationships (and thinking about causal links) between variables. But we can also use them to predict. For example:

 * what would we expect the gas consumption of a house built in 1950-1966  with > 150 mm of loft insulation to be if we used model 2? 
 * What woud be our 95% confidence interval for this estimate?

# Improving the model

If you have completed the above tasks then repeat the analysis for different combinations of variables to see if you can develop a better model.

# Acknowledgements
The NEED data is (c) Crown Copyright. It was published by Department of Energy and Climate Change under an [Open Government](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/) Licence.