You may need to install the following libraries first:

- ggplot2
- data.table
- readr
- stargazer
- car
- MASS
- broom
- knitr

You may also need to update all packages. Make sure you include ‘dependencies’ if you do this.

Today’s work will use the use the DECC NEED public use file, a sample of 49,815 records selected to be representative of the housing stock and available to all via the DECC website and data.gov.uk: https://data.gov.uk/dataset/nationalenergyefficiencydataframeworkanonymiseddata.

We will use it to explore two research questions:

- What predicts high household gas consumption in 2012?
- What predicts being a high, medium or low gas consumer in 2012?

These questions will require different modelling approaches.

First load the data.

`## [1] "Loading: http://www.soton.ac.uk/~ba1e12/data/feeg6025/need_public_use_file_2014.csv.gz"`

```
## Parsed with column specification:
## cols(
## .default = col_integer(),
## REGION = col_character(),
## Gcons2005Valid = col_character(),
## Gcons2006Valid = col_character(),
## Gcons2007Valid = col_character(),
## Gcons2008Valid = col_character(),
## Gcons2009Valid = col_character(),
## Gcons2010Valid = col_character(),
## Gcons2011Valid = col_character(),
## Gcons2012Valid = col_character(),
## Econs2005Valid = col_character(),
## Econs2006Valid = col_character(),
## Econs2007Valid = col_character(),
## Econs2008Valid = col_character(),
## Econs2009Valid = col_character(),
## Econs2010Valid = col_character(),
## Econs2011Valid = col_character(),
## Econs2012Valid = col_character()
## )
```

`## See spec(...) for full column specifications.`

`## [1] "That produced a file of 49815 households."`

Now process the data to remove years we are not interested in and tidy up variable labels etc. This leaves the following variables:

```
## [1] "HH_ID" "REGION" "IMD_ENG"
## [4] "IMD_WALES" "Gcons2012" "Gcons2012Valid"
## [7] "MAIN_HEAT_FUEL" "PROP_AGE" "PROP_TYPE"
## [10] "FLOOR_AREA_BAND" "EE_BAND" "LOFT_DEPTH"
## [13] "WALL_CONS" "CWI" "CWI_YEAR"
## [16] "LI" "LI_YEAR" "BOILER"
## [19] "BOILER_YEAR"
```

What is in the data and what are the central tendencies?

```
# Hint: use summary() and wrap the kable function around it to get a pretty table
kable(caption = "Summary of DECC NEED 2014 PUL file",
summary(needGasDT)
)
```

HH_ID | REGION | IMD_ENG | IMD_WALES | Gcons2012 | Gcons2012Valid | MAIN_HEAT_FUEL | PROP_AGE | PROP_TYPE | FLOOR_AREA_BAND | EE_BAND | LOFT_DEPTH | WALL_CONS | CWI | CWI_YEAR | LI | LI_YEAR | BOILER | BOILER_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Min. : 1 | South East : 7633 | Min. :1.000 | Min. :1.00 | Min. : 0 | > 50k : 307 | Gas :42546 | before 1930 :13335 | Detached : 7910 | 1 to 50 m2: 6776 | A or B: 1437 | Under 150 mm: 8163 | Cavity:32242 | Length:49815 | Length:49815 | Min. :1 | Min. :2004 | Min. :1 | Length:49815 | |

1st Qu.:12454 | North West : 6605 | 1st Qu.:2.000 | 1st Qu.:2.00 | 1st Qu.: 8400 | < 100 : 434 | Electricity: 7269 | 1930-1949 : 7512 | Semi-detached :12979 | 51-100 m2 :25171 | C :12668 | > 150 mm :29092 | Other :17573 | Class :character | Class :character | 1st Qu.:1 | 1st Qu.:2007 | 1st Qu.:1 | Class :character | |

Median :24908 | London : 6427 | Median :3.000 | Median :3.00 | Median :12700 | Missing : 2678 | NA | 1950-1966 : 8975 | End terrace : 4530 | 101-150 m2:13503 | D :21283 | Unknown :12560 | NA | Mode :character | Mode :character | Median :1 | Median :2010 | Median :1 | Mode :character | |

Mean :24908 | East of England : 5241 | Mean :2.993 | Mean :2.92 | Mean :13859 | No gas : 6278 | NA | 1967-1982 : 9856 | Mid terrace : 9857 | > 151 m2 : 4365 | E :10803 | NA | NA | NA | NA | Mean :1 | Mean :2009 | Mean :1 | NA | |

3rd Qu.:37362 | West Midlands : 4909 | 3rd Qu.:4.000 | 3rd Qu.:4.00 | 3rd Qu.:18000 | Valid 100 - 25k:40118 | NA | 1983-1995 : 5243 | Bungalow : 5155 | NA | F : 2956 | NA | NA | NA | NA | 3rd Qu.:1 | 3rd Qu.:2012 | 3rd Qu.:1 | NA | |

Max. :49815 | Yorkshire & The Humber: 4857 | Max. :5.000 | Max. :5.00 | Max. :50000 | NA | NA | 1996 onwards: 4894 | Flat (incl. maisonette): 9384 | NA | G : 668 | NA | NA | NA | NA | Max. :1 | Max. :2012 | Max. :1 | NA | |

NA | (Other) :14143 | NA’s :2747 | NA’s :47068 | NA’s :9697 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :39291 | NA’s :39291 | NA’s :37063 | NA |

We want to know if the distribution of the age of dwellings differes for those with/without cavity wall insulation.

```
kable(caption = "Cavity wall insulation by age of dwelling (counts)",
table(needGasDT$PROP_AGE,needGasDT$CWI)
)
```

Installed | Not known | |
---|---|---|

before 1930 | 695 | 12640 |

1930-1949 | 1712 | 5800 |

1950-1966 | 2968 | 6007 |

1967-1982 | 2856 | 7000 |

1983-1995 | 841 | 4402 |

1996 onwards | 266 | 4628 |

Now re-run the table and create row proportions for ease of comparison:

```
kable(caption = "Cavity wall insulation by age of dwelling (row proportions)",
round(prop.table(table(needGasDT$PROP_AGE,
needGasDT$CWI
)
,1 # option 1 to prop.table gives column proportions
)
,2
) # round to 2 d.p
)
```

Installed | Not known | |
---|---|---|

before 1930 | 0.05 | 0.95 |

1930-1949 | 0.23 | 0.77 |

1950-1966 | 0.33 | 0.67 |

1967-1982 | 0.29 | 0.71 |

1983-1995 | 0.16 | 0.84 |

1996 onwards | 0.05 | 0.95 |

**Please answer this and all the following questions in the blackboard test, NOT in the RMarkdown**

- Q1: Is there a difference in the distribution of cavity wall insulation between dwelling age groups?

We now need to test of the differences are statisitcally significant using a chi sq test:

`chisq.test(needGasDT$PROP_AGE,needGasDT$CWI)`

```
##
## Pearson's Chi-squared test
##
## data: needGasDT$PROP_AGE and needGasDT$CWI
## X-squared = 4165.1, df = 5, p-value < 2.2e-16
```

- Q2: What would you conclude from the cavity wall insulation chi sq test?

In this example we want to test the 3 possible loft depth values:

```
kable(caption = "Loft depth by age of dwelling (row proportions)",
round(prop.table(table(needGasDT$PROP_AGE,
needGasDT$LOFT_DEPTH
)
,1 # option 1 to prop.table gives column proportions
)
,2
) # round to 2 d.p
)
```

Under 150 mm | > 150 mm | Unknown | |
---|---|---|---|

before 1930 | 0.17 | 0.50 | 0.33 |

1930-1949 | 0.18 | 0.65 | 0.18 |

1950-1966 | 0.16 | 0.64 | 0.20 |

1967-1982 | 0.16 | 0.60 | 0.24 |

1983-1995 | 0.22 | 0.56 | 0.22 |

1996 onwards | 0.09 | 0.59 | 0.32 |

- Q3: What would you conclude from the Loft depth table of row proportions?

We now need to test of the differences are statisitcally significant using a log linear model:

```
# make xtab
loftXt <- xtabs(~ PROP_AGE + LOFT_DEPTH, needGasDT)
# use it in log linear model
loglm(~ PROP_AGE + LOFT_DEPTH, loftXt, fit=TRUE)
```

```
## Call:
## loglm(formula = ~PROP_AGE + LOFT_DEPTH, data = loftXt, fit = TRUE)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 1265.937 10 0
## Pearson 1237.130 10 0
```

- Q4: Does the log linear test suggest a significant difference between some of the groups?

We want to test if the following characteristics are associated with low gas consumption:

- property age (PROP_AGE)
- property type (PROP_TYPE)
- loft depth (LOFT_DEPTH)
- floor area (FLOOR_AREA_BAND)

We define high gas consumption as being in the bottom 20% of consumers. First create an indicator for these households so we can use it as our outcome variable:

```
# note that there are some NAs in the data so we have to tell R to exlcue these
gasCut <- quantile(needGasDT$Gcons2012, probs = c(0.2), na.rm = TRUE)
# create an indicator for highest 20% vs the rest
needGasDT$Gcons2012Lo <- ifelse(needGasDT$Gcons2012 > gasCut,
0, # not low
1 # low
)
needGasDT$Gcons2012Lo <- factor(needGasDT$Gcons2012Lo,
labels = c(
"Not low",
"Lowest 20%"
)
)
# check
#prop.table(table(needGasDT$Gcons2012Lo))
# check that we have valid gas data for all of our variables
kable(caption="Check PROP_AGE",
table(needGasDT$PROP_AGE,needGasDT$Gcons2012Lo, useNA = "always")
)
```

Not low | Lowest 20% | NA | |
---|---|---|---|

before 1930 | 8504 | 2140 | 2691 |

1930-1949 | 5814 | 1061 | 637 |

1950-1966 | 6089 | 1767 | 1119 |

1967-1982 | 5897 | 1994 | 1965 |

1983-1995 | 2699 | 1005 | 1539 |

1996 onwards | 2462 | 686 | 1746 |

NA | 0 | 0 | 0 |

```
kable(caption="Check PROP_TYPE",
table(needGasDT$PROP_TYPE,needGasDT$Gcons2012Lo, useNA = "always")
)
```

Not low | Lowest 20% | NA | |
---|---|---|---|

Detached | 5848 | 279 | 1783 |

Semi-detached | 10127 | 1600 | 1252 |

End terrace | 3285 | 755 | 490 |

Mid terrace | 6797 | 2232 | 828 |

Bungalow | 3392 | 699 | 1064 |

Flat (incl. maisonette) | 2016 | 3088 | 4280 |

NA | 0 | 0 | 0 |

```
kable(caption="Check LOFT_DEPTH",
table(needGasDT$LOFT_DEPTH,needGasDT$Gcons2012Lo, useNA = "always")
)
```

Not low | Lowest 20% | NA | |
---|---|---|---|

Under 150 mm | 5602 | 1247 | 1314 |

> 150 mm | 20200 | 4579 | 4313 |

Unknown | 5663 | 2827 | 4070 |

NA | 0 | 0 | 0 |

```
kable(caption="Check FLOOR_AREA_BAND",
table(needGasDT$FLOOR_AREA_BAND,needGasDT$Gcons2012Lo, useNA = "always")
)
```

Not low | Lowest 20% | NA | |
---|---|---|---|

1 to 50 m2 | 1134 | 2341 | 3301 |

51-100 m2 | 16581 | 5288 | 3302 |

101-150 m2 | 10919 | 914 | 1670 |

> 151 m2 | 2831 | 110 | 1424 |

NA | 0 | 0 | 0 |

If we see ‘NA’ as a value in the row labels it means we are missing observations for this predictor variable.

Now construct a logistic regression model to test which of the above variables predict low gas consumption in 2012 using the variables above.

```
logitGasLoM1 <- glm(formula = Gcons2012Lo ~ PROP_AGE + PROP_TYPE + LOFT_DEPTH + FLOOR_AREA_BAND,
family = binomial(logit), needGasDT)
```

Before we interpret the results we need to check the model assumptions. These are a subset of the ones we run for linear OLS models.

```
# Diagnostics: independence of errors
# If p < 0.05 then a problem as implies autocorrelation
# but beware large sample sizes - you will get a small p value with large sample sizes, Instead look at the D-W statistic:
# if D-W is less than 1.0, there may be cause for alarm as small values of D-W indicate successive error terms are, on average, close in value to one another, or positively correlated.
# Values arund 2 re usually considered acceptable
# If D-W > 2, successive error terms are, on average, very different in value from one another, i.e., negatively correlated.
durbinWatsonTest(logitGasLoM1)
```

```
## lag Autocorrelation D-W Statistic p-value
## 1 0.0215829 1.956769 0
## Alternative hypothesis: rho != 0
```

```
# Diagnostics: collinearity (vif)
# if any values > 10 -> problem
vif(logitGasLoM1)
```

```
## GVIF Df GVIF^(1/(2*Df))
## PROP_AGE 1.295371 5 1.026217
## PROP_TYPE 3.223326 5 1.124166
## LOFT_DEPTH 1.461227 2 1.099460
## FLOOR_AREA_BAND 2.172279 3 1
```