1 Set up libraries

You may need to install the following libraries first:

  • ggplot2
  • data.table
  • readr
  • stargazer
  • car
  • MASS
  • broom
  • knitr

You may also need to update all packages. Make sure you include ‘dependencies’ if you do this.

2 Background

Today’s work will use the use the DECC NEED public use file, a sample of 49,815 records selected to be representative of the housing stock and available to all via the DECC website and data.gov.uk: https://data.gov.uk/dataset/nationalenergyefficiencydataframeworkanonymiseddata.

We will use it to explore two research questions:

  • What predicts high household gas consumption in 2012?
  • What predicts being a high, medium or low gas consumer in 2012?

These questions will require different modelling approaches.

3 Load and process data

First load the data.

## [1] "Loading: http://www.soton.ac.uk/~ba1e12/data/feeg6025/need_public_use_file_2014.csv.gz"
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   REGION = col_character(),
##   Gcons2005Valid = col_character(),
##   Gcons2006Valid = col_character(),
##   Gcons2007Valid = col_character(),
##   Gcons2008Valid = col_character(),
##   Gcons2009Valid = col_character(),
##   Gcons2010Valid = col_character(),
##   Gcons2011Valid = col_character(),
##   Gcons2012Valid = col_character(),
##   Econs2005Valid = col_character(),
##   Econs2006Valid = col_character(),
##   Econs2007Valid = col_character(),
##   Econs2008Valid = col_character(),
##   Econs2009Valid = col_character(),
##   Econs2010Valid = col_character(),
##   Econs2011Valid = col_character(),
##   Econs2012Valid = col_character()
## )
## See spec(...) for full column specifications.
## [1] "That produced a file of 49815 households."

Now process the data to remove years we are not interested in and tidy up variable labels etc. This leaves the following variables:

##  [1] "HH_ID"           "REGION"          "IMD_ENG"        
##  [4] "IMD_WALES"       "Gcons2012"       "Gcons2012Valid" 
##  [7] "MAIN_HEAT_FUEL"  "PROP_AGE"        "PROP_TYPE"      
## [10] "FLOOR_AREA_BAND" "EE_BAND"         "LOFT_DEPTH"     
## [13] "WALL_CONS"       "CWI"             "CWI_YEAR"       
## [16] "LI"              "LI_YEAR"         "BOILER"         
## [19] "BOILER_YEAR"

4 Inspecting the data

What is in the data and what are the central tendencies?

# Hint: use summary() and wrap the kable function around it to get a pretty table
kable(caption = "Summary of DECC NEED 2014 PUL file",
  summary(needGasDT)
)
Summary of DECC NEED 2014 PUL file
HH_ID REGION IMD_ENG IMD_WALES Gcons2012 Gcons2012Valid MAIN_HEAT_FUEL PROP_AGE PROP_TYPE FLOOR_AREA_BAND EE_BAND LOFT_DEPTH WALL_CONS CWI CWI_YEAR LI LI_YEAR BOILER BOILER_YEAR
Min. : 1 South East : 7633 Min. :1.000 Min. :1.00 Min. : 0 > 50k : 307 Gas :42546 before 1930 :13335 Detached : 7910 1 to 50 m2: 6776 A or B: 1437 Under 150 mm: 8163 Cavity:32242 Length:49815 Length:49815 Min. :1 Min. :2004 Min. :1 Length:49815
1st Qu.:12454 North West : 6605 1st Qu.:2.000 1st Qu.:2.00 1st Qu.: 8400 < 100 : 434 Electricity: 7269 1930-1949 : 7512 Semi-detached :12979 51-100 m2 :25171 C :12668 > 150 mm :29092 Other :17573 Class :character Class :character 1st Qu.:1 1st Qu.:2007 1st Qu.:1 Class :character
Median :24908 London : 6427 Median :3.000 Median :3.00 Median :12700 Missing : 2678 NA 1950-1966 : 8975 End terrace : 4530 101-150 m2:13503 D :21283 Unknown :12560 NA Mode :character Mode :character Median :1 Median :2010 Median :1 Mode :character
Mean :24908 East of England : 5241 Mean :2.993 Mean :2.92 Mean :13859 No gas : 6278 NA 1967-1982 : 9856 Mid terrace : 9857 > 151 m2 : 4365 E :10803 NA NA NA NA Mean :1 Mean :2009 Mean :1 NA
3rd Qu.:37362 West Midlands : 4909 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:18000 Valid 100 - 25k:40118 NA 1983-1995 : 5243 Bungalow : 5155 NA F : 2956 NA NA NA NA 3rd Qu.:1 3rd Qu.:2012 3rd Qu.:1 NA
Max. :49815 Yorkshire & The Humber: 4857 Max. :5.000 Max. :5.00 Max. :50000 NA NA 1996 onwards: 4894 Flat (incl. maisonette): 9384 NA G : 668 NA NA NA NA Max. :1 Max. :2012 Max. :1 NA
NA (Other) :14143 NA’s :2747 NA’s :47068 NA’s :9697 NA NA NA NA NA NA NA NA NA NA NA’s :39291 NA’s :39291 NA’s :37063 NA

5 Testing non-parametric distributions

5.1 Cavity wall insulation by dwelling age

We want to know if the distribution of the age of dwellings differes for those with/without cavity wall insulation.

kable(caption = "Cavity wall insulation by age of dwelling (counts)",
  table(needGasDT$PROP_AGE,needGasDT$CWI)
)
Cavity wall insulation by age of dwelling (counts)
Installed Not known
before 1930 695 12640
1930-1949 1712 5800
1950-1966 2968 6007
1967-1982 2856 7000
1983-1995 841 4402
1996 onwards 266 4628

Now re-run the table and create row proportions for ease of comparison:

kable(caption = "Cavity wall insulation by age of dwelling (row proportions)",
  round(prop.table(table(needGasDT$PROP_AGE,
                         needGasDT$CWI
                         )
                   ,1 # option 1 to prop.table gives column proportions
                   ) 
        ,2
        ) # round to 2 d.p
)
Cavity wall insulation by age of dwelling (row proportions)
Installed Not known
before 1930 0.05 0.95
1930-1949 0.23 0.77
1950-1966 0.33 0.67
1967-1982 0.29 0.71
1983-1995 0.16 0.84
1996 onwards 0.05 0.95

Please answer this and all the following questions in the blackboard test, NOT in the RMarkdown

  • Q1: Is there a difference in the distribution of cavity wall insulation between dwelling age groups?

We now need to test of the differences are statisitcally significant using a chi sq test:

chisq.test(needGasDT$PROP_AGE,needGasDT$CWI)
## 
##  Pearson's Chi-squared test
## 
## data:  needGasDT$PROP_AGE and needGasDT$CWI
## X-squared = 4165.1, df = 5, p-value < 2.2e-16
  • Q2: What would you conclude from the cavity wall insulation chi sq test?

5.2 Loft depth by dwelling age

In this example we want to test the 3 possible loft depth values:

kable(caption = "Loft depth by age of dwelling (row proportions)",
  round(prop.table(table(needGasDT$PROP_AGE,
                         needGasDT$LOFT_DEPTH
                         )
                   ,1 # option 1 to prop.table gives column proportions
                   ) 
        ,2
        ) # round to 2 d.p
)
Loft depth by age of dwelling (row proportions)
Under 150 mm > 150 mm Unknown
before 1930 0.17 0.50 0.33
1930-1949 0.18 0.65 0.18
1950-1966 0.16 0.64 0.20
1967-1982 0.16 0.60 0.24
1983-1995 0.22 0.56 0.22
1996 onwards 0.09 0.59 0.32
  • Q3: What would you conclude from the Loft depth table of row proportions?

We now need to test of the differences are statisitcally significant using a log linear model:

# make xtab
loftXt <- xtabs(~ PROP_AGE + LOFT_DEPTH, needGasDT)
# use it in log linear model
loglm(~ PROP_AGE + LOFT_DEPTH, loftXt, fit=TRUE)
## Call:
## loglm(formula = ~PROP_AGE + LOFT_DEPTH, data = loftXt, fit = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 1265.937 10        0
## Pearson          1237.130 10        0
  • Q4: Does the log linear test suggest a significant difference between some of the groups?

6 Predicting low gas consumption households

We want to test if the following characteristics are associated with low gas consumption:

  • property age (PROP_AGE)
  • property type (PROP_TYPE)
  • loft depth (LOFT_DEPTH)
  • floor area (FLOOR_AREA_BAND)

6.1 Prepare the data

We define high gas consumption as being in the bottom 20% of consumers. First create an indicator for these households so we can use it as our outcome variable:

# note that there are some NAs in the data so we have to tell R to exlcue these
  
gasCut <- quantile(needGasDT$Gcons2012, probs = c(0.2), na.rm = TRUE)

# create an indicator for highest 20% vs the rest
needGasDT$Gcons2012Lo <- ifelse(needGasDT$Gcons2012 > gasCut,
                                     0, # not low
                                     1 # low
                                     )

needGasDT$Gcons2012Lo <- factor(needGasDT$Gcons2012Lo,
                                     labels = c(
                                       "Not low",
                                       "Lowest 20%"
                                     )
                                     )

# check 
#prop.table(table(needGasDT$Gcons2012Lo))

# check that we have valid gas data for all of our variables
kable(caption="Check PROP_AGE",
      table(needGasDT$PROP_AGE,needGasDT$Gcons2012Lo, useNA = "always")
)
Check PROP_AGE
Not low Lowest 20% NA
before 1930 8504 2140 2691
1930-1949 5814 1061 637
1950-1966 6089 1767 1119
1967-1982 5897 1994 1965
1983-1995 2699 1005 1539
1996 onwards 2462 686 1746
NA 0 0 0
kable(caption="Check PROP_TYPE",
      table(needGasDT$PROP_TYPE,needGasDT$Gcons2012Lo, useNA = "always")
)
Check PROP_TYPE
Not low Lowest 20% NA
Detached 5848 279 1783
Semi-detached 10127 1600 1252
End terrace 3285 755 490
Mid terrace 6797 2232 828
Bungalow 3392 699 1064
Flat (incl. maisonette) 2016 3088 4280
NA 0 0 0
kable(caption="Check LOFT_DEPTH",
      table(needGasDT$LOFT_DEPTH,needGasDT$Gcons2012Lo, useNA = "always")
)
Check LOFT_DEPTH
Not low Lowest 20% NA
Under 150 mm 5602 1247 1314
> 150 mm 20200 4579 4313
Unknown 5663 2827 4070
NA 0 0 0
kable(caption="Check FLOOR_AREA_BAND",
      table(needGasDT$FLOOR_AREA_BAND,needGasDT$Gcons2012Lo, useNA = "always")
)
Check FLOOR_AREA_BAND
Not low Lowest 20% NA
1 to 50 m2 1134 2341 3301
51-100 m2 16581 5288 3302
101-150 m2 10919 914 1670
> 151 m2 2831 110 1424
NA 0 0 0

If we see ‘NA’ as a value in the row labels it means we are missing observations for this predictor variable.

6.2 Construct models

Now construct a logistic regression model to test which of the above variables predict low gas consumption in 2012 using the variables above.

logitGasLoM1 <- glm(formula = Gcons2012Lo ~ PROP_AGE + PROP_TYPE + LOFT_DEPTH + FLOOR_AREA_BAND, 
                   family = binomial(logit), needGasDT)

6.3 Model diagnostics

Before we interpret the results we need to check the model assumptions. These are a subset of the ones we run for linear OLS models.

# Diagnostics: independence of errors
# If p < 0.05 then a problem as implies autocorrelation
# but beware large sample sizes - you will get a small p value with large sample sizes, Instead look at the D-W statistic:
# if D-W is less than 1.0, there may be cause for alarm as small values of D-W indicate successive error terms are, on average, close in value to one another, or positively correlated. 
# Values arund 2 re usually considered acceptable
# If D-W > 2, successive error terms are, on average, very different in value from one another, i.e., negatively correlated.
durbinWatsonTest(logitGasLoM1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0215829      1.956769       0
##  Alternative hypothesis: rho != 0
# Diagnostics: collinearity (vif)
# if any values > 10 -> problem
vif(logitGasLoM1)
##                     GVIF Df GVIF^(1/(2*Df))
## PROP_AGE        1.295371  5        1.026217
## PROP_TYPE       3.223326  5        1.124166
## LOFT_DEPTH      1.461227  2        1.099460
## FLOOR_AREA_BAND 2.172279  3        1