1 Set up libraries

You may need to install the following libraries first:

  • ggplot2
  • data.table
  • readr
  • stargazer
  • car
  • polycor
  • broom
  • MASS
  • knitr

You may also need to update all packages. Make sure you include ‘dependencies’ if you do this.

2 Background

Today’s work will use the use the DECC NEED public use file, a sample of 49,815 records selected to be representative of the housing stock and available to all via the DECC website and data.gov.uk: https://data.gov.uk/dataset/nationalenergyefficiencydataframeworkanonymiseddata.

We will use it to explore two research questions:

  • What predicts high household gas consumption in 2012?
  • What predicts being a high, medium or low gas consumer in 2012?

These questions will require different modelling approaches.

3 Load and process data

First load the data.

## [1] "Loading: http://www.soton.ac.uk/~ba1e12/data/feeg6025/need_public_use_file_2014.csv.gz"
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   REGION = col_character(),
##   Gcons2005Valid = col_character(),
##   Gcons2006Valid = col_character(),
##   Gcons2007Valid = col_character(),
##   Gcons2008Valid = col_character(),
##   Gcons2009Valid = col_character(),
##   Gcons2010Valid = col_character(),
##   Gcons2011Valid = col_character(),
##   Gcons2012Valid = col_character(),
##   Econs2005Valid = col_character(),
##   Econs2006Valid = col_character(),
##   Econs2007Valid = col_character(),
##   Econs2008Valid = col_character(),
##   Econs2009Valid = col_character(),
##   Econs2010Valid = col_character(),
##   Econs2011Valid = col_character(),
##   Econs2012Valid = col_character()
## )
## See spec(...) for full column specifications.
## [1] "That produced a file of 49815 households."

Now process the data to remove years we are not interested in and tidy up variable labels etc. This leaves the following variables:

##  [1] "HH_ID"           "REGION"          "IMD_ENG"        
##  [4] "IMD_WALES"       "Gcons2012"       "Gcons2012Valid" 
##  [7] "MAIN_HEAT_FUEL"  "PROP_AGE"        "PROP_TYPE"      
## [10] "FLOOR_AREA_BAND" "EE_BAND"         "LOFT_DEPTH"     
## [13] "WALL_CONS"       "CWI"             "CWI_YEAR"       
## [16] "LI"              "LI_YEAR"         "BOILER"         
## [19] "BOILER_YEAR"

4 Inspecting the data

What is in the data and what are the central tendencies?

# Hint: use summary() and wrap the kable function around it to get a pretty table
kable(caption = "Summary of DECC NEED 2014 PUL file",
  summary(needGasDT)
)
Summary of DECC NEED 2014 PUL file
HH_ID REGION IMD_ENG IMD_WALES Gcons2012 Gcons2012Valid MAIN_HEAT_FUEL PROP_AGE PROP_TYPE FLOOR_AREA_BAND EE_BAND LOFT_DEPTH WALL_CONS CWI CWI_YEAR LI LI_YEAR BOILER BOILER_YEAR
Min. : 1 South East : 7633 Min. :1.000 Min. :1.00 Min. : 0 > 50k : 307 Gas :42546 before 1930 :13335 Detached : 7910 1 to 50 m2: 6776 A or B: 1437 Under 150 mm: 8163 Cavity:32242 Length:49815 Length:49815 Min. :1 Min. :2004 Min. :1 Length:49815
1st Qu.:12454 North West : 6605 1st Qu.:2.000 1st Qu.:2.00 1st Qu.: 8400 < 100 : 434 Electricity: 7269 1930-1949 : 7512 Semi-detached :12979 51-100 m2 :25171 C :12668 > 150 mm :29092 Other :17573 Class :character Class :character 1st Qu.:1 1st Qu.:2007 1st Qu.:1 Class :character
Median :24908 London : 6427 Median :3.000 Median :3.00 Median :12700 Missing : 2678 NA 1950-1966 : 8975 End terrace : 4530 101-150 m2:13503 D :21283 Unknown :12560 NA Mode :character Mode :character Median :1 Median :2010 Median :1 Mode :character
Mean :24908 East of England : 5241 Mean :2.993 Mean :2.92 Mean :13859 No gas : 6278 NA 1967-1982 : 9856 Mid terrace : 9857 > 151 m2 : 4365 E :10803 NA NA NA NA Mean :1 Mean :2009 Mean :1 NA
3rd Qu.:37362 West Midlands : 4909 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:18000 Valid 100 - 25k:40118 NA 1983-1995 : 5243 Bungalow : 5155 NA F : 2956 NA NA NA NA 3rd Qu.:1 3rd Qu.:2012 3rd Qu.:1 NA
Max. :49815 Yorkshire & The Humber: 4857 Max. :5.000 Max. :5.00 Max. :50000 NA NA 1996 onwards: 4894 Flat (incl. maisonette): 9384 NA G : 668 NA NA NA NA Max. :1 Max. :2012 Max. :1 NA
NA (Other) :14143 NA’s :2747 NA’s :47068 NA’s :9697 NA NA NA NA NA NA NA NA NA NA NA’s :39291 NA’s :39291 NA’s :37063 NA

5 Testing non-parametric distributions

5.1 Cavity wall insulation by dwelling age

We want to know if the distribution of the age of dwellings differes for those with/without cavity wall insulation.

kable(caption = "Cavity wall insulation by age of dwelling (counts)",
  table(needGasDT$PROP_AGE,needGasDT$CWI)
)
Cavity wall insulation by age of dwelling (counts)
Installed Not known
before 1930 695 12640
1930-1949 1712 5800
1950-1966 2968 6007
1967-1982 2856 7000
1983-1995 841 4402
1996 onwards 266 4628

Now re-run the table and create row proportions for ease of comparison:

kable(caption = "Cavity wall insulation by age of dwelling (row proportions)",
  round(prop.table(table(needGasDT$PROP_AGE,
                         needGasDT$CWI
                         )
                   ,1 # option 1 to prop.table gives column proportions
                   ) 
        ,2
        ) # round to 2 d.p
)
Cavity wall insulation by age of dwelling (row proportions)
Installed Not known
before 1930 0.05 0.95
1930-1949 0.23 0.77
1950-1966 0.33 0.67
1967-1982 0.29 0.71
1983-1995 0.16 0.84
1996 onwards 0.05 0.95

Please answer this and all the following questions in the blackboard test, NOT in the RMarkdown

  • Q1: Is there a difference in the distribution of cavity wall insulation between dwelling age groups?

We now need to test of the differences are statisitcally significant using a chi sq test:

  • Q2: What would you conclude from the cavity wall insulation chi sq test?

5.2 Loft depth by dwelling age

In this example we want to test the 3 possible loft depth values:

kable(caption = "Loft depth by age of dwelling (row proportions)",
  round(prop.table(table(needGasDT$PROP_AGE,
                         needGasDT$LOFT_DEPTH
                         )
                   ,1 # option 1 to prop.table gives column proportions
                   ) 
        ,2
        ) # round to 2 d.p
)
Loft depth by age of dwelling (row proportions)
Under 150 mm > 150 mm Unknown
before 1930 0.17 0.50 0.33
1930-1949 0.18 0.65 0.18
1950-1966 0.16 0.64 0.20
1967-1982 0.16 0.60 0.24
1983-1995 0.22 0.56 0.22
1996 onwards 0.09 0.59 0.32
  • Q3: What would you conclude from the Loft depth table of row proportions?

We now need to test of the differences are statisitcally significant using a log linear model:

# make xtab
loftXt <- xtabs(~ PROP_AGE + LOFT_DEPTH, needGasDT)
# use it in log linear model
  • Q4: Does the log linear test suggest a significant difference between some of the groups?

6 Predicting low gas consumption households

We want to test if the following characteristics are associated with low gas consumption:

  • property age (PROP_AGE)
  • property type (PROP_TYPE)
  • loft depth (LOFT_DEPTH)
  • floor area (FLOOR_AREA_BAND)

6.1 Prepare the data

We define high gas consumption as being in the bottom 20% of consumers. First create an indicator for these households so we can use it as our outcome variable:

# note that there are some NAs in the data so we have to tell R to exlcue these
  
gasCut <- quantile(needGasDT$Gcons2012, probs = c(0.2), na.rm = TRUE)

# create an indicator for highest 20% vs the rest
needGasDT$Gcons2012Lo <- ifelse(needGasDT$Gcons2012 > gasCut,
                                     0, # not low
                                     1 # low
                                     )

needGasDT$Gcons2012Lo <- factor(needGasDT$Gcons2012Lo,
                                     labels = c(
                                       "Not low",
                                       "Lowest 20%"
                                     )
                                     )

# check 
#prop.table(table(needGasDT$Gcons2012Lo))

# check that we have valid gas data for all of our variables
kable(caption="Check PROP_AGE",
      table(needGasDT$PROP_AGE,needGasDT$Gcons2012Lo, useNA = "always")
)
Check PROP_AGE
Not low Lowest 20% NA
before 1930 8504 2140 2691
1930-1949 5814 1061 637
1950-1966 6089 1767 1119
1967-1982 5897 1994 1965
1983-1995 2699 1005 1539
1996 onwards 2462 686 1746
NA 0 0 0
kable(caption="Check PROP_TYPE",
      table(needGasDT$PROP_TYPE,needGasDT$Gcons2012Lo, useNA = "always")
)
Check PROP_TYPE
Not low Lowest 20% NA
Detached 5848 279 1783
Semi-detached 10127 1600 1252
End terrace 3285 755 490
Mid terrace 6797 2232 828
Bungalow 3392 699 1064
Flat (incl. maisonette) 2016 3088 4280
NA 0 0 0
kable(caption="Check LOFT_DEPTH",
      table(needGasDT$LOFT_DEPTH,needGasDT$Gcons2012Lo, useNA = "always")
)
Check LOFT_DEPTH
Not low Lowest 20% NA
Under 150 mm 5602 1247 1314
> 150 mm 20200 4579 4313
Unknown 5663 2827 4070
NA 0 0 0
kable(caption="Check FLOOR_AREA_BAND",
      table(needGasDT$FLOOR_AREA_BAND,needGasDT$Gcons2012Lo, useNA = "always")
)
Check FLOOR_AREA_BAND
Not low Lowest 20% NA
1 to 50 m2 1134 2341 3301
51-100 m2 16581 5288 3302
101-150 m2 10919 914 1670
> 151 m2 2831 110 1424
NA 0 0 0

If we see ‘NA’ as a value in the row labels it means we are missing observations for this predictor variable.

6.2 Construct models

Now construct a logistic regression model to test which of the above variables predict low gas consumption in 2012 using the variables above.

6.3 Model diagnostics

Before we interpret the results we need to check the model assumptions. These are a subset of the ones we run for linear OLS models.

  • Q5: Which two variables appear to be to some extent collinear?

Although the values are not greater than 10 (see above) we will remove the variable with the highest variance inflation factor (vif) score, re-run the models and re-test the diagnostics.

  • Q6: Which variable should you remove?

Now re-run your corrected model:

  • Q7: Has removing that variable produced a more acceptable model?

6.4 Model Results

Now get stargazer to produce tidy outputs of the final model:

Note that stargazer has produced log odds which range from -ve values (large negative effect) through 0 (no effect) to +ve values (large positive effect). Note also that the first category of the factors is not displayed because the coefficients represent a comparison to this ‘missing’ category.

Questions:

  • Q8: How many observations are in the model?
  • Q9: Why is this not all of them?
  • Q10: What is the AIC score of the model?
  • Q11: What can we conclude about the relationship between property age and the probability of being a low gas consumer?
  • Q12: Which property age is the least likely to be in the low gas consumption group?
  • Q13: What can we conclude about the relationship between floor area and the probability of being a low gas consumer?
  • Q14: If your floor area is > 151 m2 how many times less likely are you to be in the low gas consumption group than if your floor area is < 50 m2
  • Q15: If we were to re-run this model for electricity do you think the model would perform as well?
  • Q16: Give reasons for your answer.