---
title: 'FEEG6025: Week 7 - Logistic and Probit Regression'
author: "you@soton.ac.uk"
date: 'Last run at: `r Sys.time()`'
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_float: yes
  html_notebook:
    fig_caption: yes
    number_sections: yes
    toc: yes
  pdf_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
  word_document:
    fig_caption: yes
    toc: yes
---
# Set up libraries
You may need to install the following libraries first:

* ggplot2
* data.table
* readr
* stargazer
* car
* MASS
* broom
* knitr

You may also need to update all packages. Make sure you include `dependencies = TRUE` if you do this.
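
If any are missing they can all be installed in one go; a sketch (the package names are those listed above, and `dependencies = TRUE` also pulls in the packages they themselves depend on):

```r
# install the packages this worksheet uses; dependencies = TRUE also
# installs the packages these packages themselves depend on
install.packages(c("ggplot2", "data.table", "readr", "stargazer",
                   "car", "MASS", "broom", "knitr"),
                 dependencies = TRUE)
```
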
```{r setup, include=FALSE}
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
startTime <- Sys.time()
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE) # set to FALSE for the Task desc
# if these fail to load R does not stop but will generate errors later when their functions are called
library(ggplot2) # for graphs
library(readr) # for fast data reads
library(data.table) # for data munging
library(stargazer) # for nice (but slightly complex) tables
library(car) # for regression tools & tests
library(MASS) # for probit regression
library(broom) # for tidy regression outputs
library(knitr) # for nicely formatted tables using kable
# set input file name & location
needfile <- "http://www.soton.ac.uk/~ba1e12/data/feeg6025/need_public_use_file_2014.csv.gz"
#needfile <- "~/OneDrive - University of Southampton/FEE/FEEG6025/Generic data/NEED/need_public_use_file_2014.csv.gz"
```
# Background
Today's work will use the DECC NEED public use file, a sample of 49,815 records selected to be representative of the housing stock and available to all via the DECC website and data.gov.uk: https://data.gov.uk/dataset/nationalenergyefficiencydataframeworkanonymiseddata.
We will use it to explore two research questions:

* What predicts low household gas consumption in 2012?
* What predicts being a high, medium or low gas consumer in 2012?

These questions will require different modelling approaches.
# Load and process data
First load the data.
```{r loadNeed, echo = FALSE}
# Code will not show up in output
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
print(paste0("Loading: ", needfile))
# this may take a few seconds depending on the location you are loading from - it is quite large!
needPulf2014DT <- as.data.table(read_csv(needfile))
print(paste0("That produced a file of ", uniqueN(needPulf2014DT$HH_ID), " households."))
```
Now process the data to remove years we are not interested in and tidy up variable labels etc. This leaves the following variables:
```{r processData, echo=FALSE}
# Code will not show up in output
# DO NOT EDIT ANYTHING IN THIS CHUNK!!!
# create/code some factor variables ----
# see https://blackboard.soton.ac.uk/webapps/blackboard/content/listContentEditable.jsp?content_id=_3196230_1&course_id=_170962_1 -> "Excel workbook explaining variable codes"
# Gas readings - too big/too small/missing/no gas/valid (see documentation)
needPulf2014DT$Gcons2012Valid <- factor(needPulf2014DT$Gcons2012Valid,
labels = c("> 50k",
"< 100",
"Missing",
"No gas",
"Valid 100 - 25k" ))
# Same for electricity - only two
needPulf2014DT$Econs2012Valid <- factor(needPulf2014DT$Econs2012Valid,
labels = c("> 25k",
#"< 100",
#"Missing",
"Valid 100 - 25k" ))
# Main heat fuel
needPulf2014DT$MAIN_HEAT_FUEL <- factor(needPulf2014DT$MAIN_HEAT_FUEL,
labels = c("Gas", "Electricity"))
# Do they have economy 7 electricity?
needPulf2014DT$E7Flag2012 <- factor(needPulf2014DT$E7Flag2012,
labels = c("Econ7 = Yes"))
# Age of property
needPulf2014DT$PROP_AGE <- factor(needPulf2014DT$PROP_AGE,
labels = c("before 1930",
"1930-1949",
"1950-1966",
"1967-1982",
"1983-1995",
"1996 onwards"))
# Type of property
needPulf2014DT$PROP_TYPE <- factor(needPulf2014DT$PROP_TYPE,
labels = c("Detached",
"Semi-detached",
"End terrace",
"Mid terrace",
"Bungalow",
"Flat (incl. maisonette)"))
# Floor area
needPulf2014DT$FLOOR_AREA_BAND <- factor(needPulf2014DT$FLOOR_AREA_BAND,
labels = c("1 to 50 m2",
"51-100 m2",
"101-150 m2",
"> 151 m2"))
# Energy efficiency band
needPulf2014DT$EE_BAND <- factor(needPulf2014DT$EE_BAND,
labels = c("A or B",
"C",
"D",
"E",
"F",
"G"))
# Wall construction
needPulf2014DT$WALL_CONS <- factor(needPulf2014DT$WALL_CONS,
labels = c("Cavity",
"Other"))
# Cavity wall insulation
needPulf2014DT$CWI <- ifelse(is.na(needPulf2014DT$CWI),
"Not known",
"Installed")
# Cavity wall insulation year
needPulf2014DT$CWI_YEAR <- ifelse(is.na(needPulf2014DT$CWI_YEAR),
"Not known",
needPulf2014DT$CWI_YEAR)
# Boiler installation year
needPulf2014DT$BOILER_YEAR <- ifelse(is.na(needPulf2014DT$BOILER_YEAR),
"Not known",
needPulf2014DT$BOILER_YEAR)
# Loft depth
needPulf2014DT$LOFT_DEPTH <- factor(needPulf2014DT$LOFT_DEPTH,
labels = c("Under 150 mm",
"> 150 mm",
"Unknown")
) # 99 = NA (helpfully!)
# Region
needPulf2014DT$REGION <- factor(needPulf2014DT$REGION,
labels = c("North East",
"North West",
"Yorkshire & The Humber",
"East Midlands",
"West Midlands",
"East of England",
"London",
"South East",
"South West",
"Wales"))
# leave li_year & boiler_year to later
# remove elec to form a simple data table with just the gas & household attribute data ----
needGasDT <- needPulf2014DT[, !grep("Econs|E7Flag|200[0-9]|2010|2011", names(needPulf2014DT)), with = FALSE]
names(needGasDT)
```
# Inspecting the data
What is in the data and what are the central tendencies?
```{r inspectData}
# Hint: use summary() and wrap the kable function around it to get a pretty table
kable(caption = "Summary of DECC NEED 2014 PUL file",
summary(needGasDT)
)
```
# Testing non-parametric distributions
## Cavity wall insulation by dwelling age
We want to know if the distribution of the age of dwellings differs for those with/without cavity wall insulation.
```{r cwiByAgeTableN}
kable(caption = "Cavity wall insulation by age of dwelling (counts)",
table(needGasDT$PROP_AGE,needGasDT$CWI)
)
```
Now re-run the table and create row proportions for ease of comparison:
```{r cwiByAgeTable%}
kable(caption = "Cavity wall insulation by age of dwelling (row proportions)",
round(prop.table(table(needGasDT$PROP_AGE,
needGasDT$CWI
)
,1 # margin = 1 gives row proportions
)
,2
) # round to 2 d.p
)
```
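
The `margin` argument to `prop.table` controls which way the proportions sum to 1. A minimal sketch on a small hypothetical 2 x 2 table (not the NEED data):

```r
# hypothetical counts, for illustration only
m <- matrix(c(10, 30, 20, 40), nrow = 2,
            dimnames = list(age = c("older", "newer"),
                            cwi = c("No", "Yes")))
prop.table(m, 1) # margin = 1: each ROW sums to 1 (row proportions)
prop.table(m, 2) # margin = 2: each COLUMN sums to 1
```
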
**Please answer this and all the following questions in the Blackboard test, NOT in the RMarkdown.**

* Q1: Is there a difference in the distribution of cavity wall insulation between dwelling age groups?

We now need to test whether the differences are statistically significant using a chi-squared test:
```{r testCwiByAge}
chisq.test(needGasDT$PROP_AGE,needGasDT$CWI)
```
* Q2: What would you conclude from the cavity wall insulation chi sq test?
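
If the test is significant it can help to look at where the observed counts diverge from the counts expected under independence; `chisq.test` returns both. A sketch with hypothetical counts (not the NEED data):

```r
# hypothetical 2 x 2 counts, for illustration only
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
res <- chisq.test(tab)
res$expected # the counts we would expect if the two variables were independent
res$observed - res$expected # positive cells occur more often than expected
```
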
## Loft depth by dwelling age
In this example we want to test the 3 possible loft depth values:
```{r ldByAgeTable%}
kable(caption = "Loft depth by age of dwelling (row proportions)",
round(prop.table(table(needGasDT$PROP_AGE,
needGasDT$LOFT_DEPTH
)
,1 # margin = 1 gives row proportions
)
,2
) # round to 2 d.p
)
```
* Q3: What would you conclude from the Loft depth table of row proportions?

We now need to test whether the differences are statistically significant using a log-linear model:
```{r logLoftByAge}
# make xtab
loftXt <- xtabs(~ PROP_AGE + LOFT_DEPTH, needGasDT)
# use it in log linear model
loglm(~ PROP_AGE + LOFT_DEPTH, loftXt, fit=TRUE)
```
* Q4: Does the log linear test suggest a significant difference between some of the groups?
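
The independence model fitted by `loglm` is the log-linear analogue of the chi-squared test: `~ PROP_AGE + LOFT_DEPTH` contains main effects only (no interaction term), so the likelihood ratio and Pearson statistics it reports measure the association between the two variables. A sketch on a hypothetical table (not the NEED data):

```r
library(MASS) # for loglm
# hypothetical 2 x 3 table, for illustration only
tab <- as.table(matrix(c(20, 30, 25, 25, 15, 35), nrow = 2,
                       dimnames = list(age = c("older", "newer"),
                                       depth = c("low", "mid", "high"))))
fit <- loglm(~ age + depth, tab) # independence model: no interaction term
fit # the Pearson statistic matches the (uncorrected) chi-squared test
```
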
# Predicting low gas consumption households
We want to test if the following characteristics are associated with low gas consumption:

* property age (PROP_AGE)
* property type (PROP_TYPE)
* loft depth (LOFT_DEPTH)
* floor area (FLOOR_AREA_BAND)

## Prepare the data
We define low gas consumption as being in the bottom 20% of consumers. First create an indicator for these households so we can use it as our outcome variable:
```{r setLowCons}
# note that there are some NAs in the data so we have to tell R to exclude these
gasCut <- quantile(needGasDT$Gcons2012, probs = c(0.2), na.rm = TRUE)
# create an indicator for the lowest 20% vs the rest
needGasDT$Gcons2012Lo <- ifelse(needGasDT$Gcons2012 > gasCut,
0, # not low
1 # low
)
needGasDT$Gcons2012Lo <- factor(needGasDT$Gcons2012Lo,
labels = c(
"Not low",
"Lowest 20%"
)
)
# check
#prop.table(table(needGasDT$Gcons2012Lo))
# check that we have valid gas data for all of our variables
kable(caption="Check PROP_AGE",
table(needGasDT$PROP_AGE,needGasDT$Gcons2012Lo, useNA = "always")
)
kable(caption="Check PROP_TYPE",
table(needGasDT$PROP_TYPE,needGasDT$Gcons2012Lo, useNA = "always")
)
kable(caption="Check LOFT_DEPTH",
table(needGasDT$LOFT_DEPTH,needGasDT$Gcons2012Lo, useNA = "always")
)
kable(caption="Check FLOOR_AREA_BAND",
table(needGasDT$FLOOR_AREA_BAND,needGasDT$Gcons2012Lo, useNA = "always")
)
```
If we see 'NA' as a value in the row labels it means we are missing observations for this predictor variable.
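
A quick way to see the scale of missingness is to count the `NA`s per column; a sketch on a hypothetical data frame (not the NEED data):

```r
# hypothetical data frame with some missing values, for illustration only
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c("x", "y", NA, NA),
                 stringsAsFactors = FALSE)
sapply(df, function(col) sum(is.na(col))) # count of NAs per column
```
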
## Construct models
Now construct a logistic regression model to test which of the above variables predict low gas consumption in 2012 using the variables above.
```{r modelLoCons}
logitGasLoM1 <- glm(formula = Gcons2012Lo ~ PROP_AGE + PROP_TYPE + LOFT_DEPTH + FLOOR_AREA_BAND,
family = binomial(logit), needGasDT)
```
## Model diagnostics
Before we interpret the results we need to check the model assumptions. These are a subset of the ones we run for linear OLS models.
```{r diagnosticsLoCon}
# Diagnostics: independence of errors
# If p < 0.05 then there may be a problem, as this implies autocorrelation.
# But beware large sample sizes - these will produce a small p value regardless. Instead look at the D-W statistic:
# if D-W is less than 1.0 there may be cause for alarm, as small values indicate successive error terms are, on average, close in value to one another (positively correlated).
# Values around 2 are usually considered acceptable.
# If D-W > 2, successive error terms are, on average, very different in value from one another, i.e. negatively correlated.
durbinWatsonTest(logitGasLoM1)
# Diagnostics: collinearity (vif)
# if any values > 10 -> problem
vif(logitGasLoM1)
# Diagnostics: collinearity (tolerance)
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem
1/vif(logitGasLoM1)
```
* Q5: Which two variables appear to be to some extent collinear?

Although the values are not greater than 10 (see above) we will remove the variable with the highest variance inflation factor (VIF) score, re-run the model and re-test the diagnostics.

* Q6: Which variable should you remove?

Now re-run your corrected model:
```{r modelLoConsCorrected}
logitGasLoM2 <- glm(formula = Gcons2012Lo ~ PROP_AGE + LOFT_DEPTH + FLOOR_AREA_BAND ,
family = binomial(logit), needGasDT)
# Diagnostics: independence of errors
# If p < 0.05 then there may be a problem, as this implies autocorrelation.
# But beware large sample sizes - these will produce a small p value regardless. Instead look at the D-W statistic:
# if D-W is less than 1.0 there may be cause for alarm, as small values indicate successive error terms are, on average, close in value to one another (positively correlated).
# Values around 2 are usually considered acceptable.
# If D-W > 2, successive error terms are, on average, very different in value from one another, i.e. negatively correlated.
durbinWatsonTest(logitGasLoM2)
# Diagnostics: collinearity (vif)
# if any values > 10 -> problem
vif(logitGasLoM2)
# Diagnostics: collinearity (tolerance)
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem
1/vif(logitGasLoM2)
```
* Q7: Has removing that variable produced a more acceptable model?
## Model Results
Now get stargazer to produce tidy outputs of the final model:
```{r modelLoConsResults, results='asis'}
# NB: we need to force knitr to keep the html (asis)
stargazer(logitGasLoM2,
title = "Gas model results",
ci = TRUE,
single.row = TRUE,
type = "html")
# visualise using ggplot
library(broom) # in case you didn't
logitGasLoM2DF <- tidy(logitGasLoM2)
logitGasLoM2DF$ci_lower <- logitGasLoM2DF$estimate - qnorm(0.975) * logitGasLoM2DF$std.error
logitGasLoM2DF$ci_upper <- logitGasLoM2DF$estimate + qnorm(0.975) * logitGasLoM2DF$std.error
logitGasLoM2DF$model <- "Model 1"
plotModel <- ggplot(logitGasLoM2DF, aes(x = term, y = estimate)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=ci_lower, ymax=ci_upper),
width=.2, # Width of the error bars
colour = "black", # make them visible
position=position_dodge(.9)
) +
labs(y = "Log odds") +
coord_flip() # turn it round to make it legible
plotModel
```
Note that stargazer has produced `log odds`, which range from negative values (large negative effect) through 0 (no effect) to positive values (large positive effect). Note also that the first category of each factor is not displayed because the coefficients represent comparisons to this reference category.
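
Log odds can be hard to interpret directly; exponentiating converts them to odds ratios (values below 1 mean less likely, above 1 more likely, relative to the reference category). A self-contained sketch using R's built-in `mtcars` data rather than the NEED data:

```r
# logistic model of transmission type (am) on car weight, for illustration only
m <- glm(am ~ wt, data = mtcars, family = binomial(logit))
exp(coef(m))    # odds ratios: the 'wt' value is < 1, i.e. heavier cars
                # have lower odds of having a manual transmission
exp(confint(m)) # profiled 95% CIs on the odds-ratio scale
```
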
Questions:

* Q8: How many observations are in the model?
* Q9: Why is this not all of them?
* Q10: What is the AIC score of the model?
* Q11: What can we conclude about the relationship between property age and the probability of being a low gas consumer?
* Q12: Which property age is the least likely to be in the low gas consumption group?
* Q13: What can we conclude about the relationship between floor area and the probability of being a low gas consumer?
* Q14: If your floor area is > 151 m2, how many times less likely are you to be in the low gas consumption group than if your floor area is 1 to 50 m2?
* Q15: If we were to re-run this model for electricity do you think the model would perform as well?
* Q16: Give reasons for your answer.
# Predicting low gas consumption in England and Wales
You will notice that there are two IMD_ variables:

* IMD_ENG: the English index of deprivation decile for dwellings in England
* IMD_WALES: the Welsh index of deprivation decile for dwellings in Wales

We would like to know whether households in more deprived areas (low IMD decile) are more likely to be low gas consumers controlling for floor area etc. To do this we need to run two models, one for England and one for Wales. Before we do that we will check the data using a bar chart.
```{r imdPlots}
# Create a ggplot of the mean of a value (with 95% CI) across a factor variable
ggplot(needGasDT, aes(IMD_ENG, Gcons2012)) +
stat_summary(fun.y = "mean", geom = "bar", fill = "grey70") +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.4) +
labs(title = "Mean annual gas consumption by English IMD decile (Error bars = 95% CI for the mean)")
ggplot(needGasDT, aes(IMD_WALES, Gcons2012)) +
stat_summary(fun.y = "mean", geom = "bar", fill = "grey70") +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.4) +
labs(title = "Mean annual gas consumption by Welsh IMD decile (Error bars = 95% CI for the mean)")
```
Now run your models.
```{r imdModels}
logitGasLoM2Eng <- glm(formula = Gcons2012Lo ~ PROP_AGE + LOFT_DEPTH + FLOOR_AREA_BAND + as.factor(IMD_ENG),
family = binomial(logit), needGasDT)
logitGasLoM2Wal <- glm(formula = Gcons2012Lo ~ PROP_AGE + LOFT_DEPTH + FLOOR_AREA_BAND + as.factor(IMD_WALES),
family = binomial(logit), needGasDT)
```
## Model Results
Normally we should also run the diagnostic checks but for now we will get stargazer to produce tidy outputs of the final models:
```{r modelHiConsResultsIMD, results='asis'}
# NB: we need to force knitr to keep the html (asis)
stargazer(logitGasLoM2Wal,logitGasLoM2Eng,
title = "Gas model results",
ci = TRUE,
single.row = TRUE,
type = "html")
library(broom) # in case you didn't
logitGasLoM2WalDF <- tidy(logitGasLoM2Wal)
logitGasLoM2WalDF$ci_lower <- logitGasLoM2WalDF$estimate - qnorm(0.975) * logitGasLoM2WalDF$std.error
logitGasLoM2WalDF$ci_upper <- logitGasLoM2WalDF$estimate + qnorm(0.975) * logitGasLoM2WalDF$std.error
logitGasLoM2WalDF$model <- "Wales"
logitGasLoM2EngDF <- tidy(logitGasLoM2Eng)
logitGasLoM2EngDF$ci_lower <- logitGasLoM2EngDF$estimate - qnorm(0.975) * logitGasLoM2EngDF$std.error
logitGasLoM2EngDF$ci_upper <- logitGasLoM2EngDF$estimate + qnorm(0.975) * logitGasLoM2EngDF$std.error
logitGasLoM2EngDF$model <- "England"
# combine the results
modelsDF <- rbind(logitGasLoM2WalDF, logitGasLoM2EngDF)
# draw the plots using 'model' to separate them
plotModels <- ggplot(modelsDF, aes(x = term, y = estimate)) +
geom_bar(aes(fill = model),
position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=ci_lower, ymax=ci_upper, group = model),
width=.2, # Width of the error bars
colour = "black", # make them visible
position=position_dodge(.9)
) +
labs(y = "Log odds") +
coord_flip() # turn it round to make it legible
plotModels
```
Questions:

* Q17: If you live in a high deprivation area (low IMD decile) are you more or less likely to be a low gas user in Wales?
* Q18: If you live in a low deprivation area (high IMD decile) are you more or less likely to be a low gas user in Wales?
* Q19: Do we see the same pattern in England?
* Q20: Overall what would you conclude from these models about whether people in deprived areas use more or less gas?
# Predicting low, medium and high gas users
In contrast to the previous questions, we now want to know what predicts being a low, medium or high gas user. To do this we will use an ordered probit regression using the same variables:

* property age (PROP_AGE)
* loft depth (LOFT_DEPTH)
* floor area (FLOOR_AREA_BAND)

## Prepare the data
First we need to create the low/medium/high categories. In this case we will define high as the top 20% and medium to be the middle 60%.
```{r gasCuts}
# note that there are some NAs in the data so we have to tell R to exclude these
gasCutLo <- quantile(needGasDT$Gcons2012, probs = c(0.2), na.rm = TRUE)
gasCutHi <- quantile(needGasDT$Gcons2012, probs = c(0.8), na.rm = TRUE)
# create the three-way indicator: start with the lowest 20% vs the rest
needGasDT$Gcons2012cuts <- ifelse(needGasDT$Gcons2012 <= gasCutLo,
                                  1, # low
                                  2 # the rest (the high group is recoded next)
)
needGasDT$Gcons2012cuts <- ifelse(needGasDT$Gcons2012 >= gasCutHi,
                                  3, # high
                                  needGasDT$Gcons2012cuts # keep the previous coding
)
needGasDT$Gcons2012cuts <- factor(needGasDT$Gcons2012cuts,
labels = c(
"Lowest 20%",
"Middle 60%",
"Highest 20%"
)
)
# check
#prop.table(table(needGasDT$Gcons2012cuts))
```
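
The two-step `ifelse` above can also be done in one step with `cut`, using the 20th and 80th percentiles as break points; a sketch with hypothetical consumption values (not the NEED data):

```r
# hypothetical annual consumption values, for illustration only
x <- c(100, 500, 1000, 2000, 5000, 9000, 12000, 20000, 30000, 40000)
cuts <- cut(x,
            breaks = quantile(x, probs = c(0, 0.2, 0.8, 1), na.rm = TRUE),
            labels = c("Lowest 20%", "Middle 60%", "Highest 20%"),
            include.lowest = TRUE)
table(cuts)
```

Note that `cut` uses left-open intervals, so observations sitting exactly on a percentile boundary may be classified slightly differently from the `<=`/`>=` version above.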
## Run the model:
```{r modelGasCuts}
library(MASS) # in case you didn't do so before (install first)
# an ordered outcome needs an ordered model: polr() fits an ordered probit
# when method = "probit". Hess = TRUE stores the Hessian so that summary()
# can calculate standard errors
probitGasCutsM1 <- polr(Gcons2012cuts ~ PROP_AGE + LOFT_DEPTH + FLOOR_AREA_BAND,
                        data = needGasDT, method = "probit", Hess = TRUE)
```
## Model Results
Normally we would again conduct model diagnostics but for now just print the model results:
```{r modelHiGasCutsResults}
summary(probitGasCutsM1)
```
In the output above, the first thing we see is the call; this is R reminding us what model we ran, what options we specified, etc.
The next part of the output shows the coefficients, their standard errors and their t-statistics.
Below the coefficients are the intercepts: the estimated cut-points on the latent scale between the "Lowest 20%"/"Middle 60%" and "Middle 60%"/"Highest 20%" categories. Finally come the fit indices, including the residual deviance and the AIC, which can be used to compare models.
Note that because this is a probit model the coefficients are shifts on a standard normal (z) scale rather than log odds, but they read the same way: negative values mean a lower probability of being in a higher consumption category, 0 means no effect and positive values mean a higher probability.
Questions:

* Q21: What is the AIC score of the model?
* Q22: What would you conclude from this model?
We can use the confint function to obtain confidence intervals for the coefficient estimates. These will be profiled confidence intervals by default, created by profiling the likelihood function. As such, they are not necessarily symmetric.
```{r probitConfint}
confint(probitGasCutsM1)
```
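
On the probit scale a coefficient is a shift in a standard normal z-score, so `pnorm` converts values of the linear predictor into probabilities. A toy numeric sketch (hypothetical numbers, not the fitted model):

```r
# a coefficient of 0.5 shifts a baseline z of 0 (probability 0.5)
# up to a probability of about 0.69
pnorm(0 + 0.5)
# the same shift from a baseline z of -1 (probability ~0.16)
# gives a probability of about 0.31
pnorm(-1 + 0.5)
```
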
# About
Analysis completed in: `r round(Sys.time() - startTime, 2)` seconds using [knitr](https://cran.r-project.org/package=knitr) in [RStudio](http://www.rstudio.com) with `r R.version.string` running on `r R.version$platform`.