analyse-NEED-EULF-2014-models-v2.0.do 9.13 KB
Newer Older
1
2
* Script to analyse DECC's 2014 EULF NEED data to:
* - investigate % variance of energy consumption due to dwelling type variables as a way to infer the % of variance due to people
3
4
5
6
7
8

* NB this script uses 2 data files derived from the original data using the 'process' script

* Original data available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
* http://discover.ukdataservice.ac.uk/catalogue/?sn=7518

9
/*   
10

11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
Copyright (C) 2014  University of Southampton

Author: Ben Anderson (b.anderson@soton.ac.uk, @dataknut, https://github.com/dataknut) 
	[Energy & Climate Change, Faculty of Engineering & Environment, University of Southampton]

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License 
(http://choosealicense.com/licenses/gpl-2.0/), or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

#YMMV - http://en.wiktionary.org/wiki/YMMV

*/
29
30
31
32
33
34
35
36
37
38
39

clear all

capture noisily log close

* written for Mac OSX - remember to change filesystem delimiter for other platforms
local home "/Users/ben/Documents"
local proot "`home'/Work/Data/Social Science Datatsets/DECC"
* for clam
* local proot "`home'/Work/NEED"
local dpath "`proot'/NEED/End User Licence File 2014/processed"
Ben Anderson's avatar
Ben Anderson committed
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
local rpath "`proot'/results/NEED"

*local verrsion "1.0"
* initial models - all households for electricity models

*local verrsion "1.1"
* restrict to gas only households to avoid complications of:
* - primary electric heating (presumably)
* - oil heating

*local version "v2a_1pc"
*local sample 1
*local sampleby "EE_BAND PROP_TYPE"
* changed from using log consumption to consumption decile to avoid complications due to variable rounding ranges in original data (see readme)
* restricted analysis to households where gas is main heat source as it is better predicted by variables included & is more relevant to EPC (heat)
* uses 1% sample (c 30k) making sure keep proportions of property type and EE_Band to see if linktest fails with smaller n

*local version "v2b_10pc"
*local sample 10
*local sampleby "EE_BAND PROP_TYPE"
* uses 10% sample (c 300k) making sure keep proportions of property type and EE_Band to see if margin plots and co-efficients are the same 
* (linktest etc will probably now fail due to larger n)

local version "v2c_full"
64
local sample 100
Ben Anderson's avatar
Ben Anderson committed
65
local sampleby "EE_BAND PROP_TYPE"
66
* uses full sample (c 3m) to see if margin plots and co-efficients are the same 
Ben Anderson's avatar
Ben Anderson committed
67
* (linktest etc will probably now fail due to larger n)
68
69
70
71
72
73
74
75

set more off

log using "`rpath'/analyse-NEED-EULF-2014-models-`version'-$S_DATE.smcl", replace

* use the pre-processed wide form file which contains all years of consumption data but not the constant values which are in the xwave file
use "`dpath'/need_eul_may2014_consumptionfile_wide.dta", clear

Ben Anderson's avatar
Ben Anderson committed
76
* we're going to use 2012 data only
77
78
79

keep HH_ID *2012*

Ben Anderson's avatar
Ben Anderson committed
80
81
* merge in the pre-processed cross-year fixed values file
merge 1:1 HH_ID using "`dpath'/need_eul_may2014_xwavefile.dta"
82
83

* check what's valid
Ben Anderson's avatar
Ben Anderson committed
84
tab Gcons2012Valid Econs2012Valid, mi // O = off gas, V = valid, L = too low, G = too big, M = missing
85
tabstat Gcons2012, by(Gcons2012Valid) s(mean min max n)
Ben Anderson's avatar
Ben Anderson committed
86
87
* do off-gas use a lot more electricty (heating)?
tabstat Econs2012, by(Gcons2012Valid) s(mean min max n)
88

Ben Anderson's avatar
Ben Anderson committed
89
90
histogram Gcons2012, by(MAIN_HEAT_FUEL, total) name(histo_Gcons2012)
graph export "`rpath'/histo_Gcons2012_by_main_heating_fuel.png", replace
91

Ben Anderson's avatar
Ben Anderson committed
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
tabstat Gcons2012, by(MAIN_HEAT_FUEL) s(n mean min max)

* keep if valid gas & gas = main heat fuel
keep if Gcons2012Valid == "V" & MAIN_HEAT_FUEL == 1

***** random sample ****
* select a random sample but ensure proportions of sampleby are kept
di "* Keeping `sample'% sample by `sampleby'"
sample `sample', by(`sampleby')

tab `sampleby', mi

* log the consumption as it's very skewed -> becomes semi-normal & OK for linear regression
* Gcons = gas
* Econs = Electricity

* create log & deciles
* log - creates a normal distribution
* deciles - avoids the consumption rounding range differences (hopefully)
111
gen log_Gcons2012 = log(Gcons2012)
Ben Anderson's avatar
Ben Anderson committed
112
egen Gcons2012_dec = cut(Gcons2012), group(10)
113
gen log_Econs2012 = log(Econs2012)
Ben Anderson's avatar
Ben Anderson committed
114
egen Econs2012_dec = cut(Econs2012), group(10)
115
116
117
118
119

* combine consumption
* treat missing (gas) as 0
egen Allcons2012 = rowtotal(Gcons2012 Econs2012)

Ben Anderson's avatar
Ben Anderson committed
120
121
*gen log_Allcons2012 = log(Allcons2012)
egen Allcons2012_dec = cut(Allcons2012), group(10)
122

Ben Anderson's avatar
Ben Anderson committed
123
124
125
126
* create log consumption quintiles
*egen quinlog_Allcons2012 = cut(log_Allcons2012), group(5)
*egen quinlog_Gcons2012 = cut(log_Gcons2012), group(5)
*egen quinlog_Econs2012 = cut(log_Econs2012), group(5)
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147

* fix some of the variables

* combine IMD: this is a bit dodgy as they are not strictly comparable
gen ba_imd = IMD_ENG
replace ba_imd = IMD_WALES if ba_imd == .

* must use as category variables!!
* set unkown to be 10 -> adds to end of contrasts so can see effect
replace LOFT_DEPTH = 10 if LOFT_DEPTH == .

* set unkown to be 2020 -> adds to end of contrasts so can see effect
replace BOILER_YEAR = 2020 if BOILER_YEAR == .
replace CWI_YEAR = 2020 if CWI_YEAR == .
replace LI_YEAR  = 2020 if LI_YEAR  == .

* 0 = no
destring BOILER, force replace
replace BOILER = 0 if BOILER == .

* household level vars
Ben Anderson's avatar
Ben Anderson committed
148
149
local generic_hvars "i.BOILER_YEAR i.MAIN_HEAT_FUEL i.LI_YEAR i.LOFT_DEPTH i.FLOOR_AREA_BAND WALL_CONS i.CWI_YEAR i.PROP_TYPE i.PROP_AGE i.EE_BAND"
local generic_hvarsnp "i.BOILER_YEAR i.MAIN_HEAT_FUEL i.LI_YEAR i.LOFT_DEPTH i.FLOOR_AREA_BAND WALL_CONS i.CWI_YEAR i.PROP_AGE i.EE_BAND"
150
151
152
153
154
155
156
157
158
159
160
161
162

* area level vars
local generic_rvars "i.ba_region i.ba_imd"

* define different property types
local ptypes "101 102 103 104 105 106"
local pt101 "detached"
local pt102 "semi"
local pt103 "end_terr"
local pt104 "mid_terr"
local pt105 "bung"
local pt106 "flat"

Ben Anderson's avatar
Ben Anderson committed
163
164
165
* now loop over the energy types & run linear regression models
* NB - the rounding of the consumption values may lead to modelling problems

Ben Anderson's avatar
Ben Anderson committed
166
167
168
169
170
* add Econs Allcons for electricity & sum of both
* rename so graph names don't break
rename log_Gcons2012 lg2012
rename Gcons2012_dec g2012dec
local vars "lg2012 g2012dec"
171
172
173
174
foreach v of local vars {
	* check distributions of original consumption values
	
	* all hhs model
Ben Anderson's avatar
Ben Anderson committed
175
	qui: regress `v' `generic_hvars' ///
176
177
178
		`generic_rvars' ///
		i.BOILER_YEAR
	
Ben Anderson's avatar
Ben Anderson committed
179
	est store `v'
180
181
182
183
184
185
186
187
188
189
	di "* -> `v' estat to test for heteroskedasticity & omitted vars"
	estat ovtest
	estat hettest
	
	* we ought to be testing for linearity too
	di "* -> `v' linktest to test for model specification"
	di "* if p of _hatsq < 0.05 -> mis-spec"
	di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
	linktest	
	
Ben Anderson's avatar
Ben Anderson committed
190
191
192
193
194
	di "* test EPC margins for `v'"
	margins EE_BAND
	marginsplot, name(mplot_`v'_EE_BAND)
	graph export "`rpath'/mplot_`v'_EE_BAND-`version'.png", replace
	
195
196
	* models by property type - to see if rsq & coefficients vary
	foreach p of local ptypes {
Ben Anderson's avatar
Ben Anderson committed
197
198
		di "* -> testing `v' for `pt`p''"
		qui: regress `v' `generic_hvarsnp' ///
199
200
201
			`generic_rvars'	///
			i.BOILER_YEAR ///
			if PROP_TYPE == `p'
Ben Anderson's avatar
Ben Anderson committed
202
		est store `v'_`pt`p''
203
204
205
206
207
208
209
210
211
		
		di "* -> `v' 2012 `pt`p'' - estat to test for heteroskedasticity & omitted vars"
		estat ovtest
		estat hettest
		
		* we ought to be testing for linearity too
		di "* -> `v' `pt`p'' linktest to test for model specification"
		di "* if p of _hatsq < 0.05 -> mis-spec"
		di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
Ben Anderson's avatar
Ben Anderson committed
212
213
214
215
216
217
		linktest
		di "* test EPC margins for `v' (`pt`p'')"
		margins EE_BAND
		marginsplot, name(mplot_`v'_EE_BAND_`pt`p'')
		graph export "`rpath'/mplot_`v'_EE_BAND_`pt`p''-`version'.png", replace

218
219
	}
	* models for different consumption quintiles - to see if rsq & coefficients vary
Ben Anderson's avatar
Ben Anderson committed
220
	/* doesn't make much sense to do this if using deciles as dependent variable
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
	foreach q of numlist 0/4 {
		di "* -> testing log_`v'2012 for quintile: `q'"
		qui: regress log_`v'2012 `generic_hvars' ///
			`generic_rvars'	///
			i.BOILER_YEAR ///
			if quinlog_`v'2012 == `q'
		est store rlog_`v'2012q`q'
		
		di "* -> quintile: `q' - estat to test for heteroskedasticity & omitted vars"
		estat ovtest
		estat hettest
		
		* we ought to be testing for linearity too
		di "* -> quintile: `q' - linktest"
		di "* if p of _hatsq < 0.05 -> mis-spec"
		di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
		linktest	
	}
Ben Anderson's avatar
Ben Anderson committed
239
	*/
240
241
242
243
}

* output all the results - that's a lot of t tests!
* we could put them all out in one file but it would be really hard to find the ones you want!
Ben Anderson's avatar
Ben Anderson committed
244
245
estout lg2012 using "`rpath'/NEED-EULF-2014-log-gas-model-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
estout lg2012_* using "`rpath'/NEED-EULF-2014-log-gas-models-by-property-type-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
246

Ben Anderson's avatar
Ben Anderson committed
247
248
estout g2012dec using "`rpath'/NEED-EULF-2014-gas-deciles-model-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
estout g2012dec_* using "`rpath'/NEED-EULF-2014-gas-deciles-models-by-property-type-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
249
250
251
252

di "* Done!"

log close