analyse-NEED-EULF-2014-models-v2.0.do 8.72 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
* Script to analyse DECC's NEED data to:
* investigate % variance of energy consumption due to dwelling type variables as a way to infer the % of variance due to people

* NB this script uses 2 data files derived from the original data using the 'process' script

* Original data available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
* http://discover.ukdataservice.ac.uk/catalogue/?sn=7518

* Ben Anderson, Energy & Climate Change, Faculty of Engineering & Environment, University of Southampton
* b.anderson@soton.ac.uk
* (c) University of Southampton

* Unless there is a different license file in the folder in which this script is found, the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license applies
* http://creativecommons.org/licenses/by-nc/4.0/

clear all

capture noisily log close

* written for Mac OSX - remember to change filesystem delimiter for other platforms
local home "/Users/ben/Documents"
local proot "`home'/Work/Data/Social Science Datatsets/DECC"
* for clam
* local proot "`home'/Work/NEED"
local dpath "`proot'/NEED/End User Licence File 2014/processed"
Ben Anderson's avatar
Ben Anderson committed
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
local rpath "`proot'/results/NEED"

*local verrsion "1.0"
* initial models - all households for electricity models

*local verrsion "1.1"
* restrict to gas only households to avoid complications of:
* - primary electric heating (presumably)
* - oil heating

*local version "v2a_1pc"
*local sample 1
*local sampleby "EE_BAND PROP_TYPE"
* changed from using log consumption to consumption decile to avoid complications due to variable rounding ranges in original data (see readme)
* restricted analysis to households where gas is main heat source as it is better predicted by variables included & is more relevant to EPC (heat)
* uses 1% sample (c 30k) making sure keep proportions of property type and EE_Band to see if linktest fails with smaller n

*local version "v2b_10pc"
*local sample 10
*local sampleby "EE_BAND PROP_TYPE"
* uses 10% sample (c 300k) making sure keep proportions of property type and EE_Band to see if margin plots and co-efficients are the same 
* (linktest etc will probably now fail due to larger n)

local version "v2c_full"
local sample 10
local sampleby "EE_BAND PROP_TYPE"
Ben Anderson's avatar
Ben Anderson committed
52
* uses full sample (c 4m) to see if margin plots and co-efficients are the same 
Ben Anderson's avatar
Ben Anderson committed
53
* (linktest etc will probably now fail due to larger n)
54
55
56
57
58
59
60
61

set more off

log using "`rpath'/analyse-NEED-EULF-2014-models-`version'-$S_DATE.smcl", replace

* use the pre-processed wide form file which contains all years of consumption data but not the constant values which are in the xwave file
use "`dpath'/need_eul_may2014_consumptionfile_wide.dta", clear

Ben Anderson's avatar
Ben Anderson committed
62
* we're going to use 2012 data only
63
64
65

keep HH_ID *2012*

Ben Anderson's avatar
Ben Anderson committed
66
67
* merge in the pre-processed cross-year fixed values file
merge 1:1 HH_ID using "`dpath'/need_eul_may2014_xwavefile.dta"
68
69

* check what's valid
Ben Anderson's avatar
Ben Anderson committed
70
tab Gcons2012Valid Econs2012Valid, mi // O = off gas, V = valid, L = too low, G = too big, M = missing
71
tabstat Gcons2012, by(Gcons2012Valid) s(mean min max n)
Ben Anderson's avatar
Ben Anderson committed
72
73
* do off-gas use a lot more electricty (heating)?
tabstat Econs2012, by(Gcons2012Valid) s(mean min max n)
74

Ben Anderson's avatar
Ben Anderson committed
75
76
histogram Gcons2012, by(MAIN_HEAT_FUEL, total) name(histo_Gcons2012)
graph export "`rpath'/histo_Gcons2012_by_main_heating_fuel.png", replace
77

Ben Anderson's avatar
Ben Anderson committed
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
tabstat Gcons2012, by(MAIN_HEAT_FUEL) s(n mean min max)

* keep if valid gas & gas = main heat fuel
keep if Gcons2012Valid == "V" & MAIN_HEAT_FUEL == 1

***** random sample ****
* select a random sample but ensure proportions of sampleby are kept
di "* Keeping `sample'% sample by `sampleby'"
sample `sample', by(`sampleby')

tab `sampleby', mi

* log the consumption as it's very skewed -> becomes semi-normal & OK for linear regression
* Gcons = gas
* Econs = Electricity

* create log & deciles
* log - creates a normal distribution
* deciles - avoids the consumption rounding range differences (hopefully)
97
gen log_Gcons2012 = log(Gcons2012)
Ben Anderson's avatar
Ben Anderson committed
98
egen Gcons2012_dec = cut(Gcons2012), group(10)
99
gen log_Econs2012 = log(Econs2012)
Ben Anderson's avatar
Ben Anderson committed
100
egen Econs2012_dec = cut(Econs2012), group(10)
101
102
103
104
105

* combine consumption
* treat missing (gas) as 0
egen Allcons2012 = rowtotal(Gcons2012 Econs2012)

Ben Anderson's avatar
Ben Anderson committed
106
107
*gen log_Allcons2012 = log(Allcons2012)
egen Allcons2012_dec = cut(Allcons2012), group(10)
108

Ben Anderson's avatar
Ben Anderson committed
109
110
111
112
* create log consumption quintiles
*egen quinlog_Allcons2012 = cut(log_Allcons2012), group(5)
*egen quinlog_Gcons2012 = cut(log_Gcons2012), group(5)
*egen quinlog_Econs2012 = cut(log_Econs2012), group(5)
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133

* fix some of the variables

* combine IMD: this is a bit dodgy as they are not strictly comparable
gen ba_imd = IMD_ENG
replace ba_imd = IMD_WALES if ba_imd == .

* must use as category variables!!
* set unkown to be 10 -> adds to end of contrasts so can see effect
replace LOFT_DEPTH = 10 if LOFT_DEPTH == .

* set unkown to be 2020 -> adds to end of contrasts so can see effect
replace BOILER_YEAR = 2020 if BOILER_YEAR == .
replace CWI_YEAR = 2020 if CWI_YEAR == .
replace LI_YEAR  = 2020 if LI_YEAR  == .

* 0 = no
destring BOILER, force replace
replace BOILER = 0 if BOILER == .

* household level vars
Ben Anderson's avatar
Ben Anderson committed
134
135
local generic_hvars "i.BOILER_YEAR i.MAIN_HEAT_FUEL i.LI_YEAR i.LOFT_DEPTH i.FLOOR_AREA_BAND WALL_CONS i.CWI_YEAR i.PROP_TYPE i.PROP_AGE i.EE_BAND"
local generic_hvarsnp "i.BOILER_YEAR i.MAIN_HEAT_FUEL i.LI_YEAR i.LOFT_DEPTH i.FLOOR_AREA_BAND WALL_CONS i.CWI_YEAR i.PROP_AGE i.EE_BAND"
136
137
138
139
140
141
142
143
144
145
146
147
148

* area level vars
local generic_rvars "i.ba_region i.ba_imd"

* define different property types
local ptypes "101 102 103 104 105 106"
local pt101 "detached"
local pt102 "semi"
local pt103 "end_terr"
local pt104 "mid_terr"
local pt105 "bung"
local pt106 "flat"

Ben Anderson's avatar
Ben Anderson committed
149
150
151
* now loop over the energy types & run linear regression models
* NB - the rounding of the consumption values may lead to modelling problems

Ben Anderson's avatar
Ben Anderson committed
152
153
154
155
156
* add Econs Allcons for electricity & sum of both
* rename so graph names don't break
rename log_Gcons2012 lg2012
rename Gcons2012_dec g2012dec
local vars "lg2012 g2012dec"
157
158
159
160
foreach v of local vars {
	* check distributions of original consumption values
	
	* all hhs model
Ben Anderson's avatar
Ben Anderson committed
161
	qui: regress `v' `generic_hvars' ///
162
163
164
		`generic_rvars' ///
		i.BOILER_YEAR
	
Ben Anderson's avatar
Ben Anderson committed
165
	est store `v'
166
167
168
169
170
171
172
173
174
175
	di "* -> `v' estat to test for heteroskedasticity & omitted vars"
	estat ovtest
	estat hettest
	
	* we ought to be testing for linearity too
	di "* -> `v' linktest to test for model specification"
	di "* if p of _hatsq < 0.05 -> mis-spec"
	di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
	linktest	
	
Ben Anderson's avatar
Ben Anderson committed
176
177
178
179
180
	di "* test EPC margins for `v'"
	margins EE_BAND
	marginsplot, name(mplot_`v'_EE_BAND)
	graph export "`rpath'/mplot_`v'_EE_BAND-`version'.png", replace
	
181
182
	* models by property type - to see if rsq & coefficients vary
	foreach p of local ptypes {
Ben Anderson's avatar
Ben Anderson committed
183
184
		di "* -> testing `v' for `pt`p''"
		qui: regress `v' `generic_hvarsnp' ///
185
186
187
			`generic_rvars'	///
			i.BOILER_YEAR ///
			if PROP_TYPE == `p'
Ben Anderson's avatar
Ben Anderson committed
188
		est store `v'_`pt`p''
189
190
191
192
193
194
195
196
197
		
		di "* -> `v' 2012 `pt`p'' - estat to test for heteroskedasticity & omitted vars"
		estat ovtest
		estat hettest
		
		* we ought to be testing for linearity too
		di "* -> `v' `pt`p'' linktest to test for model specification"
		di "* if p of _hatsq < 0.05 -> mis-spec"
		di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
Ben Anderson's avatar
Ben Anderson committed
198
199
200
201
202
203
		linktest
		di "* test EPC margins for `v' (`pt`p'')"
		margins EE_BAND
		marginsplot, name(mplot_`v'_EE_BAND_`pt`p'')
		graph export "`rpath'/mplot_`v'_EE_BAND_`pt`p''-`version'.png", replace

204
205
	}
	* models for different consumption quintiles - to see if rsq & coefficients vary
Ben Anderson's avatar
Ben Anderson committed
206
	/* doesn't make much sense to do this if using deciles as dependent variable
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
	foreach q of numlist 0/4 {
		di "* -> testing log_`v'2012 for quintile: `q'"
		qui: regress log_`v'2012 `generic_hvars' ///
			`generic_rvars'	///
			i.BOILER_YEAR ///
			if quinlog_`v'2012 == `q'
		est store rlog_`v'2012q`q'
		
		di "* -> quintile: `q' - estat to test for heteroskedasticity & omitted vars"
		estat ovtest
		estat hettest
		
		* we ought to be testing for linearity too
		di "* -> quintile: `q' - linktest"
		di "* if p of _hatsq < 0.05 -> mis-spec"
		di "* http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm"
		linktest	
	}
Ben Anderson's avatar
Ben Anderson committed
225
	*/
226
227
228
229
}

* output all the results - that's a lot of t tests!
* we could put them all out in one file but it would be really hard to find the ones you want!
Ben Anderson's avatar
Ben Anderson committed
230
231
estout lg2012 using "`rpath'/NEED-EULF-2014-log-gas-model-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
estout lg2012_* using "`rpath'/NEED-EULF-2014-log-gas-models-by-property-type-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
232

Ben Anderson's avatar
Ben Anderson committed
233
234
estout g2012dec using "`rpath'/NEED-EULF-2014-gas-deciles-model-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
estout g2012dec_* using "`rpath'/NEED-EULF-2014-gas-deciles-models-by-property-type-`version'-$S_DATE.txt", replace cells("b se p _star") stats(r2 r2_a N ll)
235
236
237
238

di "* Done!"

log close