* Script to:
* - analyse DECC's EULF 2014 NEED data to examine distributions etc
* Original data available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
* http://discover.ukdataservice.ac.uk/catalogue/?sn=7518
* NB this script uses 2 data files derived from the original data using the 'process' script

/*
Copyright (C) 2014 University of Southampton

Author: Ben Anderson (b.anderson@soton.ac.uk, @dataknut, https://github.com/dataknut)
[Energy & Climate Change, Faculty of Engineering & Environment, University of Southampton]

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License
(http://choosealicense.com/licenses/gpl-2.0/), or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

#YMMV - http://en.wiktionary.org/wiki/YMMV
*/

clear all

capture noisily log close

* written for Mac OSX - remember to change filesystem delimiter for other platforms
local home "~/Documents"
local proot "`home'/Work/Data/Social Science Datatsets/DECC"

* for clam
* local proot "`home'/Work/NEED"

local dpath "`proot'/NEED/End User Licence File 2014/processed"
local rpath "`proot'/results/NEED"

local version "v1.1"

* set sample
local sample "100pc"

* quick tests for 2012
local do_2012_desc = 0
* tests for all years using long file - takes a while
local do_long_desc = 1
* toggle graph drawing
local do_graphs = 0

set more off

log using "`rpath'/analyse-NEED-EULF-2014-descriptives-`version'-$S_DATE.smcl", replace

if `do_2012_desc' {
    * use a subsample for speed
    local sample = "20pc"
    * first use the wide file for basic descriptives
    use "`dpath'/need_eul_may2014_consumptionfile_wide_`sample'.dta", clear

    * match in the xwave file with the vars we want
    merge 1:1 HH_ID using "`dpath'/need_eul_may2014_xwavefile_`sample'", keepusing(EE_BAND FLOOR_AREA_BAND PROP_AGE)

    * distributions for 2012 (to test)
    * processor intensive
    local vars "Econs2012 Gcons2012"
    local tvars "EE_BAND FLOOR_AREA_BAND PROP_AGE"

    * test values for valid - check for valid 0s for example.
    * This only happens for gas where:
    * 100 < gcons < 250 so included but rounded to nearest 500 = 0
    * elec always rounded to nearest 50 so min should always be 100
    foreach v of local vars {
        tabstat `v', by(`v'Valid) s(n mean semean min max)
        foreach tv of local tvars {
            di "***************"
            * NB: the sample macro in this block is `sample' (e.g. 20pc), not `s'
            di "* Testing `v' by `tv' for `sample' sample"
            * test values for `tv' - check for 0s for example
            tabstat `v' if `v'Valid == "V", by(`tv') s(n mean semean min max)
            tab `v' if `v' < 1000
            if `do_graphs' {
                histogram `v' if `v'Valid == "V", by(`tv') name(h_`tv'_`v'_`sample')
                graph export "`rpath'/graphs/NEED-EULF-2014-histo_`v'_by_`tv'_`sample'_valid.png", replace
                graph box `v' if `v'Valid == "V", over(`tv') name(b_`tv'_`v'_`sample')
                graph export "`rpath'/NEED-EULF-2014-box_`v'_by_`tv'_`sample'_valid.png", replace
            }
        }
    }
}

if `do_long_desc' {
    * Now use the pre-processed long form file which contains all years of consumption data
    * but not the constant values (housing characteristics etc) which are in the xwave file
    * do this for each random sample of differing sizes as a check
    * local samples "10 20 30 40 50 100"
    local samples "10"
    foreach s of local samples {
        di "************************"
        di "* Using `s'% sample"
        use "`dpath'/need_eul_may2014_consumptionfile_long_`s'pc.dta", clear
        * set as panel in case it wasn't
        xtset HH_ID year
        * examine panel status
        xtdescribe

        * distributions for valid obs
        local vars "Econs Gcons"
        foreach v of local vars {
            di "***************"
            di "* Testing `v' for `s'% sample"
            * overall
            xtsum `v' if `v'Valid == "V"
            * test values for valid - check for valid 0s for example.
            * This only happens for gas where:
            * 100 < gcons < 250 so included but rounded to nearest 500 = 0
            * elec always rounded to nearest 50 so min should always be 100
            tabstat `v', by(`v'Valid) s(n mean semean min max)
            * by year
            di "* check `v' for 0s (`s'% sample)"
            table `v' year if `v' < 1000
            table `v'Valid year, c(count `v' min `v' mean `v' max `v')
            if `do_graphs' {
                histogram `v' if `v'Valid == "V", by(year) name(histo_`s'pc_`v')
                graph export "`rpath'/NEED-EULF-2014-`s'pc-histo_`v'_by_year_valid.png", replace
                graph box `v' if `v'Valid == "V", over(year) name(box_`s'pc_`v')
                graph export "`rpath'/NEED-EULF-2014-`s'pc-box_`v'_by_year_valid.png", replace
            }
            * check the panel transitions for each validity code
            * recode the string flag to a labelled numeric so xttrans can tabulate it
            gen `v'Validr = 1 if `v'Valid == "V"
            replace `v'Validr = 2 if `v'Valid == "O"
            replace `v'Validr = 3 if `v'Valid == "L"
            replace `v'Validr = 4 if `v'Valid == "G"
            replace `v'Validr = 5 if `v'Valid == "M"
            lab var `v'Validr "Recoded `v'Valid"
            lab def `v'Validr 1 "(V)alid" 2 "(O)ff-gas" 3 "(L) Gas < 100" 4 "(G) Gas > 50,000" 5 "(M)issing in source"
            lab val `v'Validr `v'Validr
            * di "Check transitions (`v'Validr)"
            xttrans `v'Validr, freq
        }
    }
}

di "* Done!"

log close
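
* --------------------------------------------------------------
* Illustrative sketch only (not part of the original analysis):
* why a "valid" gas reading can appear as 0, as noted in the comments above.
* ASSUMPTION: the rounding applied to the source data behaves like Stata's
* round(x, y), i.e. rounding to the nearest multiple of y. The test values
* below are hypothetical and are not taken from the NEED data.
* Gas: values between 100 and 250 are kept but round to the nearest 500, i.e. 0
assert round(101, 500) == 0
assert round(249, 500) == 0
* just above 250 rounds up to 500 instead
assert round(251, 500) == 500
* Electricity: rounding to the nearest 50 maps a hypothetical reading of 120 to 100
assert round(120, 50) == 100
di "* Rounding sketch: all assertions passed"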