For full detailed documentation see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332169/need_anonymised_dataset_accompanying_documentation.pdf
* Full coding details of variables at: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315189/need_dataset_look_ups.xlsx
Notes (mostly to self):
* gas kWh are weather-corrected within the 10 DNO distribution zones before delivery to DECC
...
...
@@ -15,15 +16,21 @@ Notes (mostly to self):
* It includes only those with valid values on key variables (Property Age, Property Type, Floor Area Band and Energy Efficiency Band) and (especially) valid observations for electricity in 2012.
* Records were selected based on the frequency of household type in the dataset relative to the total dwelling stock so that uncommon property types (e.g. older detached properties) are over-represented and common types (e.g. flats where turnover is high) are under-represented. The supplied weight corrects for this for descriptive analysis.
* Implications for sample bias unclear - there may be other systematic biases not captured by the weight?
* Bias caused by linkage failure is unknown, although the DECC NEED Data Framework report from 2011 suggests match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
* UPRN = unique property reference = linkage mechanism across EPCs, gas/electricity data and EST data on energy efficiency installations (uses AddressBase)
* hoping to add PV etc installations soon
* Bias caused by linkage failure is unknown, although the DECC NEED Data Framework report from 2013 suggests match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
* Both gas and electricity consumption are rounded and the rounding range ('to nearest n') increases through the distributions (see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315189/need_dataset_look_ups.xlsx)
* the E/Gcons*valid variable codes:
* 0 = off gas/elec
* V = valid reading (gas range 100 - 50,000; electricity range = 100 - 25,000)
* L = Gas consumption invalid, less than 100
* M = Gas consumption data is missing in source data
* G = Gas consumption invalid, greater than 50,000
* NB - there are valid gas readings of '0' which presumably were > 100 but < 249 before rounding (first gas 'heap' = 'nearest 500')
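
A minimal sketch of how the validity codes listed above might be applied before analysis, in plain Python. The column names `Gcons2012` and `Gcons2012Valid`, the helper names and the sample rows are assumptions made up for illustration, not taken from the dataset documentation. Anything not flagged `V` is treated as unusable, and 'valid' readings of 0 are picked out separately so they can be inspected.

```python
def usable_reading(value, flag):
    """Return the reading only if its validity flag is 'V', else None."""
    return value if flag == "V" else None


def is_valid_zero(value, flag):
    """True for readings flagged valid but recorded as 0 (likely a rounding effect)."""
    return flag == "V" and value == 0


# Made-up rows, one per validity code discussed in the notes (hypothetical values).
rows = [
    {"Gcons2012": 12500, "Gcons2012Valid": "V"},  # ordinary valid reading
    {"Gcons2012": 0, "Gcons2012Valid": "V"},      # 'valid' zero
    {"Gcons2012": 50, "Gcons2012Valid": "L"},     # invalid, below 100
    {"Gcons2012": None, "Gcons2012Valid": "M"},   # missing in source data
    {"Gcons2012": 60000, "Gcons2012Valid": "G"},  # invalid, above 50,000
]

# Keep valid readings, blank out everything else.
cleaned = [usable_reading(r["Gcons2012"], r["Gcons2012Valid"]) for r in rows]

# Collect the 'valid' zeros for a closer look.
valid_zeros = [r for r in rows if is_valid_zero(r["Gcons2012"], r["Gcons2012Valid"])]

print(cleaned)
print(len(valid_zeros))
```

Keeping the zero check separate from the validity filter means the 'valid' zeros can be kept or dropped depending on the analysis.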
Issues:
* the E/Gcons*valid variable has some undefined labels (L,M,G):
* 0 = off gas/elec (documented)
* V = valid reading (documented: gas range 0 - 50,000; electricity range = 100 - 25,000)
* L = large? (> 50k or 25k depending?)
* M = missing?
* G = ?
* ideally DECC should set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis!
Notes to DECC (!):
* ideally could set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis?
* can the consumption rounding be constant through the distributions?
* check coding of Gcons re: 0 values flagged as 'valid'?
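
To illustrate the 'unpleasant surprises in naive analysis' point, here is a rough sketch comparing a naive mean with a weighted mean restricted to valid readings. The column names (`Econs2012`, `Econs2012Valid`, `weight`), the -99 sentinel and all values are hypothetical; the real variable names and codes should be checked against the look-up file.

```python
def weighted_mean(rows, value_key, flag_key, weight_key):
    """Weighted mean over rows flagged 'V'; returns None if no valid rows."""
    num = 0.0
    den = 0.0
    for r in rows:
        if r[flag_key] != "V":
            continue  # skip off-grid, missing and out-of-range readings
        num += r[weight_key] * r[value_key]
        den += r[weight_key]
    return num / den if den else None


# Made-up sample: two valid readings and one missing reading coded as -99.
sample = [
    {"Econs2012": 3000, "Econs2012Valid": "V", "weight": 2.0},
    {"Econs2012": 5000, "Econs2012Valid": "V", "weight": 1.0},
    {"Econs2012": -99, "Econs2012Valid": "M", "weight": 1.0},
]

# Naive mean: ignores both the validity flag and the weight.
naive = sum(r["Econs2012"] for r in sample) / len(sample)

# Weighted mean over valid readings only.
result = weighted_mean(sample, "Econs2012", "Econs2012Valid", "weight")

print(naive, result)
```

The naive figure is pulled down by the sentinel and ignores the sampling weight, which is why a numeric missing code is only safe if it is re-coded before any descriptive statistics are run.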