README.md

    DECC-git NEED

    Extract & analyse data from the anonymised & released versions of DECC's NEED dataset.

    Original 'End User License' version of the data:

    Notes (mostly to self):

    • Gas kWh values are weather-corrected within the 10 DNO distribution zones before delivery to DECC
    • The End User License file (EULF) dataset is a sample of just over 4 million households
    • EULF is a semi-random sample of the 8m records which have an Energy Performance Certificate.
    • It includes only those with valid values on key variables (Property Age, Property Type, Floor Area Band and Energy Efficiency Band) and (especially) valid observations for electricity in 2012.
    • Records were selected based on the frequency of household type in the dataset relative to the total dwelling stock so that uncommon property types (e.g. older detached properties) are over-represented and common types (e.g. flats where turnover is high) are under-represented. The supplied weight corrects for this for descriptive analysis.
    • Implications for sample bias unclear - there may be other systematic biases not captured by the weight?
    • UPRN = unique property reference = linkage mechanism across EPCs, gas/electricity data and EST data on energy efficiency installations (uses AddressBase)
    • hoping to add PV etc installations soon
    • Bias caused by linkage failure is unknown, although the DECC NEED Data Framework report from 2013 suggests match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
    • Both gas and electricity consumption are rounded and the rounding range ('to nearest n') increases through the distributions (see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315189/need_dataset_look_ups.xlsx). The reasons for this are explained in the consultation response at https://www.gov.uk/government/consultations/national-energy-efficiency-data-framework-making-data-available
    • The Gcons*valid variable codes:
        • G = Gas consumption invalid, greater than 50,000
        • L = Gas consumption invalid, less than 100
        • M = Gas consumption data is missing in source data
        • 0 = Property does not have a gas connection
        • V = Valid gas consumption (between 100 and 50,000 inclusive)
        • NB - there are valid gas readings of '0', which presumably were between 100 and 249 before rounding (first gas 'heap' = 'nearest 500', so these round down to 0)
    • The Econs*valid variable codes:
        • G = Electricity consumption invalid, greater than 25,000 (DECC lookup table says 50,000)
        • L = Electricity consumption invalid, less than 100
        • M = Electricity consumption data is missing in source dataset
        • V = Valid electricity consumption (between 100 and 25,000 inclusive)

    Notes to DECC (!)

    • ideally could set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis?
    • can the consumption rounding be constant through the distributions?
    • check the coding of Gcons 0 values that are flagged as 'valid'?
    • distinguish between electric & 'other' heating in 'main heating fuel'?
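
A rough Python sketch of how the notes above might be applied in practice: keep only consumption values flagged 'V', use the supplied weight for descriptive statistics, and see why a 'valid' gas reading can appear as 0 after rounding. The field names (`gcons`, `gcons_valid`, `weight`) and the sample records are hypothetical, not the actual NEED variable names or data, and the round-half-up rule is an assumption rather than DECC's documented method.

```python
# Sketch only: field names and sample records are hypothetical,
# not the real NEED variable names or data.

def valid_consumption(value, flag):
    """Return the consumption value only when the validity flag is 'V', else None."""
    return value if flag == "V" else None


def weighted_mean(records, value_key, flag_key, weight_key):
    """Weighted mean of consumption over records flagged 'V'; None if there are none."""
    total = 0.0
    weight_sum = 0.0
    for rec in records:
        value = valid_consumption(rec[value_key], rec[flag_key])
        if value is None:
            continue  # skip G, L, M and any other non-valid codes
        total += value * rec[weight_key]
        weight_sum += rec[weight_key]
    if weight_sum == 0:
        return None
    return total / weight_sum


def round_to_nearest(value, step):
    """Round to the nearest multiple of step, halves rounded up (assumed rule)."""
    return int(value / step + 0.5) * step


records = [
    {"gcons": 12000, "gcons_valid": "V", "weight": 1.5},
    {"gcons": 8000, "gcons_valid": "V", "weight": 0.5},
    {"gcons": 60000, "gcons_valid": "G", "weight": 1.0},  # invalid: too high
    {"gcons": None, "gcons_valid": "M", "weight": 1.0},   # missing in source
]

mean_gas = weighted_mean(records, "gcons", "gcons_valid", "weight")
print(mean_gas)

# A reading of 100-249 kWh is valid but rounds to 0 under 'nearest 500'.
print(round_to_nearest(249, 500), round_to_nearest(250, 500))
```

Filtering on the validity flag rather than on the consumption value itself avoids dropping the 'valid' zero readings, which a naive `value > 0` filter would exclude.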

    YMMV