Commit 10b37ece authored by Ben Anderson

updated readme

parent c750992b
DECC-git NEED
============
-Extract & analyse data from the public versions of DECC's NEED dataset
+Extract & analyse data from the anonymised & released versions of DECC's NEED dataset.
Original 'End User License' version of the data available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
http://discover.ukdataservice.ac.uk/catalogue/?sn=7518
@@ -9,17 +9,20 @@ http://discover.ukdataservice.ac.uk/catalogue/?sn=7518
For full detailed documentation see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332169/need_anonymised_dataset_accompanying_documentation.pdf
Notes (mostly to self):
-* gas kwh are weather corrected on distribution zones before delivery to DECC
+* gas kwh are weather corrected within the 10 DNO distribution zones before delivery to DECC
-* This dataset is a sample of just over 4 million households which have had an Energy Performance Certificate from the full NEED 'all dwellings' dataset
-* It is a semi-random sample of the 8m records with an EPC, it includes only those with valid values on all variables and (especially) valid observations for electricity in 2012. Uncommon property types are over-represented, common types are under-represented and the weight corrects for this
-* Sample bias is unclear - which kinds of dwellings have an EPC (e.g. flats where frequent churn may be over-represented?)
+* The End User License file (EULF) dataset is a sample of just over 4 million households
+* EULF is a semi-random sample of the 8m records which have an Energy Performance Certificate.
+* It includes only those with valid values on key variables (Property Age, Property Type, Floor Area Band and Energy Efficiency Band) and (especially) valid observations for electricity in 2012.
+* Records were selected based on the frequency of household type in the dataset relative to the total dwelling stock so that uncommon property types are over-represented, common types are under-represented and the supplied weight corrects for this.
+* Implications for sample bias unclear - there may be other systematic biases not captured by the weight?
* UPRN = unique property reference = linkage mechanism (uses AddressBase)
+* Bias caused by linkage failure is unknown although the DECC NEED Data Framework report from 2011 suggests match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
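The notes above say the supplied weight corrects for the over- and under-representation of property types. A minimal sketch of what that correction looks like for a mean, assuming each record carries one consumption value and one weight (the numbers and variable names below are hypothetical, not taken from the dataset):

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical annual gas kWh for two sampled dwellings and their weights.
# The first dwelling stands in for a common (under-sampled) property type,
# so it carries the larger weight.
kwh = [10000, 20000]
weights = [3, 1]

estimate = weighted_mean(kwh, weights)
unweighted = sum(kwh) / len(kwh)
```

Here the unweighted mean over-states consumption because the rarer, higher-consuming dwelling counts as much as the common one; the weighted estimate pulls the figure back towards the common type.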
Issues:
-* <fuel>cons<year>valid variable has undefined labels: G, L, M = ?
-* 0 = off gas/elec ?
-* V = valid reading (gas range 0 - 50,000; elec range = 100 - 25,000)
-* L = large (> 50k or 25k depending?)
+* the E/Gcons*valid variable has some undefined labels (L,M,G):
+* 0 = off gas/elec (documented)
+* V = valid reading (documented: gas range 0 - 50,000; electricity range = 100 - 25,000)
+* L = large? (> 50k or 25k depending?)
* M = missing?
* G = ?
* ideally DECC should set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis!
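Until the undefined labels are clarified, one cautious way to avoid the "unpleasant surprises" is to keep a consumption value only when its validity flag is V and to count every flag value so the undocumented ones stay visible. This is a sketch under assumptions: the column names `gcons2012` and `gcons2012valid` and the sample records are hypothetical, and treating "0" (off gas/elec) as missing rather than as zero consumption is an analysis choice, not something the documentation prescribes.

```python
def clean_consumption(kwh, flag):
    """Return kwh only when the validity flag is 'V', otherwise None."""
    if flag == "V":
        return kwh
    return None


def summarise_flags(rows, flag_col):
    """Count how often each flag value occurs, including undocumented ones."""
    counts = {}
    for row in rows:
        flag = row[flag_col]
        counts[flag] = counts.get(flag, 0) + 1
    return counts


# Hypothetical records; column names are illustrative only.
rows = [
    {"gcons2012": 12000, "gcons2012valid": "V"},
    {"gcons2012": 60000, "gcons2012valid": "L"},
    {"gcons2012": 0, "gcons2012valid": "0"},
    {"gcons2012": 0, "gcons2012valid": "M"},
]

cleaned = [clean_consumption(r["gcons2012"], r["gcons2012valid"]) for r in rows]
flag_counts = summarise_flags(rows, "gcons2012valid")
```

Using `None` instead of a sentinel such as -99 means a naive sum or mean fails loudly on missing values rather than silently including them.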
......