From 0f4889dcbe5e4d9ae5ef4730d61530982434e953 Mon Sep 17 00:00:00 2001
From: Ben Anderson <b.anderson@soton.ac.uk>
Date: Tue, 16 Sep 2014 09:43:08 +0100
Subject: [PATCH] updated readme following DECC NEED user event

---
 NEED/README.md | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)
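
Reviewer note (not part of the README change): the validity codes documented in this patch suggest a simple recoding step before analysis. The block below is a minimal sketch of that step, assuming the -99 missing-value sentinel proposed in the notes to DECC. The names `MISSING`, `VALIDITY_CODES` and `recode_consumption` are hypothetical and not NEED variable names, the sample records are made up, and treating off-gas/elec ('0') records as missing is an assumption rather than DECC guidance.

```python
# Sketch only: all names here are hypothetical, not NEED variable names.
MISSING = -99  # sentinel value suggested in the README's notes to DECC

# Validity codes as listed in the README notes updated by this patch.
VALIDITY_CODES = {
    "0": "off gas/elec",
    "V": "valid reading",
    "L": "invalid, less than 100",
    "M": "missing in source data",
    "G": "invalid, greater than 50,000",
}


def recode_consumption(kwh, valid_flag):
    """Return kwh for a valid reading, else the MISSING sentinel.

    Assumption: off gas/elec ('0') records are also treated as missing.
    """
    if valid_flag not in VALIDITY_CODES:
        raise ValueError(f"unknown validity code: {valid_flag!r}")
    return kwh if valid_flag == "V" else MISSING


# Made-up example records: (consumption in kWh, validity flag)
records = [(12000, "V"), (0, "V"), (0, "M"), (60000, "G"), (0, "0")]
recoded = [recode_consumption(kwh, flag) for kwh, flag in records]
print(recoded)
```

Note that a 'valid' reading of 0 is kept as 0 here, which matches the rounding behaviour described in the README notes; a naive filter on consumption > 0 would drop those cases.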

diff --git a/NEED/README.md b/NEED/README.md
index 3107586..b7a06c0 100644
--- a/NEED/README.md
+++ b/NEED/README.md
@@ -3,10 +3,11 @@ DECC-git NEED
 
 Extract & analyse data from the anonymised & released versions of DECC's  NEED dataset.
 
-Original 'End User License' version of the data available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
+Original 'End User License' version of the data:
+* available from: UK DATA ARCHIVE: Study Number 7518 - National Energy Efficiency Data-Framework, 2014
 http://discover.ukdataservice.ac.uk/catalogue/?sn=7518
-
-For full detailed documentation see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332169/need_anonymised_dataset_accompanying_documentation.pdf
+* Detailed documentation: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332169/need_anonymised_dataset_accompanying_documentation.pdf
+* Full coding details of variables at: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315189/need_dataset_look_ups.xlsx
 
 Notes (mostly to self):
 * gas kwh are weather corrected within the 10 DNO distribution zones before delivery to DECC
@@ -15,15 +16,21 @@ Notes (mostly to self):
  * It includes only those with valid values on key variables (Property Age, Property Type, Floor Area Band and Energy Efficiency Band) and (especially) valid observations for electricity in 2012. 
  * Records were selected based on the frequency of household type in the dataset relative to the total dwelling stock so that uncommon property types (e.g. older detached properties) are over-represented and common types (e.g. flats where turnover is high) are under-represented. The supplied weight corrects for this for descriptive analysis. 
  * Implications for sample bias unclear - there may be other systematic biases not captured by the weight?
-* UPRN = unique property reference = linkage mechanism (uses AddressBase)
-* Bias caused by linkage failure is unknown although the DECC NEED Data Framework report from 2011 suggest match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
+* UPRN = unique property reference = linkage mechanism across EPCs, gas/electricity data and EST data on energy efficiency installations (uses AddressBase)
+ * hoping to add PV etc installations soon
+* Bias caused by linkage failure is unknown, although the DECC NEED Data Framework report from 2013 suggests match rates of 94%-100% (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209264/Annex_B_-_Quality_Assurance.pdf)
+* Both gas and electricity consumption are rounded and the rounding range ('to nearest n') increases through the distributions (see https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315189/need_dataset_look_ups.xlsx)
+* the E/Gcons*valid variable codes:
+ * 0 = off gas/elec
+ * V = valid reading (gas range 100 - 50,000; electricity range = 100 - 25,000)
+ * L = Gas consumption invalid, less than 100
+ * M = Gas consumption data is missing in source data
+ * G = Gas consumption invalid, greater than 50,000
+ * NB - there are valid gas readings of '0' which presumably were > 100 but < 249 (first gas 'heap' = 'nearest 500')
 
-Issues:
-* the E/Gcons*valid variable has some undefined labels (L,M,G):
- * 0 = off gas/elec (documented)
- * V = valid reading  (documented: gas range 0 - 50,000; electricity range = 100 - 25,000)
- * L = large? (> 50k or 25k depending?)
- * M = missing?
- * G = ?
-* ideally DECC should set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis!
+Notes to DECC (!)
+* ideally could set missing to -99 to aid re-coding and avoid unpleasant surprises in naive analysis?
+* can the consumption rounding be constant through the distributions?
+* check coding of Gcons values of 0 for 'valid' cases?
 
+YMMV
\ No newline at end of file
-- 
GitLab