diff --git a/saveWeights/createSAVEWeightsIPF.Rmd b/saveWeights/createSAVEWeightsIPF.Rmd
new file mode 100644
index 0000000000000000000000000000000000000000..3ddfffad9b2746035a556655421c6f372291c6d0
--- /dev/null
+++ b/saveWeights/createSAVEWeightsIPF.Rmd
@@ -0,0 +1,1833 @@
+---
+title: 'SAVE SDRC 2.2: Creation of Small Area Household Weights using IPF'
+author: "Ben Anderson (b.anderson@soton.ac.uk `@dataknut`), Tom Rushby (t.w.rushby@soton.ac.uk,  `@tom_rushby`)"
+date: 'Last run at: `r Sys.time()`'
+output:
+  html_document:
+    keep_md: yes
+    number_sections: yes
+    self_contained: no
+    toc: yes
+    toc_depth: 4
+    toc_float: yes
+  pdf_document:
+    number_sections: yes
+    toc: yes
+    toc_depth: 4
+  word_document:
+    fig_caption: yes
+    toc: yes
+    toc_depth: 4
+bibliography: ~/bibliography.bib
+---
+```{r knitrSetUp, include=FALSE}
+knitr::opts_chunk$set(echo = FALSE) # do not echo code in the rendered report
+knitr::opts_chunk$set(warning = FALSE)
+knitr::opts_chunk$set(message = FALSE)
+knitr::opts_chunk$set(fig_caption = TRUE)
+knitr::opts_chunk$set(fig.height = 6) # default, make it bigger to stretch vertical axis
+knitr::opts_chunk$set(fig.width = 8) # full width
+knitr::opts_chunk$set(tidy = TRUE) # tidy up code in case echo = TRUE
+```
+
+```{r codeSetup, include=FALSE}
+# Housekeeping ----
+
+# Set start time ----
+startTime <- proc.time()
+
+# Where are we?
+sysInfo <- Sys.info()
+sysName <- sysInfo[[1]]
+nodeName <- sysInfo[[4]]
+userName <- sysInfo[[7]]
+
+
+projLoc <- here::here()
+
+# Functions ----
+
+# Load libraries ----
+library(SAVEr)
+# NB libraries required by saveFunctions.R should already be loaded
+reqLibsLocal <- c("data.table", # fast data manipulation
+                  "ggplot2", # slick graphs
+                  "readr", # for read_csv
+                  "lubridate", # for date manipulation
+                  "stargazer", # for regression tables
+                  "broom", # tidy regression results
+                  "knitr" # for kable
+                     )
+
+# Use Luke's function to require/install/load
+SAVEr::myRequiredPackages(reqLibsLocal)
+
+# Parameters ----
+# Do this here AFTER loading libraries
+print(paste0("Loading parameters from ", projLoc,"/saveParameters.R"))
+source(paste0(projLoc,"/saveParameters.R"))
+
+# give feedback
+print(paste0("Running on ", sysName, " with projLoc = ", projLoc)) # sysName is set in impetusParameters.R
+
+# top level data paths are set in saveParameters.R
+print(paste0("User: ",userName))
+print(paste0("Platform: ",sysName))
+print(paste0("Input data path: ",dPath))
+
+tp1Clamp15mJan2017F <- paste0(dPath,"processed/jan2017clamp15minuteData.csv.gz")
+
+
+# Set dates ----
+# these are used throughout the code
+
+datePeriod <- interval(trial1Start,trial1End)
+genericAbout <- paste0(projLoc,"/genericAbout.Rmd")
+
+```
+
+# About
+
+```{r genericAbout, child = genericAbout}
+# insert here - NB have to set in header
+```
+ 
+
+## Contributions
+
+Please note that authorship is alphabetical. Contributions are listed below.
+
+Ben Anderson (b.anderson@soton.ac.uk `@dataknut`)
+
+ * 
+
+Tom Rushby (t.w.rushby@soton.ac.uk `@tom_rushby`)
+
+ * 
+
+  
+## Code
+
+ * https://github.com/dataknut/SAVE (currently private)
+
+## Citation
+
+If you wish to refer to any of the material from this report please cite as:
+
+ * Anderson, B. and Rushby, T. (`r 1900 + as.POSIXlt(Sys.Date())$year`) _SAVE SDRC 2.2: Spatial Microsimulation_, University of Southampton: Southampton, UK.
+ 
+## Report circulation:
+
+ * Restricted to: [Solent Achieving Value from Efficiency](http://www.energy.soton.ac.uk/save-solent-achieving-value-from-efficiency/) project partners and contractors.
+
+# Executive Summary
+
+This report creates the household weights for each area using IPF, carries out some validation tests and then saves the weights out for future use. We advise re-running this code only if the ipf inputs change as it takes a while to run.
+
+\newpage
+
+# Introduction
+
+Background to what we are trying to do & the methods (revise SDRC 2.1):
+
+Re-use some material from SDRC2.1 here explaining how the method works.
+
+Methodological:
+
+ * IPF - iterative proportional fitting: creates a non-integer weight for each household in each small area but does not allow the selection of the n best-fit households. Instead we use weighted statistics to summarise the results (e.g. kWh profiles);
+ * 'Truncate, replicate, sample' (Lovelace & Ballas): creates integer weights from the ipf fractional weights by selecting the best-fitting n households, where n = the number of households needed per small area (see the sketch below). http://www.sciencedirect.com/science/article/pii/S0198971513000240#
+ 
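+For reference, a minimal illustrative sketch of the 'truncate, replicate, sample' step for a single area is shown below. We do not use it in this report (we work with the fractional weights directly); `w` and `n` are hypothetical inputs and the chunk is not evaluated.
+
+```{r trs sketch, eval=FALSE}
+# A sketch of 'truncate, replicate, sample' for one area (illustrative only, not run)
+# w = fractional ipf weights for the surveyed households, n = households required in the area
+trsSelect <- function(w, n = round(sum(w))){
+  intPart <- floor(w)                      # truncate: keep the whole-number part of each weight
+  replicated <- rep(seq_along(w), intPart) # replicate: household i appears intPart[i] times
+  shortfall <- n - length(replicated)      # how many households are still needed
+  remainders <- w - intPart                # the fractional remainders drive the sampling step
+  sampled <- sample(seq_along(w), size = shortfall,
+                    prob = remainders, replace = FALSE) # sample: fill the shortfall
+  sort(c(replicated, sampled))             # row indices of the selected survey households
+}
+```
+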
+Small areas:
+
+ * LSOAs - originally planned
+ * OAs - SSEN preferred  `r area <- "OA"`
+
+# Data used
+
+ * BMG fieldwork outcomes: `r bmgOrigResponseFile`;
+ * BMG recruitment survey: `r bmgFullRecruitmentSurveyFile`;
+ * BMG TP1 update survey: `r bmgTP1UpdateSurveyFile`;
+ * 15m kWh clamp data for January 2017 as input to a constraint selection model;
+ * Census 2011 small area tables for LSOAs/OAs
+
+
+```{r Load household data}
+# load & process data
+hhAttributesDT <- ba_SAVEcreateHhAttributesDT()
+setkey(hhAttributesDT, bmg_id)
+```
+
+```{r Load january 2017 clamp data}
+# load & process data
+print(paste0("Loading required clamp data: ", tp1Clamp15mJan2017F))
+system.time(tp1Clamp15mDT <- as.data.table(read_csv(tp1Clamp15mJan2017F, progress = FALSE))) # no progress = faster in nb mode 
+
+# child code expects full tp1 data but will work with this
+tp1Clamp15mDT <- tp1Clamp15mDT[, bmg_id := as.character(bmg_id)]
+
+# > add the dates etc that the child code expects ----
+print("Adding dates & half hours")
+
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsDateTime30m := floor_date(obsDateTime, unit = "30 minutes")]
+
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsMonth := month(obsDateTime, label = TRUE)] # requires lubridate
+
+print("Creating fake time")
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsHour := as.POSIXlt(obsDateTime)$hour]
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsHourMin := format(obsDateTime, format = "%H:%M")] # makes graphs easier
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsHourMin := as.POSIXct(strptime(obsHourMin, "%H:%M") # convert the char
+                                               ) 
+                      ] # makes graphs easier (will set date to 'today') - slow - must be a lubridatey way to do this
+
+#>  set a useful time period for filtering purposes ----
+print("Creating 4-8 etc periods")
+tp1Clamp15mDT <- tp1Clamp15mDT[, timePeriod := ifelse(obsHourMin > as.POSIXct('12:01', format='%H:%M') & obsHourMin < as.POSIXct('16:01', format='%H:%M'), 
+                                                      "1: 12:00 - 16:00", 
+                                                      "Other")
+                               ] # prior to evening peak (4-8 pm)
+
+tp1Clamp15mDT <- tp1Clamp15mDT[, timePeriod := ifelse(obsHourMin > as.POSIXct('16:01', format='%H:%M') & obsHourMin < as.POSIXct('20:01', format='%H:%M'), 
+                                                      "2: 16:00 - 20:00", 
+                                                      timePeriod)
+                               ] # evening peak 4-8
+tp1Clamp15mDT <- tp1Clamp15mDT[, timePeriod := ifelse(obsHourMin > as.POSIXct('20:01', format='%H:%M') & obsHourMin <= as.POSIXct('23:31', format='%H:%M'), 
+                                                      "3: 20:00 - 23:30", 
+                                                      timePeriod)
+                               ] # after evening peak 4-8
+
+
+```
+
+
+```{r ipf library and version setup}
+# only affects filenames for saving weights, constraints & survey data
+version <- "_sdrc2.2_" # sdrc 2.2
+
+# special libraries
+# NB libraries required by saveFunctions.R should already be loaded
+reqLibsSMS <- c("MASS", # for stepAIC(); not loaded earlier as conflicts with evening consumption ratio calcs
+                  "ipfp", # load ipfp library 
+                  "survey" # may already be loaded
+                  )
+# Use Luke's function to require/install/load
+lb_myRequiredPackages(reqLibsSMS,"http://cran.rstudio.com/")
+
+```
+
+# Establish optimum constraint variables
+
+Here we decide which of the 'census' variables we have in the survey are the best 'constraints' to use in the weighting process. Normal practice is to run a regression model to isolate the 'best predictors'. 
+
+Outcome variable:
+
+ * evening peak (16:00 - 20:00) kWh per household operationalised as mean(kWh) for January 2017 over all non-zero kWh values
+
+Best fit models:
+
+We usually do this by stepwise entry of candidate variables and then deducing the optimum combinations and order from their contributions to the overall model r squared. Candidate variables must be available in both the Census and the SAVE survey, and the SAVE data must be codeable in the same form as the Census small area counts. We use stepAIC() (see below) to do this.
+
+Candidate SAVE variables (HRP):
+ 
+| Variable        | Available at OA           | Available at LSOA  |
+|---------------|:-------------:|:------:|
+| HRP employment status      | Y | Y |
+| HRP qualifications - not yet tested      | Not for HRPs      |   ? |
+| HRP NS-SEC | Y      |    Y |
+| HRP ethnicity | Y | Y|
+| HRP religion - not yet tested | Y | Y|
+| HRP age | Y | Y |
+
+Candidate SAVE variables (household):
+ 
+ | Variable | Available at OA | Available at LSOA |
+ |----|----|----|
+ | N people| Y | Y | 
+ | Presence of children| Y (composition)| Y | 
+ | Main heat source (gas vs elec vs other)| No |Y | 
+ | Dwelling type| Y (accommodation) | Y | 
+ | N rooms| Y | Y| 
+ | Tenure| Y | Y | 
+ | N cars| Y | Y | 
+ | Composition (married/single/civil partnership etc) - untested|Y| Y|
+ | Presence of gas (DECC LSOA data)| No | Y|
+
+That's quite a long list - 5 or 6 in total is usually optimal. Next we test the ability of these candidates to predict (not cause - just predict) mean half-hourly kWh for the period 16:00 - 20:00.
+
+First we test the distribution of the consumption values using a simple table.
+
+```{r set up clamp data to model best fit constraints}
+# this should be a re-run of the model we used to suggest weighting dimensions to BMG
+
+# create a table with mean kWh for 4-8 all of jan 2017
+tp1Clamp15mDT <- tp1Clamp15mDT[, obsDateTime30m := floor_date(obsDateTime, unit = "30 minutes")]
+modelJan30min4_8DT <- tp1Clamp15mDT[consumptionWh != 0 & obsMonth == "Jan" & timePeriod == "2: 16:00 - 20:00", 
+                                    .(halfHourWh = sum(consumptionWh)), 
+                                    keyby = .(bmg_id, obsDateTime30m)]
+
+modelJan4_8DT <- modelJan30min4_8DT[, .(meanHalfHourkWh = mean(halfHourWh/1000)), keyby = .(bmg_id)]
+
+st <- summary(modelJan4_8DT, digits = 2)
+
+kable(caption = "Summary of half-hourly consumption kWh used in constraint selection model", st)
+```
+
+Next, we test the distributions using histograms (see below).
+
+```{r check kWh distributions}
+# check the distributions of kWh
+myCaption <- "SAVE Clamp data (log mean kWh per half hour) for January 2017, 16:00 - 20:00"
+
+ggplot(modelJan4_8DT, aes(x = meanHalfHourkWh)) +
+  geom_histogram() +
+  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) + 
+  labs(caption = myCaption,
+      x = "Mean kWh per half hour",
+      y = "Count"
+    )
+
+ggplot(modelJan4_8DT, aes(x = log(meanHalfHourkWh))) +
+  geom_histogram() +
+  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) + 
+  geom_vline(xintercept = -4, colour = "red") +
+  annotate("text", x = -4, y = 400, label = "Outlier threshold", colour = "red", hjust= 1.1) +
+  labs(caption = myCaption,
+      x = "Log (Mean kWh) per half hour",
+      y = "Count"
+    )
+
+# set cut threshold for outliers based on previous histogram
+modelJan4_8DT <- modelJan4_8DT[, cut := ifelse(log(meanHalfHourkWh) < -4, 1,0)]
+```
+
+Clearly the untransformed consumption values are very skewed so we use log(mean kWh) in the models. We also cut the transformed values at -4 (see histogram above) to avoid the outliers with non-zero but extremely low consumption values.
+
+```{r add hh attributes to model dt}
+# add household attributes to the model DT
+# only add the ones you want to use to save confusion
+
+modelJan4_8DT <- hhAttributesDT[, .(bmg_id, bmgGroup.respF,
+                                    ba_Q2D_HRPemplType.latest, ba_Q22_HRPethnicity, ba_Q2B_HRPage.fullSurvey, ba_hrpNSSEC7,ba_hrpNSSEC3,
+                                    ba_censusNpeople.latest,
+                                    ba_Q3_6_mainsGas,ba_heatSourceReduced,ba_censusNcars,
+                                    ba_Q8_2_dwelling,ba_Q8_2_dwellingRecoded,
+                                    ba_nrooms,ba_censusTenure,ba_presenceChildren.latest,ba_Q8_27_Income
+                                    )][modelJan4_8DT]
+
+
+```
+
+
+## Estimate regression models
+
+Having established the need to use log(mean(kWh)) we now test two models.  Our first model is just a 'kitchen sink' - we throw everything in. Our second includes only those we know we can get at OA level. 
+
+```{r test survey HRP NS-SEC}
+t <- table(hhAttributesDT$ba_hrpNSSEC3, hhAttributesDT$ba_hrpNSSEC7, useNA = "always")
+```
+
+Note that we are currently unclear as to the validity of our [coding](http://www.ons.gov.uk/ons/guide-method/classifications/current-standard-classifications/soc2010/soc2010-volume-3-ns-sec--rebased-on-soc2010--user-manual/section-14--deriving-the-ns-sec--self-coded-method.pdf) of the NS-SEC categories based on the survey responses and so are not currently using this constraint in the models. Once we have confirmation from BMG that the coding is correct (or BMG supply a correctly coded NS-SEC) then it will be added to the models.
+
+```{r estimate lm models}
+# now test two versions of the model
+
+# model 1 = kitchen sink
+mod1 <- "log(meanHalfHourkWh) ~ ba_censusNpeople.latest + ba_presenceChildren.latest + ba_nrooms + ba_heatSourceReduced + ba_Q3_6_mainsGas + 
+ba_Q8_2_dwellingRecoded + ba_censusTenure + ba_censusNcars + ba_Q8_27_Income +
+ba_Q22_HRPethnicity + ba_Q2D_HRPemplType.latest + ba_Q2B_HRPage.fullSurvey"
+
+# model 2 = only constraints we can get at OA level
+mod2 <- "log(meanHalfHourkWh) ~ ba_censusNpeople.latest + ba_presenceChildren.latest + ba_nrooms + ba_Q8_2_dwellingRecoded + ba_censusTenure + ba_censusNcars + ba_Q22_HRPethnicity + ba_Q2D_HRPemplType.latest + ba_Q2B_HRPage.fullSurvey"
+
+# ideally final model is a refined version of model 2
+# run after excluding outliers (see above for thresholds)
+
+model1 <- lm( as.formula(mod1), data = modelJan4_8DT[cut == 0])
+
+model2 <- lm( as.formula(mod2),  data = modelJan4_8DT[cut == 0])
+
+
+
+# Visualise the model results using ggplot
+tidyM1 <- tidy(model1)
+tidyM1$model <- "M1: Kitchen sink"
+tidyM1$version <- "v1"
+
+tidyM2 <- tidy(model2)
+tidyM2$model <- "M2: OA level constraints"
+tidyM2$version <- "v1"
+
+resultsDT <- data.table(rbind(tidyM1,tidyM2))
+
+myCaption <- paste0("SAVE Clamp data (log mean kWh per half hour) for January 2017 16:00 - 20:00\nIntercept not shown for clarity\nError bars = 95% CI for estimate")
+
+# make the plot
+myPlot <- ba_SAVEvisRegResults(resultsDT, myCaption)
+
+myPlot
+```
+
+Stargazer output:
+
+```{r constraint model stargazer output, results='asis'}
+stargazer(model1, model2,
+          ci = TRUE, 
+          single.row = TRUE,type = "html") # html text latex
+```
+
+### Regression model diagnostics
+
+We use the following regression diagnostics for each model to assess their validity.
+
+```{r constraint tests regression model diagnostics}
+
+models <- c("model1", "model2")
+
+for(m in models){
+  print("## -- Start of diagnostics -- ##")
+    print(paste0("-> Model diagnostics for: ", m))
+      ba_SAVElmDiagnostics(get(m))
+    print("## -- End of diagnostics -- ##")
+}
+```
+
+From the diagnostics, we note results of two variables:
+
+ * the age of the household representative person (ba_Q2B_HRPage.fullSurvey) is below the threshold value for tolerance in all models;
+ * the employment type of the household representative person (ba_Q2D_HRPemplType.latest) is below the threshold value for tolerance in model1;
+
+These results do not cause concern as these variables are 'thrown out' in the constraint selection process that follows.
+
+> are they? TR: yes, originally not included but now they are. :o( so what to do about this?
+
+## Selecting final constraints
+
+However we really need to identify the most parsimonious set of constraints for use in the ipf process (X ref X) and so we run both of these models through the stepAIC() function to select the final model form with the constraint variables that provide the most powerful model. We are well aware of the criticism of selection models of this type when used for inference but in this case we are simply looking for the strongest correlations - causation is not of particular interest. We test both model 1 (kitchen sink) and model 2 to understand the extent to which not having some of the model 1 variables at OA level might cause problems.
+
+First we run the 'kitchen sink' model 1.
+
+```{r stepwise constraint selection model 1}
+
+max.model1 <- lm(  as.formula(mod1),  data = modelJan4_8DT[cut == 0])
+
+max.model2 <- lm(  as.formula(mod2),  data = modelJan4_8DT[cut == 0])
+# use step() to give us a stepwise model
+#bwd.model <- step(max.model, direction="backward", trace = 0)
+# this produces the full model results (as above)
+#summary(bwd.model)
+# we wanted the 'best' model
+
+# try using stepAIC from MASS
+# http://www.statmethods.net/stats/regression.html
+
+stepAic1 <- stepAIC(max.model1, direction="both")
+stepAic1$anova # display results 
+
+# try all sub-sets using leaps
+#library(leaps)
+
+#leaps<-regsubsets(as.formula(mod2),  data = modelJan4_8DT[cut == 0],nbest=10)
+# view results
+#summary(leaps)
+# plot a table of models showing variables in each model.
+# models are ordered by the selection statistic.
+#plot(leaps,scale="r2")
+
+# try using bestglm
+# make bespoke data table with outcome being column 1
+# bestDT <- modelJan4_8DT[cut == 0, .(logMeankWh = log(meanHalfHourkWh), as.factor(ba_Q22_HRPethnicity), as.factor(ba_Q2D_HRPemplType.latest),  as.factor(ba_Q2B_HRPage.fullSurvey), as.factor(ba_censusNpeople.latest) , as.factor(ba_presenceChildren.latest) , as.factor(ba_nrooms) , as.factor(ba_heatSourceReduced) , as.factor(ba_Q3_6_mainsGas) , as.factor(ba_Q8_2_dwellingRecoded) , as.factor(ba_censusTenure))]
+
+# for some reason this fails (type/coding issues)
+#bestglm(bestDT, IC = "AIC")
+```
+
+These results suggest that, ideally, we would want to know the numbers of households in each OA who:
+
+ * have different heat sources (gas vs electricity vs other) - from Census 2011
+ * do / do not have gas - from DECC/BEIS sub-national energy statistics
+ * are in different income brackets
+ 
+Unfortunately none of these are available at the OA level. Heat source and gas/no gas are available at the LSOA level but household income is not.
+
+
+We next run the second model containing just those variables that are available at OA level.
+
+```{r run stepAIC model 2}
+stepAic2 <- stepAIC(max.model2, direction="both")
+stepAic2$anova # display results 
+```
+
+
+## Summary
+
+The original models show the contributions each variable makes. Of note:
+
+ * Household attributes which are strong (& sig) predictors: No gas, 1+ child, other electric/gas heat, n people; n cars & tenure are marginal
+ * HRP attributes are relatively non-predictive except: retired, mixed ethnicity, asian, NS-SEC 7 coding
+
+The final model, based on variables available at OA level and produced using stepAIC(), suggests that we need:
+
+ * Household attributes: N people, presence children, n rooms, dwelling type, number of cars
+ * HRP attributes required: ethnicity, employment status, age
+ 
+This tells us that our visual inspection of the full model was basically correct. We will use this set of constraints in our initial ipf spatial microsimulation process.
+
+```{r clean up model tables}
+modelJan30min4_8DT <- NULL
+```
+
+# Process census data
+
+We need `r area` level counts of households in the various constraint categories identified above. See table above for their availability at OA level.
+
+We have downloaded the relevant OA level tables from nomisweb (https://www.nomisweb.co.uk/census/2011/data_finder). Some of these are univariate tables but others are bivariate tables which allow us to calculate two or more constraints at the same time should we wish to.
+
+The key columns we need have then been selected and saved to new (clean) .csv files with simple column headings. This was done manually in excel (!), although in principle the clean-up could be scripted (see the sketch below). We load and merge these files below, keeping only the OAs in the study area.
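+
+The following sketch shows how one such clean-up might be scripted; the raw file name and column headings are hypothetical and would need to match the actual nomisweb download, so the chunk is not evaluated.
+
+```{r sketch census clean up, eval=FALSE}
+# A sketch (not run) of scripting the manual excel clean-up step.
+# NB: the raw file name and the backticked column headings below are hypothetical.
+rawDT <- as.data.table(read_csv(paste0(dPath, "census/raw/2011_OA_HHSIZE_raw.csv")))
+cleanDT <- rawDT[, .(oaCode = `2011 output area`,
+                     nHhSpacesPeople = `Total: Household size`,
+                     npeople_1 = `1 person in household`,
+                     npeople_2 = `2 people in household`)] # ...and so on for the remaining categories
+write_csv(cleanDT, paste0(dPath, "census/clean/2011_OA_HHSIZE_clean.csv"))
+```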
+
+
+## Household attributes
+
+Needed: N people, presence of children, n rooms, dwelling type, number of cars.
+
+First we load household attribute data starting with the number of people. Note the categories available in each census table; we will need to match the categories in our microdata (household attributes) to these.
+
+```{r load household census data hh size}
+
+# > N people (required) ----
+oa2011HhSizeDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HHSIZE_clean.csv.gz")))
+
+# check sum - need to make sure the n households in each category adds up to the total
+oa2011HhSizeDT <- oa2011HhSizeDT[, sumAge := npeople_1 + npeople_2 + npeople_3 + npeople_4 + npeople_5 + npeople_6 +
+                                   npeople_7 + npeople_8m]
+
+oa2011HhSizeDT <- oa2011HhSizeDT[, checkDiffAge := nHhSpacesPeople - sumAge] # there may be differences as the total = hh spaces
+setkey(oa2011HhSizeDT, oaCode)
+#summary(oa2011HhSizeDT)
+
+kable(caption = "Census data: Household size", head(oa2011HhSizeDT))
+
+```
+
+```{r load presence of children}
+# > presence of children (required) ----
+
+oa2011nKidsDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HHType_nkids_clean.csv.gz")))
+
+# check sum - need to make sure the n households in each category adds up to the total
+oa2011nKidsDT <- oa2011nKidsDT[, sumKids := nchild_0 + nchild_1m]
+
+oa2011nKidsDT <- oa2011nKidsDT[, checkDiffKids := nHsChildren - sumKids] # there may be differences as the total = hh spaces
+# remove variables we don't use
+oa2011nKidsDT <- oa2011nKidsDT[, .(oaCode, nHsChildren, nchild_0, nchild_1m, sumKids, checkDiffKids)]
+setkey(oa2011nKidsDT, oaCode)
+#summary(oa2011nKidsDT)
+
+kable(caption = "Census data: Household size", head(oa2011nKidsDT))
+```
+
+Next n rooms.
+```{r load n rooms}
+# > N rooms(required) ----
+
+oa2011nRoomsDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_nRooms_clean.csv.gz")))
+
+# check sum - need to make sure the n households in each category adds up to the total
+oa2011nRoomsDT <- oa2011nRoomsDT[, sumRooms := rooms_1 + rooms_2 + rooms_3 + rooms_4 + rooms_5 + rooms_6m]
+
+oa2011nRoomsDT <- oa2011nRoomsDT[, checkDiffRooms := nHhsRooms - sumRooms] # there may be differences as the total = hh spaces
+setkey(oa2011nRoomsDT, oaCode)
+#summary(oa2011nRoomsDT)
+
+kable(caption = "Census data: N rooms", head(oa2011nRoomsDT))
+```
+
+```{r census dwelling type and energy sources}
+# > dwelling type ----
+oa2011AccomDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HH_Accommodation_clean.csv.gz")))
+
+# check sum - need to make sure the n hhs in each category adds up to the total
+oa2011AccomDT <- oa2011AccomDT[, sumAccom :=  accom_houseDet + accom_houseSemiDet + accom_houseTerr + accom_purposeBuiltFlat + accom_convertedFlat + accom_commercialBldgFlat + accom_caravanTemp + nHhsSharedDwelling]
+
+oa2011AccomDT <- oa2011AccomDT[, checkDiffAccom := nHsAccom - sumAccom] # there may be differences as the total = hh spaces
+# remove variables we don't use
+oa2011AccomDT <- oa2011AccomDT[, .(oaCode, nHsAccom, accom_houseDet, accom_houseSemiDet, accom_houseTerr, accom_purposeBuiltFlat, accom_convertedFlat, accom_commercialBldgFlat, accom_caravanTemp, sumAccom, checkDiffAccom)]
+setkey(oa2011AccomDT, oaCode)
+#summary(oa2011AccomDT)
+
+kable(caption = "Census data: Accommodation type summary", summary(oa2011AccomDT, digits = 2))
+
+kable(caption = "Census data: Accommodation type", head(oa2011AccomDT))
+
+# > main heat source (if possible) ----
+# LSOA only?
+
+# > presence of gas ----
+# use DECC count of MPANs to get count with/without?
+```
+
+Now census tenure.
+
+```{r census tenure}
+# > Tenure ----
+# extract from HRP ethnicity & tenure table
+oa2011HrpEthnicityTenureDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HRP_Ethnicity_Tenure_Age_clean.csv.gz")))
+
+oa2011TenureDT <- oa2011HrpEthnicityTenureDT[, .(oaCode = geography,
+                                                       nTenure = nHhsEthnicity,
+                                                 tenureOwned ,
+                                                       tenureSocialRented,
+                                                       tenurePrivateRented)]
+# check sum
+oa2011TenureDT <- oa2011TenureDT[, sumTenure := tenureOwned + tenureSocialRented + tenurePrivateRented]
+
+oa2011TenureDT <- oa2011TenureDT[, checkDiffTenure := nTenure - sumTenure]
+setkey(oa2011TenureDT, oaCode)
+#summary(oa2011TenureDT)
+
+kable(caption = "Census data: tenure", head(oa2011TenureDT))
+```
+
+And finally, number of cars.
+
+```{r load n cars}
+# > N cars ----
+oa2011nCarsDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_CarsVans_clean.csv.gz")))
+
+# check sum - need to make sure the n households in each category adds up to the total
+oa2011nCarsDT <- oa2011nCarsDT[, sumCars := cars_0 + cars_1 + cars_2 + cars_3 + cars_4m]
+
+oa2011nCarsDT <- oa2011nCarsDT[, checkDiffCars := nHhsCars - sumCars] # there may be differences as the total = hh spaces
+setkey(oa2011nCarsDT, oaCode)
+# summary(oa2011nCarsDT)
+
+kable(caption = "Census data: n cars", head(oa2011nCarsDT))
+```
+
+## HRP attributes
+
+HRP attributes required: ethnicity, employment status, age.
+
+Now we load HRP attribute data starting with ethnicity.
+
+```{r load HRP ethnicity data}
+# > HRP ethnicity (required) ----
+oa2011HrpEthnicityTenureDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HRP_Ethnicity_Tenure_Age_clean.csv.gz")))
+
+# create separate table with correct names to match survey
+oa2011HrpEthnicityDT <- oa2011HrpEthnicityTenureDT[, .(oaCode = geography,
+                                                       nHrpsEthnicity = nHhsEthnicity,
+                                                       hrpWhite = nHrpWhite,
+                                                       hrpMixed = nHrpMixed,
+                                                       hrpAsianBrit = nHrpAsianBrit,
+                                                       hrpBlackBrit = nHrpBlackBrit,
+                                                       hrpOtherEthnic = nHrpOtherEthnic)]
+
+# check sum - need to make sure the n hrps  in each category adds up to the total
+oa2011HrpEthnicityDT <- oa2011HrpEthnicityDT[, sumEthnicity := hrpWhite + hrpMixed + hrpAsianBrit + hrpBlackBrit + hrpOtherEthnic]
+
+oa2011HrpEthnicityDT <- oa2011HrpEthnicityDT[, checkDiffEthnicity := nHrpsEthnicity - sumEthnicity]
+setkey(oa2011HrpEthnicityDT, oaCode)
+#summary(oa2011HrpEthnicityDT)
+
+kable(caption = "Census data: HRP ethnicity", head(oa2011HrpEthnicityDT))
+```
+
+Now HRP employment status.
+
+> to do
+
+```{r load hrp employment status}
+# > HRP employment status (required) ----
+# NB in survey this is: f/t, retired, other - not p/t?
+
+oa2011HrpEmplDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HRP_EconAct_clean.csv.gz")))
+
+# pre-configured to match BMG survey codes - we did not ask work hours of self-employed :-(
+
+# check sum - need to make sure the n hrps  in each category adds up to the total
+oa2011HrpEmplDT <- oa2011HrpEmplDT[, sumHrpEmpl := econActEmplFt + econActEmplPt + 
+                                     econActSelfEmpl +
+                                     econActUnemp + econActInactiveRetired +
+                                     econActInactiveOther]
+
+oa2011HrpEmplDT <- oa2011HrpEmplDT[, checkDiffEmpl :=  nHrpEconAct - sumHrpEmpl ]
+setkey(oa2011HrpEmplDT, oaCode)
+s <- summary(oa2011HrpEmplDT, digits = 2)
+kable(caption = "Census data summary: HRP employment", s)
+kable(caption = "Census data: HRP employment", head(oa2011HrpEmplDT))
+```
+
+Age - note that we have to collapse 35-49 & 50-64 into 35-64 as the boundary 49/50 is not available in the SAVE survey.
+
+```{r load HRP age data}
+# > HRP age (required) ----
+
+oa2011HrpAgeDT <- as.data.table(read_csv(paste0(dPath, "census/clean/2011_OA_HRP_Age_Living_Arrangements_clean.csv.gz")))
+setkey(oa2011HrpAgeDT, oaCode)
+
+# adjust constraints to match
+oa2011HrpAgeDT <- oa2011HrpAgeDT[, hrp35_64 := hrp35_49 + hrp50_64]
+oa2011HrpAgeDT <- oa2011HrpAgeDT[, c("hrp35_49","hrp50_64") := NULL] # remove those columns to prevent confusion later
+
+# force variable order
+oa2011HrpAgeDT <- oa2011HrpAgeDT[,.(oaCode, hrp16_24,hrp25_34,hrp35_64,hrp65plus)]
+
+kable(caption = "Census data: HRP age", head(oa2011HrpAgeDT))
+```
+
+Now NS-SEC coding.
+
+As noted above we are currently unclear on the validity of our survey NS-SEC coding and so are not currently using this constraint.
+
+```{r load hrp ns-sec coding}
+# > HRP NS-SEC (not required for now) ----
+# Census form: (https://www.nomisweb.co.uk/census/2011/lc6115ew)
+    # All categories: NS-SeC (default)
+    # 1. Higher managerial, administrative and professional occupations
+    # 2. Lower managerial, administrative and professional occupations
+    # 3. Intermediate occupations
+    # 4. Small employers and own account workers
+    # 5. Lower supervisory and technical occupations
+    # 6. Semi-routine occupations
+    # 7. Routine occupations
+    # 8. Never worked and long-term unemployed
+    # L15 Full-time students
+
+```
+
+Now we combine the constraint variables into one table and keep only the OAs we want for Hampshire, Southampton, Portsmouth and the Isle of Wight. First, we present a table showing the number of OAs in the SAVE study area.
+
+XXX provide further commentary on charts displayed by chunk below - paired charts etc XXX
+The following charts and summary tables provide a check on the consistency of the census inputs: the sums of the observations in the constraint sub-categories should match the census household totals for each OA.
+
+```{r merge constraints}
+# this is for all of the South East region
+# this is a default order, might need to re-order based on regression results above
+oa2011consDT <- oa2011HhSizeDT[oa2011HrpAgeDT][oa2011HrpEthnicityDT][oa2011HrpEmplDT][oa2011nCarsDT][oa2011nKidsDT][oa2011nRoomsDT][oa2011TenureDT][oa2011AccomDT]
+setkey(oa2011consDT, oaCode)
+
+# all OAs
+oa2011allDT <- as.data.table(read_csv(paste0(dPath, "geo/RUC11_OA11_EW.csv.gz")))
+
+oa2011allDT <- oa2011allDT[, oaCode := OA11CD]
+setkey(oa2011allDT, oaCode)
+
+# merge the LA & county labels
+oa2011LUTDT <- as.data.table(read_csv(paste0( dPath, "geo/oacode_2011_lookups.csv.gz")))
+oa2011LUTDT <- oa2011LUTDT[, oaCode := oa11cd]
+setkey(oa2011LUTDT, oaCode)
+
+oa2011SAVELUTDT <- oa2011LUTDT[countyname_2011 %like% "Hampshire" | 
+                                 laname_2011 %like% "Wight" | 
+                                 laname_2011 %like% "Southampton" | 
+                                 laname_2011 %like% "Portsmouth" ]
+
+
+# check
+t <- table(oa2011SAVELUTDT$laname_2011, oa2011SAVELUTDT$countyname_2011, useNA = "always")
+
+# TR - add totals
+t <- addmargins(t) # add sum column and row to table "t"
+# rename column and row - from "Sum" to "Total"
+colnames(t)[colnames(t)=="Sum"] = "Total"
+rownames(t)[rownames(t)=="Sum"] = "Total"
+kable(caption = "Count of OAs in each LA or county", t)
+
+# filter our constraints to match and reduce number of variables
+setkey(oa2011SAVELUTDT, oaCode)
+oa2011consSAVEDT <- oa2011consDT[oa2011SAVELUTDT[, .(oaCode)]]
+
+# test total counts input from census
+hhCountsDT <- oa2011consSAVEDT[, .(nHhSpacesPeople, nHhsCars, nHhsRooms,nHsAccom)]
+pairs(hhCountsDT)
+summary(hhCountsDT, digits = 2)
+
+# test sum counts (derived from constraints)
+
+hhSumsDT <- oa2011consSAVEDT[, .(sumAge, sumEthnicity, sumCars, sumKids, sumRooms, sumTenure,sumAccom)]
+pairs(hhSumsDT)
+kable(caption = "Summary of sums of constraint sub-categories", summary(hhSumsDT, digits = 2))
+
+
+# test sum differences (derived from both)
+
+hhDiffsDT <- oa2011consSAVEDT[, .(checkDiffAge, checkDiffEthnicity, checkDiffEmpl,checkDiffCars, checkDiffKids, checkDiffRooms, checkDiffTenure, checkDiffAccom)]
+kable(caption = "Summary of differences in counts between census constraint totals (from census) and the sum of the sub-categories per constraint", summary(hhDiffsDT, digits = 2))
+
+```
+
+We have `r nrow(oa2011consSAVEDT)` rows of data corresponding to individual OAs, with constraint sub-categories containing a minimum of `r min(hhCountsDT)` and a maximum of `r max(hhCountsDT)` households.
+
+It looks like the constraints always add up to the correct number of households (the maximum differences in cell counts are reported as +/- `r max(hhDiffsDT)`/`r min(hhDiffsDT)`) which is slightly unusual... Normally we see differences of +/- 2-3 due to record swapping etc.
+
+# Set up data for IPF
+
+```{r set up constraints data for ipf}
+# relies heavily on: http://robinlovelace.net/spatial-microsim-book/smsim-in-R.html
+
+setkey(oa2011consSAVEDT, oaCode) # force order now & hope ipfp preserves it
+oaCodes <- oa2011consSAVEDT[, oaCode]
+```
+
+```{r set up survey data for ipf}
+# Set up survey data, removing non-surveyed ----
+
+testSurveyDT <- hhAttributesDT[ba_surveyMode.fullSurvey != "No survey data", .(bmg_id, bmgGroup.respF, ba_surveyMode.fullSurvey, ba_Q22_HRPethnicity, ba_Q2B_HRPage.fullSurvey, ba_Q2D_HRPemplType.latest, ba_censusNpeople.latest, ba_censusNcars, ba_censusTenure, ba_presenceChildren.latest)]
+
+print("Surveyed households:")
+uniqueN(testSurveyDT$bmg_id)
+
+# Apply filters ----
+# > - no kWh data in the baseline period ----
+
+havekWhDT <- modelJan4_8DT[, .(nObs = .N), keyby = bmg_id]
+setkey(havekWhDT, bmg_id)
+
+testSurveyDT <- merge(testSurveyDT, havekWhDT)
+
+print("Surveyed households with kWh data in Jan 2017:")
+uniqueN(testSurveyDT$bmg_id)
+
+# Convert to dummy vars - probably a simpler way to do this ----
+
+# > HRP data ----
+# >> hrp16_24, hrp25_34, hrp35_49, hrp50_64, hrp65plus ----
+# hrp16_24, hrp25_34, hrp35_49, hrp50_64, hrp65plus
+# table(testSurveyDT$ba_Q2B_HRPage.fullSurvey, useNA = "always")
+# refusals here
+testSurveyDT <- testSurveyDT[ba_Q2B_HRPage.fullSurvey != "Refused"] # no refusals
+
+testSurveyDT <- testSurveyDT[, hrp16_24 := ifelse(ba_Q2B_HRPage.fullSurvey == "18 - 24",1,0)]
+testSurveyDT <- testSurveyDT[, hrp25_34 := ifelse(ba_Q2B_HRPage.fullSurvey == "25 - 34",1,0)]
+# NB have to group - see above (forced by Census OA hrp age cuts)
+testSurveyDT <- testSurveyDT[, hrp35_64 := ifelse(ba_Q2B_HRPage.fullSurvey == "35 - 44" | 
+                                                    ba_Q2B_HRPage.fullSurvey == "45 - 54" |
+                                                    ba_Q2B_HRPage.fullSurvey == "55 - 64",1,0)]
+testSurveyDT <- testSurveyDT[, hrp65plus := ifelse(ba_Q2B_HRPage.fullSurvey == "65 - 74" |
+                                                     ba_Q2B_HRPage.fullSurvey == "75+",1,0)]
+
+print("Surveyed households with kWh data in Jan 2017, age recoded:")
+uniqueN(testSurveyDT$bmg_id)
+
+# >> hrpWhite, hrpMixed, hrpAsianBrit,hrpBlackBrit, hrpOtherEthnic ----
+# table(testSurveyDT$ba_Q22_HRPethnicity, useNA = "always")
+testSurveyDT <- testSurveyDT[!is.na(ba_Q22_HRPethnicity)] # no NAs
+# refusals here
+testSurveyDT <- testSurveyDT[ba_Q22_HRPethnicity != "Refused"] # no refusals
+testSurveyDT <- testSurveyDT[, hrpWhite := ifelse(ba_Q22_HRPethnicity == "White British/Irish",1,0)]
+testSurveyDT <- testSurveyDT[, hrpMixed := ifelse(ba_Q22_HRPethnicity == "Mixed",1,0)]
+testSurveyDT <- testSurveyDT[, hrpAsianBrit := ifelse(ba_Q22_HRPethnicity == "Asian/Asian British",1,0)]
+testSurveyDT <- testSurveyDT[, hrpBlackBrit := ifelse(ba_Q22_HRPethnicity == "Black/Black British",1,0)]
+testSurveyDT <- testSurveyDT[, hrpOtherEthnic := ifelse(ba_Q22_HRPethnicity == "Other",1,0)]
+
+print("Surveyed households with kWh data in Jan 2017, ethnicity recoded:")
+uniqueN(testSurveyDT$bmg_id)
+
+# >> econActEmplFt, econActEmplPt, econActSelfEmpl, econActUnemp, econActInactiveRetired, econActInactiveOther ----
+# table(testSurveyDT$ba_Q2D_HRPemplType.latest, useNA = "always")
+# econActEmplFt, econActEmplPt, econActSelfEmpl, econActUnemp, econActInactiveRetired, econActInactiveOther
+# refusals here
+testSurveyDT <- testSurveyDT[, econActEmplFt := ifelse(ba_Q2D_HRPemplType.latest == "HRP in full-time employment",1,0)]
+testSurveyDT <- testSurveyDT[, econActEmplPt := ifelse(ba_Q2D_HRPemplType.latest == "HRP in part-time employment (8-29 hours/week)",1,0)]
+testSurveyDT <- testSurveyDT[, econActSelfEmpl := ifelse(ba_Q2D_HRPemplType.latest == "HRP self-employed (unkown hours)",1,0)]
+testSurveyDT <- testSurveyDT[, econActUnemp := ifelse(ba_Q2D_HRPemplType.latest == "Unemployed",1,0)]
+testSurveyDT <- testSurveyDT[, econActInactiveRetired := ifelse(ba_Q2D_HRPemplType.latest == "HRP retired",1,0)]
+testSurveyDT <- testSurveyDT[, econActInactiveOther := ifelse(ba_Q2D_HRPemplType.latest == "Other",1,0)]
+
+print("Surveyed households with kWh data in Jan 2017, HRP employment recoded:")
+uniqueN(testSurveyDT$bmg_id)
+
+# > Household data ----
+# >> npeople_1, npeople_2, npeople_3, npeople_4, npeople_5, npeople_6, npeople_7, npeople_8m ----
+testSurveyDT <- testSurveyDT[!is.na(ba_censusNpeople.latest)] # no NAs
+testSurveyDT <- testSurveyDT[, npeople_1 := ifelse(ba_censusNpeople.latest == "1",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_2 := ifelse(ba_censusNpeople.latest == "2",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_3 := ifelse(ba_censusNpeople.latest == "3",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_4 := ifelse(ba_censusNpeople.latest == "4",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_5 := ifelse(ba_censusNpeople.latest == "5",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_6 := ifelse(ba_censusNpeople.latest == "6",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_7 := ifelse(ba_censusNpeople.latest == "7",1,0)]
+testSurveyDT <- testSurveyDT[, npeople_8m := ifelse(ba_censusNpeople.latest == "8+",1,0)]
+
+# >> nchild_0, nchild_1m ----
+testSurveyDT <- testSurveyDT[ba_presenceChildren.latest != "Refused"] # no refusals
+testSurveyDT <- testSurveyDT[, nchild_0 := ifelse(ba_presenceChildren.latest == "0 children",1,0)]
+testSurveyDT <- testSurveyDT[, nchild_1m := ifelse(ba_presenceChildren.latest == "1+ child",1,0)]
+
+# >> cars_0, cars_1, cars_2, cars_3, cars_4m, ----
+testSurveyDT <- testSurveyDT[ba_censusNcars != "Refused"] # no refusals
+testSurveyDT <- testSurveyDT[, cars_0 := ifelse(ba_censusNcars == "0",1,0)]
+testSurveyDT <- testSurveyDT[, cars_1 := ifelse(ba_censusNcars == "1",1,0)]
+testSurveyDT <- testSurveyDT[, cars_2 := ifelse(ba_censusNcars == "2",1,0)]
+testSurveyDT <- testSurveyDT[, cars_3 := ifelse(ba_censusNcars == "3",1,0)]
+testSurveyDT <- testSurveyDT[, cars_4m := ifelse(ba_censusNcars == "4+",1,0)]
+
+
+# >> tenureOwned, tenureSocialRented, tenurePrivateRented ----
+testSurveyDT <- testSurveyDT[ba_censusTenure != "x.Refused/dk/Other"] # no refusals etc - this removes quite a few
+testSurveyDT <- testSurveyDT[, tenureOwned := ifelse(ba_censusTenure == "Own",1,0)]
+testSurveyDT <- testSurveyDT[, tenureSocialRented := ifelse(ba_censusTenure == "Social rent",1,0)]
+testSurveyDT <- testSurveyDT[, tenurePrivateRented := ifelse(ba_censusTenure == "Private rent",1,0)]
+
+#head(testSurveyDT)
+
+# now we have to remove the bmg_id & turn it round for ipfp - ipfp does not want non-numeric input
+# this is very annoying as it means we can't allow the bmg_id to pass through the ipf
+setkey(testSurveyDT, bmg_id) # force order now & hope ipfp preserves it
+
+# Save the final constraints & survey data for future use ----
+v <- paste0(version, "all_hhs")
+
+saveFileName <- paste0("SAVEipfHouseholds_v", v)
+
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving testSurveyDT to ", ofile))
+
+
+write_csv(testSurveyDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(testSurveyDT))
+
+print("Surveyed households with kWh data in Jan 2017 (after removal of NAs & refusals for ipf):")
+uniqueN(testSurveyDT$bmg_id)
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+
+saveFileName <- paste0("SAVEipfOaConstraints_v", v)
+
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving oa2011consSAVEDT to ", ofile))
+
+
+write_csv(oa2011consSAVEDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(oa2011consSAVEDT))
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+
+
+```
+
+The chunk above sets up the two input data files we need:
+
+ * constraints - area level counts of categories (`r nrow(oa2011consSAVEDT)` rows/OAs, `r ncol(oa2011consSAVEDT)` constraint sub-categories)
+ * survey data - with the constraints in the same form (`r nrow(testSurveyDT)` rows/households, `r ncol(testSurveyDT)` constraint sub-categories)
+ 
+> NB: we lose a lot of survey households due to the non-response to the HRP ethnicity item. Potentially could skip this constraint?
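+
+As a quick check on how much the ethnicity item costs us, the chunk below (a simple sketch using the survey attribute data loaded earlier) tabulates its raw distribution including missing and refused responses.
+
+```{r check hrp ethnicity non-response}
+# quick check: how much non-response/refusal does the HRP ethnicity item carry?
+kable(caption = "HRP ethnicity (all surveyed households, including NA/Refused)",
+      table(hhAttributesDT$ba_Q22_HRPethnicity, useNA = "always"))
+```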
+
+# Baseline Model (all households)
+
+We use IPF to create, for each OA (or LSOA):
+
+ * a weight per household for which we have navetas data in the test time period (this will filter out BMG drop-outs etc)
+ 
+such that the aggregate tables of the counts/proportions of the constraint variables used match the original census tables.
+
+Note that using IPF, the constraint fitted last will always fit perfectly so we order the constraints in terms of their contribution to the best fit model.
+
+## Run baseline ipf
+
+Next we run the ipf to create a big table of weights for all households (rows) for all OAs (columns). We reproduce the first few rows/columns below:
+
+```{r Baseline run initial all group ipf}
+# relies heavily on: http://robinlovelace.net/spatial-microsim-book/smsim-in-R.html
+# ipfp wants:
+# y = numeric vector of constraints for each area - the names must match the survey, data as counts, no ids or anything
+# which means we have to assume ipfp maintains order
+# A = the transposed survey data (row = var) in dummy variable form - names must match constraints, no ids etc, variables that match constraints only
+# which means we have to assume ipfp maintains order
+# x0 = numeric initial vector of length ncol
+
+# set order of variables - has to match (probably)
+
+# NB: we set the order of constraints here - we set n people as the last constraint so that the ipf algorithm fits it perfectly
+# This is a design choice and it is driven by the size of the coefficients in the lm models above.
+
+constraintsList <- c("hrpWhite", "hrpMixed", "hrpAsianBrit", "hrpBlackBrit", "hrpOtherEthnic",
+                     "hrp16_24", "hrp25_34", "hrp35_64", "hrp65plus",
+                     "econActEmplFt","econActEmplPt","econActSelfEmpl","econActUnemp","econActInactiveRetired","econActInactiveOther",
+                     "tenureOwned","tenureSocialRented","tenurePrivateRented",
+                     "cars_0","cars_1","cars_2","cars_3","cars_4m",
+                      "nchild_0","nchild_1m",
+                     "npeople_1", "npeople_2", "npeople_3", "npeople_4", "npeople_5", "npeople_6", "npeople_7", "npeople_8m"
+) # we use just these later - useful to have a list in the correct order
+
+# Survey setup - remove bmg_id - we have to trust that ipfp maintains the order!!! ----
+
+ipfSurveyDT <- testSurveyDT[, constraintsList, with=FALSE]
+
+# Census setup - remove oaCode - we have to trust that ipfp maintains the order!!! ----
+ipfConsDT <- oa2011consSAVEDT[, constraintsList, with=FALSE]
+
+oaCodes <- oa2011consSAVEDT[, oaCode]
+hhIdsDT <- testSurveyDT[, .(bmg_id)]
+
+# cons_dt = ipfConsDT,surv_dt = ipfSurveyDT, hhId_dt = hhIdsDT, oaList = oaCodes
+ba_runIPF <- function(cons_dt,surv_dt, hhId_dt, oaList){
+  # expects:
+  #   an area level constraints dt and a survey constraints dt (vars in same order)
+  #   a list of area codes & bmg_ids
+  # returns a long form weights dt with oacodes labelled
+  
+  # transpose the survey for ipf
+  surv_dtt <- t(surv_dt)
+  
+  # set area counts 0 to a very very small number
+  
+  #summary(ipfConsDT)
+  # decided not to do this as it prevents true zero weights so we cannot filter them out later
+  # cols <- names(ipfConsDT)
+  # for(col in cols){
+  #   ipfConsDT <- ipfConsDT[, (col) := ifelse( get(col) == 0, 0.00001, get(col))]
+  # }
+  #summary(ipfConsDT) # may look like 0 but in fact it's just a very very small number
+  
+  # force numeric just in case
+  cons_dtn <- apply(cons_dt, 2, as.numeric)
+  #summary(ipfConsDTn)
+  
+  # set up initial vector as a load of 1s (could be the response weights...)
+  x0 <- rep(1, nrow(surv_dt))
+  
+  # create 2D weight matrix (households by areas)
+  weights = array(NA, dim=c(nrow(surv_dt), nrow(cons_dt)))
+  
+  for(i in 1:ncol(weights)){
+    weights[,i] <- ipfp(cons_dtn[i,], surv_dtt, x0, maxit = 20, verbose = F) # maxit = 20 iterations (for now), no verbose - too much output
+  }
+  
+  
+  # this is where we have to assume ipfp maintained the oaCode order
+  wDT <- as.data.table(weights) # convert to dt
+  wDT <- setnames(wDT, oaList) # add col names
+  
+  # this is where we have to assume ipfp maintained the bmg_id order
+  dt <- cbind(hhId_dt, wDT) # add bmg_id to each row of weights (fills down, if you get recycling then there is a problem)
+  setkey(dt, bmg_id)
+  
+  # need to reshape the weightsDT into 1 column of weights with oacode set to oacode etc
+  longFormDT <- as.data.table(melt(dt, id=c("bmg_id")))
+  longFormDT <- setnames(longFormDT, c("variable", "value"), c("oaCode", "ipfWeight"))
+  wDT <- NULL # save mem
+  return(longFormDT)
+}
+
+t2 <- proc.time()
+print(paste0("Running ipf using ", nrow(ipfSurveyDT), " households for ", nrow(ipfConsDT), " areas."))
+
+print(paste0("Running ipf with: ", constraintsList))
+
+# set 0 counts to a very small number
+cols <- names(ipfConsDT)
+tmpDT <- copy(ipfConsDT) # copy() so the := below does not also modify ipfConsDT by reference
+for(col in cols){
+  tmpDT <- tmpDT[, (col) := ifelse( get(col) == 0, 0.00001, get(col))]
+}
+
+ipfConsDTnz <- tmpDT
+
+summary(ipfConsDTnz, digits = 2) # may look like 0 but in fact it's just a very very small number
+
+longFormDT <- ba_runIPF(ipfConsDTnz,ipfSurveyDT, hhIdsDT, oaCodes)
+t2tot <- proc.time() - t2
+elapsed <- t2tot[[3]]
+print(paste0("ipf baseline: ", round(elapsed,2), " seconds"))
+
+kable(caption = "First few rows/columns of ipf weights table", head(longFormDT))
+
+kable(caption = "Summary of ipf weights table", summary(longFormDT, digits = 2))
+
+print(paste0("Dimensions of ipf weights table: ", ba_tidyNum(nrow(longFormDT)), " rows x ", ncol(longFormDT), " cols"))
+
+```
+
+> Need to test that ipfp has preserved the ordering of the survey households and area codes so that our inferred linkage back to bmg_id/oaCode works (but how?)!
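+
+One partial check (a sketch, not a definitive test): if ipfp and the subsequent cbind/setnames preserved the OA order then the sum of the ipf weights for each OA should be very close to that OA's census household total, since the n people constraint is fitted last. A scrambled order would fail this for most OAs.
+
+```{r Baseline check weight ordering}
+# partial sanity check on ordering: per-OA weight sums vs census household totals
+checkOrderDT <- longFormDT[, .(sumWeights = sum(ipfWeight)), keyby = .(oaCode)]
+checkOrderDT <- checkOrderDT[, oaCode := as.character(oaCode)] # melt() left oaCode as a factor
+setkey(checkOrderDT, oaCode)
+checkOrderDT <- merge(checkOrderDT, oa2011consSAVEDT[, .(oaCode, nHhSpacesPeople)], by = "oaCode")
+checkOrderDT <- checkOrderDT[, diffHhs := sumWeights - nHhSpacesPeople]
+kable(caption = "Summary of per-OA differences between the sum of the ipf weights and the census household count",
+      summary(checkOrderDT[, .(sumWeights, nHhSpacesPeople, diffHhs)], digits = 2))
+```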
+
+This produces a long form data file with nrow = n households (bmg_id) x n OAs (!) so that bmg_ids are repeated and the weights are held in a single column (variable). This makes everything else a lot easier later. The weights are based on `r ba_tidyNum(uniqueN(longFormDT$bmg_id))` survey households.
+
+```{r Baseline summary of long form dt}
+
+
+st <- summary(longFormDT, digits = 2)
+kable(caption = "Summary of the long form ipf weights table", st)
+
+print("Number of OAs represented:")
+
+uniqueN(longFormDT$oaCode)
+
+```
+
+The dataset comprises `r ba_tidyNum(uniqueN(longFormDT$bmg_id))` SAVE survey households replicated (with weights) across `r ba_tidyNum(uniqueN(longFormDT$oaCode))` areas.
+
+## Test baseline household weighting for single output area
+
+Now let's pick an OA to test and add some survey data to it as we do so. The first few rows of the resulting table are reproduced below:
+
+```{r Baseline pick a test OA}
+setkey(longFormDT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+# use merge to only keep the ones that match
+oa1DT <- merge(longFormDT[oaCode == "E00085919"],testSurveyDT[, .(bmg_id,ba_censusNpeople.latest, ba_Q22_HRPethnicity)])
+
+kable(caption = "First few rows of single OA table (all households)", head(oa1DT))
+```
+
+The table of the number of households in each group should be identical for the original survey data and the unweighted ipf results for the OA unless some households have been removed because their weight was 0. If so, they are likely to be in rare categories.
+
+```{r Baseline test base tables}
+# the two base tables should be identical except where households get dropped as they have weight = 0
+kable(caption = "Unweighted survey (censusNpeople x ethnicity)", table(testSurveyDT$ba_Q22_HRPethnicity, testSurveyDT$ba_censusNpeople.latest, useNA = "always"))
+
+kable(caption = "Unweighted oa (censusNpeople x ethnicity)", table(oa1DT$ba_Q22_HRPethnicity, oa1DT$ba_censusNpeople.latest, useNA = "always"))
+
+```
+
+This also looks OK, as we would expect since n_people is fitted last (and so will fit perfectly). Note that some Black/Black British households have disappeared for this OA - their weight must have been 0.
+
+Now, when we calculate the weighted table for the OA we should get the same results as the OA row in the constraints table. If we have used more than 1 constraint we may get (hopefully small) errors.
+
+We do this first for the number of people per household.
+
+```{r Baseline compare weighted results - n people}
+
+kable(caption = "Census OA table of n people for E00085919",
+      head(oa2011HhSizeDT[oaCode == "E00085919", .(oaCode, npeople_1,npeople_2,	npeople_3,	npeople_4,	npeople_5,	npeople_6,	npeople_7,	npeople_8m)]
+      )
+)
+
+# but when we weight the oa table we should get the same counts as for that OA
+oa1DT <- oa1DT[, fakeCategory := "E00085919"]
+oa1SvyDes <- svydesign(ids = ~bmg_id, # use synthetic hh_id
+                                    weights = ~ipfWeight, 
+                                    data = oa1DT[!is.na(ipfWeight)] # make sure no NA weights
+                     )
+
+svyt <- svytable(~fakeCategory + ba_censusNpeople.latest, oa1SvyDes, 
+              exclude=NULL, 
+              na.action=na.pass
+              )
+
+kable(caption = "Weighted OA table of n people for E00085919", svyt)
+
+
+```
+
+Now repeat for ethnicity.
+
+```{r Baseline compare weighted results - ethnicity}
+kable(caption = "Census OA table of ethnicity for E00085919",
+      head(oa2011HrpEthnicityDT[oaCode == "E00085919", .(oaCode, hrpWhite, hrpMixed, hrpAsianBrit, hrpBlackBrit, hrpOtherEthnic)])
+)
+
+# repeat for ethnicity
+svyt <- svytable(~fakeCategory + ba_Q22_HRPethnicity, oa1SvyDes, 
+              exclude=NULL, 
+              na.action=na.pass
+              )
+
+kable(caption = "Weighted OA table of ethnicity for E00085919", svyt)
+
+
+
+oa1SvyDes <- NULL # remove to save mem
+```
+
+Are there any disparities? We would expect a few small ones. We can also see that we didn't need the 'Black/Black British' HRPs or the households with 8 or more members for this OA - hence their zero weighting.
+
+### Internal baseline validation
+
+We saw above that there may be small disparities in 'true' and 'simulated' household counts. To analyse this, we perform more systematic validation tests. 
+
+To do so we calculate the weighted counts for each OA for each constraint used and compare this with the original constraint counts sourced from the Census.
+
+This involves:
+
+ * creating a matrix of the weighted counts (for all areas) that result from the ipf
+ * joining this to the census constraint data
+ * calculating the matrix difference
+
+To produce correctly weighted household counts we need to use the weights as part of a survey (weighted) table calculation. As an example the following table shows the ethnic distributions by household size for the entire SAVE study region.
+
+```{r Baseline ipf - test all constraints}
+setkey(longFormDT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+
+reqVars <- c("bmg_id","bmgGroup.respF","ba_Q22_HRPethnicity","ba_Q2B_HRPage.fullSurvey",
+             "ba_Q2D_HRPemplType.latest","ba_censusNpeople.latest","ba_censusNcars",
+             "ba_censusTenure","ba_presenceChildren.latest", constraintsList)
+
+# add in variables we care about using merge so it only keeps the survey respondents who went through the ipf process
+
+longFormSvyDT <- merge(longFormDT, testSurveyDT[, reqVars, with = FALSE]) # you have no idea how long it took me to realise that I needed the with = FALSE. It's there, in the data.table FAQ, p3!
+
+# set svydesign
+longFormSvyDes <- svydesign(ids = ~bmg_id, # use synthetic hh_id
+                                    weights = ~ipfWeight, 
+                                    data = longFormSvyDT[!is.na(ipfWeight)] # make sure no NA weights
+                            )
+
+svyt <- svytable(~ba_Q22_HRPethnicity + ba_censusNpeople.latest, longFormSvyDes, 
+              exclude=NULL, 
+              na.action=na.pass
+              )
+kable(caption = "Ethnicity vs n people (all households, all OAs, SAVE study area, weighted ipf results)", svyt)
+
+# total number of households
+nHhsIpf <- margin.table(svyt)
+nHhsCensus <- sum(oa2011consSAVEDT$nHhSpacesPeople)
+```
+
+Note that we get fractional counts as this is weighted data.
+
+Based on this example table we have a simulated household population of `r ba_tidyNum(nHhsIpf)` households compared to the overall Census count of `r ba_tidyNum(nHhsCensus)` households (based on ethnicity table).
+
+We now turn to more systematic validation tests.
+
+```{r Baseline calculate weighted counts for all constraints for all OAs}
+
+# create a function to build weighted counts
+# this function is very slow - why?
+ba_SAVEcalculateWeightedCounts <- function(svyDes, dt, cl){
+  # expects a svyDes where:
+  ## every row is a household
+  ## ipfWeight = weight
+  ## the variables in cl (constraint list) exist
+  # dt which is the survey data used to create svyDes
+  tempDT <- dt[, .(nUnweightedHHs = .N), keyby = .(oaCode)]
+  for(c in cl){
+    # loop through constraint sub-categories
+    form <- as.formula(paste0("~oaCode + ", eval(c)))
+    svyT <- svytable(form, svyDes, 
+              exclude=NULL, 
+              na.action=na.pass
+              )
+    svyDT <- as.data.table(svyT)
+    # need to fiddle with the table as it is long form
+    svyDT <- svyDT[get(c) == 1] # keep the rows with 'Yes'
+    svyDT <- svyDT[, eval(c) := N] # set the variable to the N count
+    svyDT <- svyDT[, N := NULL] # remove N
+    setkey(svyDT, oaCode)
+    tempDT <- tempDT[svyDT]
+  }
+  return(tempDT)
+}
+
+# build weighted counts for each constraint category
+tempDT <- ba_SAVEcalculateWeightedCounts(longFormSvyDes, longFormDT, constraintsList)
+setkey(tempDT, oaCode)
+saeTestDT <- merge(tempDT,oa2011consSAVEDT,suffixes = c(".ipf", ".census"))
+
+tempDT <- saeTestDT[, .(oaCode)]
+setkey(tempDT, oaCode)
+
+for(c in constraintsList){
+  # loop through constraint sub-categories & calculate difference (error)
+  saeDT <- saeTestDT[, .(diff = get(paste0(c, ".ipf")) - get(paste0(c, ".census"))), keyby = .(oaCode)]
+  saeDT <- saeDT[, eval(c) := diff]
+  saeDT <- saeDT[, diff := NULL]
+  tempDT <- tempDT[saeDT]
+}
+
+
+
+testHrp <- c("hrpWhite", "hrpMixed", "hrpAsianBrit", "hrpBlackBrit", "hrpOtherEthnic",
+                     "hrp16_24", "hrp25_34", "hrp35_64", "hrp65plus" , "econActEmplFt","econActEmplPt","econActSelfEmpl","econActUnemp","econActInactiveRetired","econActInactiveOther"
+             )
+testHH <- c("tenureOwned","tenureSocialRented","tenurePrivateRented",
+                     "cars_0","cars_1","cars_2","cars_3","cars_4m",
+                     "nchild_0","nchild_1m",
+                     "npeople_1", "npeople_2", "npeople_3", "npeople_4", "npeople_5", "npeople_6", "npeople_7", "npeople_8m")
+
+kable(caption = "Summary of error table per HRP constraint category", t(summary(tempDT[, testHrp, with = FALSE], digits = 2)))
+
+kable(caption = "Summary of error table per household attribute constraint category", t(summary(tempDT[, testHH, with = FALSE], digits = 2)))
+```
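+
+The chunk above notes that `ba_SAVEcalculateWeightedCounts()` is slow. The main reason is the repeated `svytable()` calls: each one re-tabulates the whole weighted design. If the constraint variables are numeric 0/1 dummies the same weighted counts can, in principle, be obtained in a single `data.table` aggregation over the long form weights. The sketch below (not evaluated, and not a drop-in replacement for the function above) assumes `longFormSvyDT` holds `oaCode`, `ipfWeight` and the dummy constraint columns listed in `constraintsList`:
+
+```{r Baseline fast weighted counts sketch, eval=FALSE}
+# hedged sketch: weighted count of each constraint dummy per OA in one pass
+# sum(ipfWeight * dummy) over the households in an OA equals the svytable weighted count
+fastCountsDT <- longFormSvyDT[!is.na(ipfWeight),
+                              lapply(.SD, function(x) sum(ipfWeight * x)),
+                              keyby = .(oaCode),
+                              .SDcols = constraintsList]
+```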
+
+Notice the pattern of error. This is likely to reflect a combination of the order in which the constraints were applied and (probably) the scarcity of rare combinations of constraint categories in the survey data used.
+
+XX refs to lit here XX
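+
+One common way to summarise this kind of table is a single error statistic per area, for example the total absolute error (TAE): the sum over all constraint categories of the absolute difference between the simulated and Census counts, optionally scaled by the Census household count. A minimal sketch (not evaluated) using the per-category error table `tempDT` built above; `taeDT` is a hypothetical name:
+
+```{r Baseline TAE sketch, eval=FALSE}
+# hedged sketch: total absolute error (TAE) per OA plus a scaled version
+errorCols <- setdiff(names(tempDT), "oaCode")
+taeDT <- tempDT[, .(oaCode, tae = rowSums(abs(.SD))), .SDcols = errorCols]
+taeDT <- merge(taeDT, oa2011consSAVEDT[, .(oaCode, nHhSpacesPeople)], by = "oaCode")
+taeDT <- taeDT[, scaledTae := tae / nHhSpacesPeople] # scale by the Census household count
+summary(taeDT$scaledTae)
+```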
+
+
+## Save baseline weights
+
+The long form weights file can get very large at OA level (potentially one row per survey household per OA), so we keep only the household id, OA code and weight.
+
+```{r Baseline save weights}
+v <- paste0(version, "all_hhs")
+
+saveFileName <- paste0("SAVEipfHouseholdWeights_Baseline_v", v)
+  
+# keep only the ids and the weight - everything else can be re-merged from the survey data (saves space)
+keepWeightsDT <- longFormDT[, .(
+  bmg_id,
+  oaCode,
+  ipfWeight)
+]
+
+# saveFileName includes the version label set above to avoid over-writing different extracts
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving longFormDT to ", ofile))
+
+
+write_csv(keepWeightsDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(keepWeightsDT))
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+```
+
+That produced a file of `r ba_tidyNum(rawF/bytesToMbytes)` Mb (`r ba_tidyNum(gzipF/bytesToMbytes)` Mb gzipped).
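+
+Note that the shell-based `gzip` step above will fail on Windows (as the chunk comments warn). A cross-platform alternative would be to write through a base R `gzfile()` connection so that no external command is needed; a minimal sketch (not evaluated, and not what the chunk above actually does) is shown below. Depending on the version installed, `readr::write_csv()` may also compress automatically when given a path ending in `.csv.gz`.
+
+```{r Baseline gzfile sketch, eval=FALSE}
+# hedged sketch: write the weights straight to a gzipped csv using a base R connection
+gzOfile <- paste0(dPath, "processed/", saveFileName, ".csv.gz")
+con <- gzfile(gzOfile, open = "w")
+write.csv(keepWeightsDT, con, row.names = FALSE)
+close(con)
+print(paste0("Gzipped file size: ", file.size(gzOfile), " bytes"))
+```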
+
+# Model 1 (Trial Group 1: Control)
+
+In this model we repeat the process but this time using just Trial Group 1 (the control group).
+
+## Run ipf
+
+Re-run the ipf using the reduced household survey dataset (just TG1).
+
+```{r ipf model 1 TG 1}
+# as before but subset to Group 1 only
+ipfSurveyG1DT <- testSurveyDT[bmgGroup.respF == "BMG Group 1: Control", constraintsList, with = FALSE]
+
+oaCodes <- oa2011consSAVEDT[, oaCode] # stays the same
+hhIdsG1DT <- testSurveyDT[bmgGroup.respF == "BMG Group 1: Control", .(bmg_id)] # reduced list of ids
+
+t2 <- proc.time()
+print(paste0("Running ipf using ", nrow(ipfSurveyG1DT), " households for ", nrow(ipfConsDT), " areas."))
+
+print(paste0("Running ipf with: ", constraintsList))
+
+longFormG1DT <- ba_runIPF(ipfConsDTnz,ipfSurveyG1DT, hhIdsG1DT, oaCodes)
+t2tot <- proc.time() - t2
+elapsed <- t2tot[[3]]
+print(paste0("ipf 1: ", round(elapsed,2), " seconds"))
+
+kable(caption = "First few rows/columns of ipf weights table", head(longFormG1DT))
+
+kable(caption = "Summary of G1 ipf weights table", summary(longFormG1DT, digits = 2))
+
+print(paste0("Dimensions of ipf weights table: ", ba_tidyNum(nrow(longFormG1DT)), " rows x ", ncol(longFormG1DT), " cols"))
+
+print("Number of OAs represented (TG1):")
+
+uniqueN(longFormG1DT$oaCode)
+
+```
+
+This produces a long form weights file as above but this time based on just `r uniqueN(longFormG1DT$bmg_id)` survey households.
+
+
+```{r Model 1 pick a test OA}
+setkey(longFormG1DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+# use merge to only keep the ones that match
+# and where weight != 0
+oa1G1DT <- merge(longFormG1DT[oaCode == "E00085919"],testSurveyDT[bmgGroup.respF == "BMG Group 1: Control", .(bmg_id,ba_censusNpeople.latest, ba_Q22_HRPethnicity)])
+
+kable(caption = "First few rows of single OA table (BMG Group 1: Control)", head(oa1G1DT))
+
+
+```
+
+The table of the number of households in each group should be identical for the original survey data and the unweighted ipf results for the OA unless some households have been dropped because their weight = 0. If so, they are likely to be rare categories.
+
+```{r Model 1 test base tables}
+# the two base tables should be identical except where households get dropped as they have weight = 0
+kable(caption = "Unweighted survey (censusNpeople x ethnicity) - BMG Group 1: Control", table(testSurveyDT[bmgGroup.respF == "BMG Group 1: Control"]$ba_Q22_HRPethnicity, testSurveyDT[bmgGroup.respF == "BMG Group 1: Control"]$ba_censusNpeople.latest, useNA = "always"))
+
+kable(caption = "Unweighted oa (censusNpeople x ethnicity) - BMG Group 1: Control", table(oa1G1DT$ba_Q22_HRPethnicity, oa1G1DT$ba_censusNpeople.latest, useNA = "always"))
+
+```
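+
+Rather than only eyeballing the two tables we can difference them. A minimal sketch (not evaluated), assuming the two tables share the same dimensions and category levels:
+
+```{r Model 1 compare base tables sketch, eval=FALSE}
+# hedged sketch: compare the unweighted survey and single-OA tables cell by cell
+svyTab <- table(testSurveyDT[bmgGroup.respF == "BMG Group 1: Control"]$ba_Q22_HRPethnicity,
+                testSurveyDT[bmgGroup.respF == "BMG Group 1: Control"]$ba_censusNpeople.latest,
+                useNA = "always")
+oaTab <- table(oa1G1DT$ba_Q22_HRPethnicity, oa1G1DT$ba_censusNpeople.latest, useNA = "always")
+# any non-zero cells correspond to households dropped by the ipf (weight == 0)
+kable(caption = "Cells where the survey and OA counts differ (TG1)", svyTab - oaTab)
+```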
+
+
+## Repeat systematic error tests
+
+First add the survey data to the weighted households and then test ethnicity vs household size as before.
+
+```{r ipf 1 - test all constraints}
+setkey(longFormG1DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+
+reqVars <- c("bmg_id","bmgGroup.respF","ba_Q22_HRPethnicity","ba_Q2B_HRPage.fullSurvey",
+             "ba_Q2D_HRPemplType.latest","ba_censusNpeople.latest","ba_censusNcars",
+             "ba_censusTenure","ba_presenceChildren.latest", constraintsList)
+
+# add in variables we care about using merge so it only keeps the survey respondents who went through the ipf process
+
+longFormG1SvyDT <- merge(longFormG1DT, testSurveyDT[, reqVars, with = FALSE]) # should just keep the ids that went through ipf
+
+#uniqueN(longFormG1DTSvyDT$bmg_id)
+
+# set svydesign
+longFormG1SvyDes <- svydesign(ids = ~bmg_id, # use synthetic hh_id
+                              weights = ~ipfWeight, 
+                              data = longFormG1SvyDT[!is.na(ipfWeight)] # make sure no NA weights
+)
+
+svyt <- svytable(~ba_Q22_HRPethnicity + ba_censusNpeople.latest, longFormG1SvyDes, 
+                 exclude=NULL, 
+                 na.action=na.pass
+)
+kable(caption = "Ethnicity vs n people (all households, all OAs, SAVE study area, weighted ipf results)", svyt)
+
+# total number of households
+nHhsIpf <- margin.table(svyt)
+nHhsCensus <- sum(oa2011consSAVEDT$nHhSpacesPeople)
+```
+
+Note that we get fractional counts as this is weighted data, and we would expect more error as we are now using fewer households.
+
+Based on this example table we have a simulated household population of `r ba_tidyNum(nHhsIpf)` households compared to the overall Census count of `r ba_tidyNum(nHhsCensus)` households (based on ethnicity table).
+
+We now turn to more systematic validation tests.
+
+```{r model 1 calculate weighted counts for all constraints for all OAs}
+
+# build weighted counts for each constraint category
+tempDT <- ba_SAVEcalculateWeightedCounts(longFormG1SvyDes, longFormG1DT, constraintsList)
+setkey(tempDT, oaCode)
+saeTestDT <- merge(tempDT,oa2011consSAVEDT,suffixes = c(".ipf", ".census"))
+
+tempDT <- saeTestDT[, .(oaCode)]
+setkey(tempDT, oaCode)
+
+for(c in constraintsList){
+  # loop through constraint sub-categories & calculate difference (error)
+  saeDT <- saeTestDT[, .(diff = get(paste0(c, ".ipf")) - get(paste0(c, ".census"))), keyby = .(oaCode)]
+  saeDT <- saeDT[, eval(c) := diff]
+  saeDT <- saeDT[, diff := NULL]
+  tempDT <- tempDT[saeDT]
+}
+
+kable(caption = "Summary of error table per HRP constraint category", t(summary(tempDT[, testHrp, with = FALSE], digits = 2)))
+
+kable(caption = "Summary of error table per household attribute constraint category", t(summary(tempDT[, testHH, with = FALSE], digits = 2)))
+```
+
+As we can see, in some areas we get quite large errors.
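+
+To see where these large errors are concentrated we can rank the OAs by their largest single-category error. A minimal sketch (not evaluated), using the TG1 error table `tempDT` built above; `worstDT` is a hypothetical name:
+
+```{r Model 1 worst OAs sketch, eval=FALSE}
+# hedged sketch: list the OAs with the largest absolute error in any single constraint category
+errorCols <- setdiff(names(tempDT), "oaCode")
+worstDT <- tempDT[, .(oaCode, maxAbsError = do.call(pmax, lapply(.SD, abs))), .SDcols = errorCols]
+kable(caption = "OAs with the largest single-category error (TG1)", head(worstDT[order(-maxAbsError)]))
+```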
+
+## Save TG1 (Control) weights file
+
+```{r TG1_Control save weights}
+v <- paste0(version, "TG1_Control")
+
+saveFileName <- paste0("SAVEipfHouseholdWeights_v", v)
+  
+# keep only the ids and the weight - everything else can be re-merged from the survey data (saves space)
+keepWeightsDT <- longFormG1DT[, .(
+  bmg_id,
+  oaCode,
+  ipfWeight)
+]
+
+# saveFileName includes the version label set above to avoid over-writing different extracts
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving longFormG1DT to ", ofile))
+
+
+write_csv(keepWeightsDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(keepWeightsDT))
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+```
+
+That produced a file of `r ba_tidyNum(rawF/bytesToMbytes)` Mb (`r ba_tidyNum(gzipF/bytesToMbytes)` Mb gzipped).
+
+# Model 2 (Trial Group 2)
+
+In this model we repeat the process but this time using just Trial Group 2 (who received the event day notifications in TP1).
+
+## Run ipf
+
+```{r ipf model 2 TG 2}
+# as before but subset to Group 2 only
+ipfSurveyG2DT <- testSurveyDT[bmgGroup.respF == "BMG Group 2", constraintsList, with = FALSE]
+
+oaCodes <- oa2011consSAVEDT[, oaCode] # stays the same
+hhIdsG2DT <- testSurveyDT[bmgGroup.respF == "BMG Group 2", .(bmg_id)] # reduced list of ids
+
+t2 <- proc.time()
+print(paste0("Running ipf using ", nrow(ipfSurveyDT), " households for ", nrow(ipfConsDT), " areas."))
+
+print(paste0("Running ipf with: ", constraintsList))
+
+longFormG2DT <- ba_runIPF(ipfConsDTnz,ipfSurveyG2DT, hhIdsG2DT, oaCodes)
+t2tot <- proc.time() - t2
+elapsed <- t2tot[[3]]
+print(paste0("ipf 1: ", round(elapsed,2), " seconds"))
+
+kable(caption = "First few rows/columns of G2 ipf weights table", head(longFormG2DT))
+
+kable(caption = "Summary of G2 ipf weights table", summary(longFormG2DT, digits = 2))
+
+print(paste0("Dimensions of ipf weights table: ", ba_tidyNum(nrow(longFormG2DT)), " rows x ", ncol(longFormG2DT), " cols"))
+
+print("Number of OAs represented (TG2):")
+
+uniqueN(longFormG2DT$oaCode)
+
+```
+
+This produces a long form weights file as above but this time based on just `r uniqueN(longFormG2DT$bmg_id)` survey households.
+
+```{r Model 2 pick a test OA}
+setkey(longFormG2DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+# use merge to only keep the ones that match
+# and where weight != 0
+oa1G2DT <- merge(longFormG2DT[oaCode == "E00085919"],testSurveyDT[bmgGroup.respF == "BMG Group 2", .(bmg_id,ba_censusNpeople.latest, ba_Q22_HRPethnicity)])
+
+kable(caption = "First few rows of single OA table (BMG Group 2)", head(oa1G2DT))
+
+
+```
+
+The table of the number of households in each group should be identical for the original survey data and the unweighted ipf results for the OA unless some households have been dropped because their weight = 0. If so, they are likely to be rare categories.
+
+```{r Model 2 test base tables}
+# the two base tables should be identical except where households get dropped as they have weight = 0
+kable(caption = "Unweighted survey (censusNpeople x ethnicity) - BMG Group 2", table(testSurveyDT[bmgGroup.respF == "BMG Group 2"]$ba_Q22_HRPethnicity, testSurveyDT[bmgGroup.respF == "BMG Group 2"]$ba_censusNpeople.latest, useNA = "always"))
+
+kable(caption = "Unweighted oa (censusNpeople x ethnicity) - BMG Group 2", table(oa1G2DT$ba_Q22_HRPethnicity, oa1G2DT$ba_censusNpeople.latest, useNA = "always"))
+
+```
+
+## Repeat systematic error tests
+
+First add the survey data to the weighted households and then test ethnicity vs household size as before.
+
+```{r ipf 2 - test all constraints}
+setkey(longFormG2DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+
+reqVars <- c("bmg_id","bmgGroup.respF","ba_Q22_HRPethnicity","ba_Q2B_HRPage.fullSurvey",
+"ba_Q2D_HRPemplType.latest","ba_censusNpeople.latest","ba_censusNcars",
+"ba_censusTenure","ba_presenceChildren.latest", constraintsList)
+
+# add in variables we care about using merge so it only keeps the survey respondents who went through the ipf process
+
+longFormG2SvyDT <- merge(longFormG2DT, testSurveyDT[, reqVars, with = FALSE]) # should just keep the ids that went through ipf
+
+#uniqueN(longFormG2DTSvyDT$bmg_id)
+
+# set svydesign
+longFormG2SvyDes <- svydesign(ids = ~bmg_id, # use synthetic hh_id
+                              weights = ~ipfWeight, 
+                              data = longFormG2SvyDT[!is.na(ipfWeight)] # make sure no NA weights
+)
+
+svyt <- svytable(~ba_Q22_HRPethnicity + ba_censusNpeople.latest, longFormG2SvyDes, 
+                 exclude=NULL, 
+                 na.action=na.pass
+)
+kable(caption = "Ethnicity vs n people (all households, all OAs, SAVE study area, weighted ipf results)", svyt)
+
+# total number of households
+nHhsIpf <- margin.table(svyt)
+nHhsCensus <- sum(oa2011consSAVEDT$nHhSpacesPeople)
+```
+
+Note that we get fractional counts as this is weighted data, and we would expect more error as we are now using fewer households.
+
+Based on this example table we have a simulated household population of `r ba_tidyNum(nHhsIpf)` households compared to the overall Census count of `r ba_tidyNum(nHhsCensus)` households (based on ethnicity table).
+
+We now turn to more systematic validation tests.
+
+```{r model 2 calculate weighted counts for all constraints for all OAs}
+
+# build weighted counts for each constraint category
+tempDT <- ba_SAVEcalculateWeightedCounts(longFormG2SvyDes, longFormG2DT, constraintsList)
+setkey(tempDT, oaCode)
+saeTestDT <- merge(tempDT,oa2011consSAVEDT,suffixes = c(".ipf", ".census"))
+
+tempDT <- saeTestDT[, .(oaCode)]
+setkey(tempDT, oaCode)
+
+for(c in constraintsList){
+  # loop through constraint sub-categories & calculate difference (error)
+  saeDT <- saeTestDT[, .(diff = get(paste0(c, ".ipf")) - get(paste0(c, ".census"))), keyby = .(oaCode)]
+  saeDT <- saeDT[, eval(c) := diff]
+  saeDT <- saeDT[, diff := NULL]
+  tempDT <- tempDT[saeDT]
+}
+
+kable(caption = "Summary of error table per HRP constraint category", t(summary(tempDT[, testHrp, with = FALSE], digits = 2)))
+
+kable(caption = "Summary of error table per household attribute constraint category", t(summary(tempDT[, testHH, with = FALSE], digits = 2)))
+```
+
+As we can see, in some areas we get quite large errors.
+
+## Save weights file
+
+```{r TG2_Control save weights}
+v <- paste0(version, "TG2")
+
+saveFileName <- paste0("SAVEipfHouseholdWeights_v", v)
+
+# keep only the ids and the weight - everything else can be re-merged from the survey data (saves space)
+keepWeightsDT <- longFormG2DT[, .(
+  bmg_id,
+  oaCode,
+  ipfWeight)
+  ]
+
+# saveFileName includes the version label set above to avoid over-writing different extracts
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving longFormG2DT to ", ofile))
+
+
+write_csv(keepWeightsDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(keepWeightsDT))
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+```
+
+That produced a file of `r ba_tidyNum(rawF/bytesToMbytes)` Mb (`r ba_tidyNum(gzipF/bytesToMbytes)` Mb gzipped).
+
+
+
+
+# Model 3 (Trial Group 3)
+
+In this model we repeat the process but this time using just Trial Group 3 (who received the event day notifications in TP1).
+
+## Run ipf
+
+```{r ipf model 3 TG 3}
+# as before but subset to Group 3 only
+ipfSurveyG3DT <- testSurveyDT[bmgGroup.respF == "BMG Group 3", constraintsList, with = FALSE]
+
+oaCodes <- oa2011consSAVEDT[, oaCode] # stays the same
+hhIdsG3DT <- testSurveyDT[bmgGroup.respF == "BMG Group 3", .(bmg_id)] # reduced list of ids
+
+t2 <- proc.time()
+print(paste0("Running ipf using ", nrow(ipfSurveyDT), " households for ", nrow(ipfConsDT), " areas."))
+
+print(paste0("Running ipf with: ", constraintsList))
+
+longFormG3DT <- ba_runIPF(ipfConsDTnz,ipfSurveyG3DT, hhIdsG3DT, oaCodes)
+t2tot <- proc.time() - t2
+elapsed <- t2tot[[3]]
+print(paste0("ipf 1: ", round(elapsed,2), " seconds"))
+
+kable(caption = "First few rows/columns of G3 ipf weights table", head(longFormG3DT))
+
+kable(caption = "Summary of G3 ipf weights table", summary(longFormG3DT, digits = 2))
+
+print(paste0("Dimensions of G3 ipf weights table: ", ba_tidyNum(nrow(longFormG3DT)), " rows x ", ncol(longFormG3DT), " cols"))
+
+print("Number of OAs represented (TG3):")
+
+uniqueN(longFormG3DT$oaCode)
+```
+
+This produces a long form weights file as above but this time based on just `r uniqueN(longFormG3DT$bmg_id)` survey households.
+
+```{r Model 3 pick a test OA}
+setkey(longFormG3DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+# use merge to only keep the ones that match
+# and where weight != 0
+oa1G3DT <- merge(longFormG3DT[oaCode == "E00085919"],testSurveyDT[bmgGroup.respF == "BMG Group 3", .(bmg_id,ba_censusNpeople.latest, ba_Q22_HRPethnicity)])
+
+kable(caption = "First few rows of single OA table (BMG Group 3)", head(oa1G3DT))
+```
+
+The table of the number of households in each group should be identical for the original survey data and the unweighted ipf results for the OA unless some households have been dropped because their weight = 0. If so, they are likely to be rare categories.
+
+```{r Model 3 test base tables}
+# the two base tables should be identical except where households get dropped as they have weight = 0
+kable(caption = "Unweighted survey (censusNpeople x ethnicity) - BMG Group 3", table(testSurveyDT[bmgGroup.respF == "BMG Group 3"]$ba_Q22_HRPethnicity, testSurveyDT[bmgGroup.respF == "BMG Group 3"]$ba_censusNpeople.latest, useNA = "always"))
+
+kable(caption = "Unweighted oa (censusNpeople x ethnicity) - BMG Group 3", table(oa1G3DT$ba_Q22_HRPethnicity, oa1G3DT$ba_censusNpeople.latest, useNA = "always"))
+
+```
+
+## Repeat systematic error tests
+
+First add the survey data to the weighted households and then test ethnicity vs household size as before.
+
+```{r ipf 3 - test all constraints}
+setkey(longFormG3DT, bmg_id)
+setkey(testSurveyDT, bmg_id)
+
+reqVars <- c("bmg_id","bmgGroup.respF","ba_Q22_HRPethnicity","ba_Q2B_HRPage.fullSurvey",
+"ba_Q2D_HRPemplType.latest","ba_censusNpeople.latest","ba_censusNcars",
+"ba_censusTenure","ba_presenceChildren.latest", constraintsList)
+
+# add in variables we care about using merge so it only keeps the survey respondents who went through the ipf process
+
+longFormG3SvyDT <- merge(longFormG3DT, testSurveyDT[, reqVars, with = FALSE]) # should just keep the ids that went through ipf
+
+#uniqueN(longFormG3DTSvyDT$bmg_id)
+
+# set svydesign
+longFormG3SvyDes <- svydesign(ids = ~bmg_id, # use synthetic hh_id
+                              weights = ~ipfWeight, 
+                              data = longFormG3SvyDT[!is.na(ipfWeight)] # make sure no NA weights
+)
+
+svyt <- svytable(~ba_Q22_HRPethnicity + ba_censusNpeople.latest, longFormG3SvyDes, 
+                 exclude=NULL, 
+                 na.action=na.pass
+)
+kable(caption = "Ethnicity vs n people (all households, all OAs, SAVE study area, weighted ipf results)", svyt)
+
+# total number of households
+nHhsIpf <- margin.table(svyt)
+nHhsCensus <- sum(oa2011consSAVEDT$nHhSpacesPeople)
+```
+
+Note that we get fractional counts as this is weighted data, and we would expect more error as we are now using fewer households.
+
+Based on this example table we have a simulated household population of `r ba_tidyNum(nHhsIpf)` households compared to the overall Census count of `r ba_tidyNum(nHhsCensus)` households (based on ethnicity table).
+
+We now turn to more systematic validation tests.
+
+```{r model 3 calculate weighted counts for all constraints for all OAs}
+
+# build weighted counts for each constraint category
+tempDT <- ba_SAVEcalculateWeightedCounts(longFormG3SvyDes, longFormG3DT, constraintsList)
+setkey(tempDT, oaCode)
+saeTestDT <- merge(tempDT,oa2011consSAVEDT,suffixes = c(".ipf", ".census"))
+
+tempDT <- saeTestDT[, .(oaCode)]
+setkey(tempDT, oaCode)
+
+for(c in constraintsList){
+  # loop through constraint sub-categories & calculate difference (error)
+  saeDT <- saeTestDT[, .(diff = get(paste0(c, ".ipf")) - get(paste0(c, ".census"))), keyby = .(oaCode)]
+  saeDT <- saeDT[, eval(c) := diff]
+  saeDT <- saeDT[, diff := NULL]
+  tempDT <- tempDT[saeDT]
+}
+
+kable(caption = "Summary of error table per HRP constraint category", t(summary(tempDT[, testHrp, with = FALSE], digits = 2)))
+
+kable(caption = "Summary of error table per household attribute constraint category", t(summary(tempDT[, testHH, with = FALSE], digits = 2)))
+```
+
+As we can see, in some areas we get quite large errors.
+
+## Save weights file
+
+```{r TG3_Control save weights}
+v <- paste0(version, "TG3")
+
+saveFileName <- paste0("SAVEipfHouseholdWeights_v", v)
+
+# keep only the ids and the weight - everything else can be re-merged from the survey data (saves space)
+keepWeightsDT <- longFormG3DT[, .(
+  bmg_id,
+  oaCode,
+  ipfWeight)
+  ]
+
+# saveFileName includes the version label set above to avoid over-writing different extracts
+ofile <- paste0(dPath,"processed/",saveFileName,".csv")
+print(paste0("Saving longFormG3DT to ", ofile))
+
+
+write_csv(keepWeightsDT, ofile)
+rawF <- file.size(ofile)
+
+print("Variables saved:")
+print(names(keepWeightsDT))
+
+# remove old gzip file
+cmd <- paste0("rm ", ofile, ".gz")
+try(system(cmd)) # will fail on first run
+# now gzip new one
+print(paste0("gzipping file to: ", ofile, ".gz"))
+cmd <- paste0("gzip ", ofile)
+try(system(cmd)) # in case it fails (it will on windows - you will be left with a .csv file)
+gzipF <- file.size(paste0(ofile, ".gz"))
+```
+
+That produced a file of `r ba_tidyNum(rawF/bytesToMbytes)` Mb (`r ba_tidyNum(gzipF/bytesToMbytes)` Mb gzipped).
+
+# Summary
+
+Well, it seems to work, but the models based on the trial groups are (as expected) strongly affected by a few households with 'odd' habits (see the sketch after the list below). Solutions to this might include:
+
+ * excluding them
+ * err
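+
+One practical way to identify candidates for exclusion is to look for households that receive unusually large ipf weights (and so dominate the estimates for some areas). A minimal sketch (not evaluated), using the Model 1 weights table `longFormG1DT`; the 99th percentile threshold is purely illustrative:
+
+```{r Summary flag extreme weights sketch, eval=FALSE}
+# hedged sketch: flag households that carry unusually large ipf weights in any OA
+weightCutoff <- quantile(longFormG1DT$ipfWeight, probs = 0.99, na.rm = TRUE)
+oddHhsDT <- longFormG1DT[ipfWeight > weightCutoff,
+                         .(nOAs = uniqueN(oaCode), maxWeight = max(ipfWeight)),
+                         keyby = .(bmg_id)]
+kable(caption = "Households with ipf weights above the 99th percentile", head(oddHhsDT))
+```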
+
+```{r run to here}
+```
+
+
+# Runtime
+
+
+```{r check runtime}
+t <- proc.time() - startTime
+
+elapsed <- t[[3]]
+```
+
+Analysis completed in `r elapsed` seconds (`r round(elapsed/60,2)` minutes).
+
+```{r run to end}
+```
+
+# References