version 1 of census 2013 meshblock data file processing: runs tests, saves MB...

version 1 of census 2013 meshblock data file processing: runs tests, saves MB file and aggreates to & saves AU file

version 1 of census 2013 meshblock data file processing: runs tests, saves MB...
12bcf187 · Ben Anderson · e6afdf2c · 12bcf187 · 12bcf187 · 12bcf187
Commit 12bcf187 authored Feb 6, 2019 by Ben Anderson
--- a/dataProcessing/nzCensus2013Test.Rmd
+++ b/dataProcessing/nzCensus2013Test.Rmd
@@ -27,7 +27,9 @@ myPackages <- c("readr",
              "data.table",
              "skimr",
              "ggplot2",
-              "plotly")
+              "plotly", 
+              "dkUtils",
+              "kableExtra")

 spatialec::loadLibraries(myPackages)

@@ -38,12 +40,14 @@ root <- spatialec::findParentDirectory("spatialec")
 source(paste0(root, "/_scripts/spatialecSettings.R"))   # loads knitr options & assigns file paths and loads system info

 # Print system information
-outputMessage("Running on", lParams$sysName, "with projLoc =", root)   # sysName is set in spatialecSettings.R
+outputMessage("Running on", lParams$sysName, "from projLoc =", root)   # sysName is set in spatialecSettings.R
 ```

 # Report Purpose

-To load, test & process NZ Census data at meshblock and other levels as preparation for a spatial microsimulation model.
+To load, test & process NZ Census data at meshblock and other levels as preparation for a spatial microsimulation model. Data source:
+
+ * http://archive.stats.govt.nz/Census/2013-census/data-tables/meshblock-dataset.aspx

 # Load data

@@ -58,102 +62,320 @@ kableExtra::kable(dFiles, caption = "Census 2013 Mesh block level files")

 In this report we focus on households/families/dwellings not individuals as the spatial microsimulation will operate at the household level.

-## Dwellings
+Prelimniary inspection of the files shows that they have:

-```{r loadDwellings}
-nzCensusDwellingsDF <- readr::read_csv(paste0(lParams$dataPath,"raw/2013-mb-dataset-Total-New-Zealand-Dwelling.csv")) 
+ * data for several different levels (meshblocks, area units, regions etc) in the same file
+ * data from different years in the same file
+ * a wide format which makes manipulation more difficult in R

-nzCensusDwellingsDT <- data.table::as.data.table(nzCensusDwellingsDF)
+To make life easier, we next:

-nzCensusDwellingsDT <- nzCensusDwellingsDT[, codeInt := as.integer(Code)] # set code to int for ordering
+ * load all the data files and merge them on the area `Code` producing a very wide file;
+ * convert this file to a long form;
+ * add a `year` variable for easier aggregation;
+ * do various other data cleaning

-# Meshblocks
-meshblocksDT <- nzCensusDwellingsDT[Area_Code_and_Description %like% "MB " & is.na(Description)]
-meshblocksDT <- meshblocksDT[, codeInt := as.integer(Code)] # set code to int for ordering
-nMeshBlocks <- nrow(meshblocksDT)
+```{r loadData}
+f <- paste0(lParams$dataPath,"raw/2013-mb-dataset-Total-New-Zealand-Dwelling.csv")
+outputMessage(paste0("Loading: ", f))
+dwDT <- spatialec::loadData(f)
+outputMessage(paste0("N rows loaded: ", nrow(dwDT)))

-# Districts & Cities
-districtsDT <- nzCensusDwellingsDT[Description %like% "District" |
-                                         Description %like% "City" |
-                                         Description %like% "Territory" |
-                                         Description == "Auckland"] # just to be difficult
-districtsDT <- districtsDT[nchar(Code) < 4] # remove any that slipped through but which are not districts
+f <- paste0(lParams$dataPath,"raw/2013-mb-dataset-Total-New-Zealand-Family.csv")
+outputMessage(paste0("Loading: ", f))
+faDT <- spatialec::loadData(f)
+outputMessage(paste0("N rows loaded: ", nrow(faDT)))

-# Regions
-regionsDT <- nzCensusDwellingsDT[Description %like% "Region" & !(Description %like% "Oceanic")]
+f <- paste0(lParams$dataPath,"raw/2013-mb-dataset-Total-New-Zealand-Household.csv")
+outputMessage(paste0("Loading: ", f))
+hhDT <- spatialec::loadData(f)
+outputMessage(paste0("N rows loaded: ", nrow(hhDT)))
+```

-nRegions <- nrow(regionsDT)

-# Oceanic regions (offshore?)
-oceanicRegionsDT <- nzCensusDwellingsDT[Description %like% "Oceanic"]
-nOceanicRegions <- nrow(oceanicRegionsDT)
+```{r mergeDataFiles}
+# first merge them to get a mondo wide file
+outputMessage("Merging loaded files")

-# Auckland Board areas
-aucklandBoardAreasDT <- nzCensusDwellingsDT[Description %like% "Board Area"]
-nBoardAreas <- nrow(aucklandBoardAreasDT)
+# remove cols we don't need first - saves memory but also stops melt getting confused later
+faDT <- faDT[, c("Area_Code_and_Description", "Description", "Code") := NULL]
+hhDT <- hhDT[, c("Area_Code_and_Description", "Description", "Code") := NULL]
+dt <- dwDT[faDT][hhDT]

-nZones <- "? how to filter ?"
 ```

-The first (slightly suprising) things to note are:
+NB in order to get the label text re-coding to work we had to remove some strange multi-byte characters from the column names of the Family file (income range columns).
+
+Next we add the area level labels for 2013 using the relevant areas file. First we check the area level file (http://archive.stats.govt.nz/browse_for_stats/Maps_and_geography/Geographic-areas/geographic-area-files.aspx#2013) to see how many meshblocks we should expect.
+
+```{r loadAreas}
+areasDT <- fread(paste0(lParams$gisPath, "2013/2013_Areas_Table.txt"))
+setkey(areasDT, MB2013_code)
+```
+
+The area level file has:
+
+ * MB 2001: `r dkUtils::tidyNum(uniqueN(areasDT$MB2001_code))` unique meshblock codes
+ * MB 2006: `r dkUtils::tidyNum(uniqueN(areasDT$MB2006_code))` unique meshblock codes
+ * MB 2013: `r dkUtils::tidyNum(uniqueN(areasDT$MB2013_code))` unique meshblock codes
+
+```{r addAreaLabels}
+# do this using the 2013 code - will have the side effect of dropping all non MB rows
+# just keep a list of 2013 meshblocks - we can add the other on later
+mb2013ListDT <- areasDT[, .(MB2013_code, UA2013_code, UA2013_label, WARD2013_code, WARD2013_label, REGC2013_label)] 
+setkey(mb2013ListDT, MB2013_code)
+
+# the census data file has meshblocks and all sorts in it. We just want the meshblocks.
+# Luckily the Area_Code_and_Description column can help
+dt <- dt[Area_Code_and_Description %like% "MB ", 
+         MB2013_code := substr(Area_Code_and_Description, 
+                               4, # i.e. chops off "MB "
+                               nchar(Area_Code_and_Description)
+                               )
+         ] # creates a new column with our imputed meshblock 2013 code
+
+# how many did we get?
+message("Unique meshblock codes in the census dataset: ", dkUtils::tidyNum(uniqueN(dt$MB2013_code)))

- * the data file has 2001, 2006 and 2013 data
- * all levels of geography are included (meshblocks, zones, regions, Auckland Board Areas)
- * there are `r nMeshBlocks` meshblocks
- * there are `r nZones` zones
- * there are `r nRegions` regions
- * there are `r nOceanicRegions` Oceanic regions
- * there are `r nBoardAreas` Auckland Board areas
+# set to integer for matching to areasDT
+dt <- dt[,MB2013_code := as.integer(MB2013_code)]
+# we can now extract all rows that are 2013 meshblock data
+meshblocksDT <- dt[!is.na(MB2013_code)]
+setkey(meshblocksDT, MB2013_code)
+# how many did we get?
+message("Unique meshblock codes extracted from the census dataset: ", dkUtils::tidyNum(uniqueN(meshblocksDT$MB2013_code)))

-```{r regionsList}
-kableExtra::kable(caption = "Regions", 
-                  regionsDT[, .(Area_Code_and_Description,
-                                privDwellings2001 = `2001_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2006 = `2006_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2013 = `2013_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`)][order(regionsDT$codeInt)])
+# merge our meshblocks data to the area labels
+
+system.time(
+  labeledDT <- mb2013ListDT[meshblocksDT]
+  ) # this will drop anything that does not match a 2013 meshblock code in mb2013ListDT
+
+# how many did we get?
+message("Unique meshblock codes extracted from the census dataset and linked to labels: ", dkUtils::tidyNum(uniqueN(labeledDT$MB2013_code)))
+
+# how many did we get?
+message("N rows in the labelled census 2013 meshblocks dataset: ", dkUtils::tidyNum(nrow(labeledDT)))
+
+# reminder...
+message("Reminder: N census 2013 meshblocks in the areas file = ", dkUtils::tidyNum(uniqueN(areasDT$MB2013_code)))
 ```

-Population (private dwelling) growth has clearly been concentrated in some places more than others - see Figure \@ref(fig:regionDwellingsPlot).
+If the last two values are not the same then we must have duplicated Census 2013 meshblock labels in the data file. This might happen if a mesh block from 2001 or 2006 was split into several 2013 meshblocks due to population growth. We need to test this and remove duplicates otherwise there will be errors in our 2013 aggregate counts.
+
+```{r testDuplicate2013Meshblocks}
+
+# test for duplicates 
+labeledDT <- labeledDT[, dups := duplicated(MB2013_code)]
+
+kableExtra::kable(table(labeledDT$dups), caption = "Do we have duplicated MB2013_codes?")%>%
+  kable_styling()
+
+message("We do indeed have duplicates. We need to check the 2013 data is also duplicated for these rows so we")
+message("pick a good 2013 data cell - total occupied dwellings should always have data - check.")
+summary(labeledDT$`2013_Census_dwelling_record_type_for_occupied_dwellings_Total_occupied_dwellings`)
+
+labeledDT <- labeledDT[, dataDups := duplicated(MB2013_code, `2013_Census_dwelling_record_type_for_occupied_dwellings_Total_occupied_dwellings`)]
+
+kableExtra::kable(table(labeledDT$dataDups), caption = "Do we have duplicated MB2013_code data (N occupied dwellings as an example)?")%>%
+  kable_styling()
+
+message("Looks like it, so we need to keep just the non-duplicated rows")

-```{r regionDwellingsPlot, fig.cap="Number of occupied private dwellings by region (000s)"}
+labeledDT$dataDups <- NULL
+labeledDT$dups <- NULL
+
+meshblocksDT <- unique(labeledDT, by = "MB2013_code")
+labeledDT <- NULL # drop
+message("How many rows do we have now? ", dkUtils::tidyNum(nrow(meshblocksDT)))
+message("How many unique 2013 meshblock codes do we have now? ", dkUtils::tidyNum(uniqueN(meshblocksDT$MB2013_code)))
+
+```
+
+
+Now we convert the wide form data to long to make data manipulation and aggregation easier.
+
+```{r meltData}
+# now use the fast data.table melt() function to convert to long form
+outputMessage("Converting merged wide file to long form")
+# drop all the labels vars - we can re-add when needed
+meshblocksDT <- meshblocksDT[, c("UA2013_code", "UA2013_label", "WARD2013_code", "WARD2013_label", 
+                                 "REGC2013_label", "Code", "code.Int", "Description") := NULL]
+system.time(longMbDT <- data.table::melt(meshblocksDT, 
+                                       id.vars = c("Area_Code_and_Description", "MB2013_code"),
+                                       variable.name = "variable_orig", 
+                                       value.name = "value")
+            )
+message("How many unique 2013 meshblock codes do we have in the long form data? ", dkUtils::tidyNum(uniqueN(longMbDT$MB2013_code)))
+
+```
+
+Now fix the variable label for ease of aggregation.
+
+```{r fixData}
+# now we have long form data.table so many other things are easy
+outputMessage("Set years")
+system.time(longMbDT <- spatialec::setYears(longMbDT))
+
+# now we can chop off the number from the start of the variable text to avoid confusion and make
+# for easier aggregation
+outputMessage("Re-code count variable labels to remove leading redundant text - 20XX_Census_")
+longMbDT <- longMbDT[, varText := as.character(variable_orig)] # just to be sure
+system.time(longMbDT <- spatialec::fixCountLabels(longMbDT))
+longMbDT <- longMbDT[, varText := NULL] # no longer needed
+
+# Now we have:
+st <- head(longMbDT,20)
+
+kableExtra::kable(st, caption = "Initial long form data (first 10 rows)") %>%
+  kable_styling()
+
+# set the key for faster extraction
+setkey(longMbDT, year, countLabel)
+
+```
+
+We will still have `..C` values where data has been redacted and `*` values where the count was not able to be calculated. These will cause data aggregating problems so we need to set them to NA and make sure the counts are numeric (not characters).
+
+```{r fixCounts}
+st <- summary(longMbDT)
+kableExtra::kable(st, caption = "Summary of long form data before ..C & * fixed") %>%
+  kable_styling()
+
+system.time(longMbDT <- longMbDT[, valueAsNum := as.numeric(value)]) # ..C & * get set to NA
+
+st <- summary(longMbDT)
+kableExtra::kable(st, caption = "Summary of long form data after ..C & * fixed") %>%
+  kable_styling()
+```

-plotDT <- regionsDT[, .(Area_Code_and_Description,
-                                privDwellings2001 = `2001_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2006 = `2006_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2013 = `2013_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`)]
+## Data tests

-dt <- reshape2::melt(plotDT)
-dt <- dt[variable %like% "2001", year := 2001]
-dt <- dt[variable %like% "2006", year := 2006]
-dt <- dt[variable %like% "2013", year := 2013]
+We can now run some test aggregation at the whole population level. Ideally the modelling requires exact counts which sum across categories to the same total (number of dwellings, households or families). However, it is quite likely that the number of households/dwellings/families will differ at the meshblock and aggregate levels due to disclosure control which may have:

-p <- ggplot2::ggplot(dt, aes(x = year, y = value/1000, colour = Area_Code_and_Description, group = Area_Code_and_Description)) +
+ * rounded cell counts;
+ * swapped records;
+ * not able to be completed (originally represented by the ‘*’ symbol);
+ * completely redacted cell counts (originally represented by the ‘..C’ symbol).
+
+As a result the aggregations below will not exactly match the official published figures. 
+
+```{r totalsTests}
+setkey(longMbDT, countLabel)
+# create test dt for simplicity
+tdt <- longMbDT[countLabel == "occupied_private_dwelling_type_Total_occupied_private_dwellings"|
+                  countLabel == "total_families_in_occupied_private_dwellings"|
+                  countLabel == "total_households_in_occupied_private_dwellings"]
+st <- tdt[,.(Total = sum(valueAsNum, na.rm = TRUE)), keyby = .(year, countLabel)]
+
+kableExtra::kable(st, caption = "Total counts for selection of 'totals' - mesh blocks") %>%
+  kable_styling()
+
+ggplot2::ggplot(st, aes(x = year, y = Total/1000000, colour = countLabel)) +
  geom_line() +
-  labs(y = "N (000s)")
+  theme(legend.position = "bottom") +
+  guides(colour=guide_legend(ncol=1, title = "Variable")) +
+  labs(y = "Total (m)")
+```
+
+According to the original census data the actual totals (without redaction) are:
+
+ * Dwellings (occupied_private_dwelling_type_Total_occupied_private_dwellings): 1561959
+ * Families (total_families_in_occupied_private_dwellings) : 1136397
+ * Households (total_households_in_occupied_private_dwellings): 1549890
+ 
+We now repeat the test aggregation but at the regional level using the linked area labels. Note that these aggregations will also not exactly match the official published figures due to the data redaction described above.
+
+```{r totalsTestsByRegion}
+setkey(tdt, MB2013_code)
+
+tdt <- areasDT[tdt]
+
+setkey(tdt, countLabel, REGC2013_label)
+
+st <- tdt[,.(Total = sum(valueAsNum)), keyby = .(year, countLabel, REGC2013_label)]
+
+p <- ggplot2::ggplot(st, aes(x = year, y = Total/1000, colour = REGC2013_label)) +
+  geom_line() +
+  facet_grid(countLabel ~ .) +
+  guides(colour=guide_legend(title = "Region")) +
+  labs(y = "Total (,000s)")
+
+p

 plotly::ggplotly(p)
 ```

+These trends look fairly reasonable...
+
+## Save processed data
+
+Most of the modelling work requires meshblock 2013 data so we extract just this subset.
+
+> XX Nb some values are counts and some are (median) income values XX
+
+```{r save2013MB}
+oFile <- paste0(lParams$dataPath, "processed/meshblock2013.csv")
+outputMessage("Saving meshblocks file to ", oFile)
+mb2013DT <- longMbDT[year == 2013, .(MB2013_code, year, variable = countLabel, value = valueAsNum)]
+data.table::fwrite(mb2013DT, oFile) # only keep final useful variables
+message("Saved ", dkUtils::tidyNum(nrow(mb2013DT)), " rows of data.")
+message("Variables saved: ")
+summary(mb2013DT)
+message("Remember that some values might be (median) income values not counts... NAs are what used to be ..C or *")
+nr <- nrow(mb2013DT)
+nna <- nrow(mb2013DT[is.na(value)])
+pcna <- 100*(nna/nr)
+message("Of the ", dkUtils::tidyNum(nrow(mb2013DT)), " value cells ", round(pcna,2), "% are NA... Let's hope they are not the variables we need!")
+
+outputMessage("Compressing saved data file")
+dkUtils::gzipIt(oFile)
+```
+
+ 
+## Aggregation to area units
+
+Aggregation to area units will enable the modelling to run faster (fewer units to model) and will also partially resolve the redacted data problem although it will not infill the missing counts, just hopefully prevent there being as many cells with NA. To do this we add the area unit codes and aggregate (sum) the count variables ignoring the NA. 
+
+This will lead to some error (undercounting). Note also that we cannot meaninguly aggregate the income variables in this way.
+
+```{r aggToAU}
+mb2013DT <- areasDT[mb2013DT]
+
+au2013DT <- mb2013DT[, .(value = sum(value, na.rm = TRUE)), keyby = .(AU2013_code,AU2013_label,REGC2013_label, variable,year)]
+
+message("That gives ", dkUtils::tidyNum(uniqueN(au2013DT$AU2013_code)), " area units.")
+
+summary(au2013DT)
+
+message("Remember that some values might be (median) income values not counts... so we should not have summed them",
+        "NAs are what used to be ..C or *")
+nr <- nrow(au2013DT)
+nna <- nrow(au2013DT[is.na(value)])
+pcna <- 100*(nna/nr)
+message("Of the ", dkUtils::tidyNum(nrow(au2013DT)), " value cells ", round(pcna,2), "% are NA...")

-```{r oceanicRegionsList}
-kableExtra::kable(caption = "Oceanic egions", 
-                  oceanicRegionsDT[, .(Area_Code_and_Description,
-                                privDwellings2001 = `2001_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2006 = `2006_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2013 = `2013_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`)][order(oceanicRegionsDT$codeInt)])
 ```

+```{r totalsTestsAU}
+# create test dt for simplicity
+tdt <- au2013DT[variable == "occupied_private_dwelling_type_Total_occupied_private_dwellings"|
+                  variable == "total_families_in_occupied_private_dwellings"|
+                  variable == "total_households_in_occupied_private_dwellings"]
+st <- tdt[,.(Total = sum(value)), keyby = .(year, variable)]

-```{r districtsList}
-kableExtra::kable(caption = "Districts & Cities", 
-                  districtsDT[, .(Area_Code_and_Description,
-                                privDwellings2001 = `2001_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2006 = `2006_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`,
-                                privDwellings2013 = `2013_Census_occupied_private_dwelling_type_Total_occupied_private_dwellings`)][order(districtsDT$codeInt)])
+kableExtra::kable(st, caption = "Total counts for selection of 'totals' - area units") %>%
+  kable_styling()
 ```

+## Aggregation to IMD zones
+
+The NZ Index of Mutiple Deprivation uses datazones which are aggregates of meshblocks...
+
+
 # About

+
 ```{r check runtime}
 t <- proc.time() - startTime

@@ -177,10 +399,25 @@ Analysis completed in `r elapsed` seconds ( `r round(elapsed/60,2)` minutes) usi

 # Annex: Codebooks

+These are run on the original wide format files so each variable can be interrogated.
 ## Dwellings

 ```{r dwellingsCb}
-Hmisc::describe(nzCensusDwellingsDF)
+Hmisc::describe(dwDT)
+```
+
+
+## Families
+
+```{r familiesCb}
+Hmisc::describe(faDT)
 ```

+## Households
+
+```{r householdsCb}
+Hmisc::describe(hhDT)
+```
+
+
 # References
\ No newline at end of file
--- a/dataProcessing/nzCensus2013Test.html
+++ b/dataProcessing/nzCensus2013Test.html
--- a/dataProcessing/nzCensus2013Test_files/bootstrap-3.3.5/css/darkly.min.css
+++ b/dataProcessing/nzCensus2013Test_files/bootstrap-3.3.5/css/darkly.min.css
--- a/dataProcessing/nzCensus2013Test_files/figure-html/regionDwellingsPlot-1.png
+++ b/dataProcessing/nzCensus2013Test_files/figure-html/regionDwellingsPlot-1.png
--- a/dataProcessing/nzCensus2013Test_files/figure-html/totalsTests-1.png
+++ b/dataProcessing/nzCensus2013Test_files/figure-html/totalsTests-1.png
--- a/dataProcessing/nzCensus2013Test_files/figure-html/totalsTestsByRegion-1.png
+++ b/dataProcessing/nzCensus2013Test_files/figure-html/totalsTestsByRegion-1.png
--- a/dataProcessing/nzCensus2013Test_files/kePrint-0.0.1/kePrint.js
+++ b/dataProcessing/nzCensus2013Test_files/kePrint-0.0.1/kePrint.js
+$(document).ready(function(){
+    if (typeof $('[data-toggle="tooltip"]').tooltip === 'function') {
+        $('[data-toggle="tooltip"]').tooltip();
+    }
+    if ($('[data-toggle="popover"]').popover === 'function') {
+        $('[data-toggle="popover"]').popover();
+    }
+});