Ben Anderson
--- a/Rmd/cleaningFeederData.Rmd

+ 28

− 25
+++ b/Rmd/cleaningFeederData.Rmd

 @@ -9,25 +9,24 @@ author: '`r params$authors`'
 date: 'Last run at: `r Sys.time()`'
 output:
  bookdown::html_document2:
-    code_folding: hide
+    self_contained: TRUE
    fig_caption: yes
+    code_folding: hide
    number_sections: yes
-    self_contained: yes
-    toc: yes
-    toc_depth: 3
-    toc_float: yes
-  bookdown::word_document2:
-    fig_caption: yes
    toc: yes
    toc_depth: 2
+    toc_float: TRUE
  bookdown::pdf_document2:
    fig_caption: yes
-    keep_tex: yes
+    number_sections: yes
+  bookdown::word_document2:
+    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_depth: 2
+    fig_width: 5
 always_allow_html: yes
-bibliography: '`r path.expand("~/bibliography.bib")`'
+bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
 ---

 ```{r setup}
 @@ -73,8 +72,12 @@ Loaded data from `r dFile`... (using drake)

 ```{r loadData}
 origDataDT <- drake::readd(origData) # readd the drake object
-head(origDataDT)
+
 uniqDataDT <- drake::readd(uniqData) # readd the drake object
+
+kableExtra::kable(head(origDataDT), digits = 2,
+                  caption = "Counts per feeder (long table)") %>%
+  kable_styling()
 ```

 Check data prep worked OK.
 @@ -82,8 +85,8 @@ Check data prep worked OK.
 ```{r dataPrep}
 # check
 t <- origDataDT[, .(nObs = .N,
-                  firstDate = min(rDateTime),
-                  lastDate = max(rDateTime),
+                  firstDate = min(rDateTime, na.rm = TRUE),
+                  lastDate = max(rDateTime, na.rm = TRUE),
                  meankW = mean(kW, na.rm = TRUE)
 ), keyby = .(region, feeder_ID)]

 @@ -104,7 +107,7 @@ message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicate
 pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
 message("That's ", round(pc,2), "%")

-feederDT <- uniqDataDT # use dt with no duplicates
+feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates, dropping rows with missing rDateTime
 origDataDT <- NULL # save memory
 ```

 @@ -114,7 +117,7 @@ So we remove the duplicates...

 Try aggregated demand profiles of mean kW by season and feeder and day of the week... Remove the legend so we can see the plot.

-```{r kwProfiles}
+```{r kwProfiles, fig.width=8}
 plotDT <- feederDT[, .(meankW = mean(kW),
                       nObs = .N), keyby = .(rTime, season, feeder_ID, rDoW)]

 @@ -131,7 +134,7 @@ Is that what we expect?

 Number of observations per feeder per day - gaps will be visible (totally missing days) as will low counts (partially missing days) - we would expect 24 * 4... Convert this to a % of expected...

-```{r basicCountTile, fig.height=10}
+```{r basicCountTile, fig.height=10, fig.width=8}
 plotDT <- feederDT[, .(nObs = .N), keyby = .(rDate, feeder_ID)]
 plotDT[, propExpected := nObs/(24*4)]

 @@ -148,7 +151,7 @@ This is not good. There are both gaps (missing days) and partial days. **Lots**

 What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so we should get this many at best. How close do we get?

-```{r aggVisN}
+```{r aggVisN, fig.width=8}

 plotDT <- feederDT[, .(nObs = .N,
                       meankW = mean(kW)), keyby = .(rTime, rDate, season)]
 @@ -167,7 +170,7 @@ That really doesn't look too good. There are some very odd fluctuations in there

 What do the mean kW patterns look like per feeder per day?

-```{r basickWTile, fig.height=10}
+```{r basickWTile, fig.height=10, fig.width=8}
 plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)), keyby = .(rDate, feeder_ID)]

 ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = meankW)) +
 @@ -183,7 +186,7 @@ Missing data is even more clearly visible.

 What about mean kW across all feeders?

-```{r aggViskW}
+```{r aggViskW, fig.width=8}

 plotDT <- feederDT[, .(nObs = .N,
                       meankW = mean(kW)), keyby = .(rTime, rDate, season)]
 @@ -213,7 +216,7 @@ summary(dateTimesDT)

 Let's see how many unique feeders we have per dateTime. Surely we have at least one sending data each half-hour?

-```{r tileFeeders}
+```{r tileFeeders, fig.width=8}
 ggplot2::ggplot(dateTimesDT, aes(x = rDate, y =  rTime, fill = nFeeders)) +
  geom_tile() +
  scale_fill_viridis_c() +
 @@ -224,7 +227,7 @@ No. As we suspected from the previous plots, we clearly have some dateTimes wher

 Are there time of day patterns? It looks like it...

-```{r missingProfiles}
+```{r missingProfiles, fig.width=8}
 dateTimesDT[, rYear := lubridate::year(rDateTime)]
 plotDT <- dateTimesDT[, .(meanN = mean(nFeeders),
                          meankW = mean(meankW)), keyby = .(rTime, season, rYear)]
 @@ -240,7 +243,7 @@ Oh yes. After 2003. Why?

 What about the kW?

-```{r kWProfiles}
+```{r kWProfiles, fig.width=8}

 ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
  geom_line() +
 @@ -251,7 +254,7 @@ ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +

 Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up so does the number of observations per hour... although this could just be a correlation with low demand periods (night time?)

-```{r compareProfiles}
+```{r compareProfiles, fig.width=8}
 ggplot2::ggplot(plotDT, aes(y = meankW, x = meanN, colour = season)) +
  geom_point() +
  facet_wrap(rYear ~ .) +
 @@ -278,7 +281,7 @@ The wide dataset has a count of NAs per row (dateTime) from which we infer how m

 ```{r}
 wDT <- drake::readd(wideData) # back from the drake
-head(wDT)
+names(wDT)
 ```

 If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when _all_ feeders have _all_ data (since it would be the mean of all the '25's).
 @@ -307,13 +310,13 @@ nrow(aggDT[propExpected == 1])

 If we plot the mean then we will see which days get closest to having a full dataset.

-```{r bestDaysMean}
+```{r bestDaysMean, fig.width=8}
 ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = meanOK)) + geom_point()
 ```

 Re-plot by the % of expected if we assume we _should_ have 25 feeders * 24 hours * 4 per hour (will be the same shape):

-```{r bestDaysProp}
+```{r bestDaysProp, fig.width=8}
 ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = 100*propExpected)) + geom_point() +
  labs(y = "%")
 ```