Commit 4d734d62 authored by B.Anderson

fixed bib (was absent); html no longer self-contained so plots exist separately; floating ToC still broken in html (why?); pdf fails (probably table too big)
parent c11387e4
3 merge requests: !3 merge a few edits, !2 fixed pdf build, !1 Re run ellis full data
This commit is part of merge request !3.
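The commit message notes that the PDF build still fails, probably because a table is too wide for the page. If the culprit is the kableExtra table added in this commit, one common fix is to let LaTeX scale the table down to the page width; the following is only a sketch under that assumption (the `latex_options` take effect when knitting to PDF and are ignored in HTML).

```r
# Sketch only: shrink an over-wide kable to the page width in PDF output.
# Assumes kableExtra is attached (it re-exports the magrittr %>% pipe) and that
# origDataDT exists as in the loadData chunk below.
kableExtra::kable(head(origDataDT), digits = 2,
                  caption = "Counts per feeder (long table)") %>%
  kable_styling(latex_options = "scale_down")
```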
@@ -9,25 +9,24 @@ author: '`r params$authors`'
 date: 'Last run at: `r Sys.time()`'
 output:
   bookdown::html_document2:
-    code_folding: hide
+    self_contained: TRUE
     fig_caption: yes
+    code_folding: hide
     number_sections: yes
-    self_contained: yes
-    toc: yes
-    toc_depth: 3
-    toc_float: yes
-  bookdown::word_document2:
-    fig_caption: yes
     toc: yes
     toc_depth: 2
+    toc_float: TRUE
   bookdown::pdf_document2:
     fig_caption: yes
-    keep_tex: yes
+    number_sections: yes
+  bookdown::word_document2:
+    fig_caption: yes
     number_sections: yes
     toc: yes
     toc_depth: 2
+    fig_width: 5
 always_allow_html: yes
-bibliography: '`r path.expand("~/bibliography.bib")`'
+bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
 ---
 ```{r setup}
@@ -73,8 +72,12 @@ Loaded data from `r dFile`... (using drake)
 ```{r loadData}
 origDataDT <- drake::readd(origData) # readd the drake object
-head(origDataDT)
 uniqDataDT <- drake::readd(uniqData) # readd the drake object
+
+kableExtra::kable(head(origDataDT), digits = 2,
+                  caption = "Counts per feeder (long table)") %>%
+  kable_styling()
 ```
 Check data prep worked OK.
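For readers less familiar with drake, the chunk above pulls previously built targets out of drake's cache: `drake::readd()` returns a target's value, while `drake::loadd()` assigns it into the calling environment. A minimal illustration:

```r
# Two standard ways to retrieve a built drake target from the cache
origDataDT <- drake::readd(origData)  # returns the cached value
drake::loadd(uniqData)                # creates 'uniqData' in this environment
```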
@@ -82,8 +85,8 @@ Check data prep worked OK.
 ```{r dataPrep}
 # check
 t <- origDataDT[, .(nObs = .N,
-                    firstDate = min(rDateTime),
-                    lastDate = max(rDateTime),
+                    firstDate = min(rDateTime, na.rm = TRUE),
+                    lastDate = max(rDateTime, na.rm = TRUE),
                     meankW = mean(kW, na.rm = TRUE)
                     ), keyby = .(region, feeder_ID)]
@@ -104,7 +107,7 @@ message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicate
 pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
 message("That's ", round(pc,2), "%")
-feederDT <- uniqDataDT # use dt with no duplicates
+feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
 origDataDT <- NULL # save memory
 ```
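The changed line now also drops rows with a missing `rDateTime`. An optional sanity check (a sketch using the objects from the chunk above) would report how many rows that extra filter removes:

```r
# Optional check: how many rows have a missing rDateTime before filtering?
nNA <- nrow(uniqDataDT[is.na(rDateTime)])
message("Dropping ", nNA, " rows with missing rDateTime (",
        round(100 * nNA / nrow(uniqDataDT), 2), "%)")
```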
@@ -114,7 +117,7 @@ So we remove the duplicates...
 Try aggregated demand profiles of mean kW by season and feeder and day of the week... Remove the legend so we can see the plot.
-```{r kwProfiles}
+```{r kwProfiles, fig.width=8}
 plotDT <- feederDT[, .(meankW = mean(kW),
                        nObs = .N), keyby = .(rTime, season, feeder_ID, rDoW)]
@@ -131,7 +134,7 @@ Is that what we expect?
 Number of observations per feeder per day - gaps will be visible (totally missing days) as will low counts (partially missing days) - we would expect 24 * 4... Convert this to a % of expected...
-```{r basicCountTile, fig.height=10}
+```{r basicCountTile, fig.height=10, fig.width=8}
 plotDT <- feederDT[, .(nObs = .N), keyby = .(rDate, feeder_ID)]
 plotDT[, propExpected := nObs/(24*4)]
@@ -148,7 +151,7 @@ This is not good. There are both gaps (missing days) and partial days. **Lots**
 What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so we should get this many at best. How close do we get?
-```{r aggVisN}
+```{r aggVisN, fig.width=8}
 plotDT <- feederDT[, .(nObs = .N,
                        meankW = mean(kW)), keyby = .(rTime, rDate, season)]
@@ -167,7 +170,7 @@ That really doesn't look too good. There are some very odd fluctuations in there
 What do the mean kW patterns look like per feeder per day?
-```{r basickWTile, fig.height=10}
+```{r basickWTile, fig.height=10, fig.width=8}
 plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)), keyby = .(rDate, feeder_ID)]
 ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = meankW)) +
@@ -183,7 +186,7 @@ Missing data is even more clearly visible.
 What about mean kW across all feeders?
-```{r aggViskW}
+```{r aggViskW, fig.width=8}
 plotDT <- feederDT[, .(nObs = .N,
                        meankW = mean(kW)), keyby = .(rTime, rDate, season)]
@@ -213,7 +216,7 @@ summary(dateTimesDT)
 Let's see how many unique feeders we have per dateTime. Surely we have at least one sending data each half-hour?
-```{r tileFeeders}
+```{r tileFeeders, fig.width=8}
 ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
   geom_tile() +
   scale_fill_viridis_c() +
@@ -224,7 +227,7 @@ No. As we suspected from the previous plots, we clearly have some dateTimes wher
 Are there time of day patterns? It looks like it...
-```{r missingProfiles}
+```{r missingProfiles, fig.width=8}
 dateTimesDT[, rYear := lubridate::year(rDateTime)]
 plotDT <- dateTimesDT[, .(meanN = mean(nFeeders),
                           meankW = mean(meankW)), keyby = .(rTime, season, rYear)]
@@ -240,7 +243,7 @@ Oh yes. After 2003. Why?
 What about the kW?
-```{r kWProfiles}
+```{r kWProfiles, fig.width=8}
 ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
   geom_line() +
@@ -251,7 +254,7 @@ ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
 Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up so does the number of observations per hour... although this could just be a correlation with low demand periods (night time?)
-```{r compareProfiles}
+```{r compareProfiles, fig.width=8}
 ggplot2::ggplot(plotDT, aes(y = meankW, x = meanN, colour = season)) +
   geom_point() +
   facet_wrap(rYear ~ .) +
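The scatter plots give a visual answer to the correlation question above; if a numeric check is wanted, the same `plotDT` could be summarised with Pearson correlations per year and season (a sketch, assuming `plotDT` as built in the chunk above):

```r
# Correlation between mean observation count and mean kW, by year and season
plotDT[, .(pearsonR = cor(meanN, meankW, use = "complete.obs")),
       keyby = .(rYear, season)]
```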
@@ -278,7 +281,7 @@ The wide dataset has a count of NAs per row (dateTime) from which we infer how m
 ```{r}
 wDT <- drake::readd(wideData) # back from the drake
-head(wDT)
+names(wDT)
 ```
 If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when _all_ feeders have _all_ data (since it would be the mean of all the '25's).
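The code that derives the per-row feeder count from the NA pattern sits in a part of the diff not shown here, so the following is only an illustrative sketch of the idea described above; the column names (the feeder value columns, `nFeedersReporting`, `meanOK`) are assumptions, not necessarily the project's actual names:

```r
# Illustrative only: infer how many feeders report at each dateTime from the
# NA pattern in the wide table, then average per day (25 = a complete day)
feederCols <- setdiff(names(wDT), c("rDateTime", "rDate", "rTime", "season"))
wDT[, nFeedersReporting := rowSums(!is.na(.SD)), .SDcols = feederCols]
aggDT <- wDT[, .(meanOK = mean(nFeedersReporting)), keyby = rDate]
```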
@@ -307,13 +310,13 @@ nrow(aggDT[propExpected == 1])
 If we plot the mean then we will see which days get closest to having a full dataset.
-```{r bestDaysMean}
+```{r bestDaysMean, fig.width=8}
 ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = meanOK)) + geom_point()
 ```
 Re-plot by the % of expected if we assume we _should_ have 25 feeders * 24 hours * 4 per hour (will be the same shape):
-```{r bestDaysProp}
+```{r bestDaysProp, fig.width=8}
 ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = 100*propExpected)) + geom_point() +
   labs(y = "%")
 ```
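Spelling out the "expected" denominator used in the re-plot above (25 feeders, each reporting every 15 minutes):

```r
# Expected observations per day if every feeder reported every quarter hour
nFeeders <- 25
obsPerFeederDay <- 24 * 4                     # 96 quarter-hourly readings
expectedPerDay <- nFeeders * obsPerFeederDay  # 2,400 observations per day
```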
......
This diff is collapsed.
@@ -154,8 +154,9 @@ my_plan <- drake::drake_plan(
   wideData = toWide(uniqData),
   saveLong = saveData(uniqData, "L"), # doesn't actually return anything
   saveWide = saveData(wideData, "W"), # doesn't actually return anything
-  htmlOut = makeReport(rmdFile, version, "html"), # html output
-  pdfOut = makeReport(rmdFile, version, "pdf") # pdf - must be some way to do this without re-running the whole thing
+  # pdf output fails
+  #pdfOut = makeReport(rmdFile, version, "pdf"), # pdf - must be some way to do this without re-running the whole thing
+  htmlOut = makeReport(rmdFile, version, "html") # html output
 )
 # see https://books.ropensci.org/drake/projects.html#usage
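On the inline comment that there "must be some way to do this without re-running the whole thing": because drake caches completed targets, re-introducing `pdfOut` and calling `make()` again would only rebuild the out-of-date report target, or the report wrapper can simply be called by hand once the plan has run. A sketch, assuming `makeReport()` just knits `rmdFile` to the requested format:

```r
# Two possible ways to get the PDF without rebuilding the data targets:
# 1) one-off manual render after the plan has completed
makeReport(rmdFile, version, "pdf")

# 2) keep pdfOut in the plan; drake skips targets that are already up to date
drake::make(my_plan)
```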
......
##############
# R packages
@Manual{baseR,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2016},
url = {https://www.R-project.org/},
}
@Manual{bookdown,
title = {bookdown: Authoring Books and Technical Documents with R Markdown},
author = {Yihui Xie},
year = {2018},
note = {R package version 0.9},
url = {https://github.com/rstudio/bookdown},
}
@Manual{data.table,
title = {data.table: Extension of Data.frame},
author = {M Dowle and A Srinivasan and T Short and S Lianoglou with contributions from R Saporta and E Antonyan},
year = {2015},
note = {R package version 1.9.6},
url = {https://CRAN.R-project.org/package=data.table},
}
@Article{drake,
title = {The drake R package: a pipeline toolkit for reproducibility and high-performance computing},
author = {William Michael Landau},
journal = {Journal of Open Source Software},
year = {2018},
volume = {3},
number = {21},
url = {https://doi.org/10.21105/joss.00550},
}
@Book{ggplot2,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2009},
isbn = {978-0-387-98140-6},
url = {http://ggplot2.org},
}
@Manual{here,
title = {here: A Simpler Way to Find Your Files},
author = {Kirill Müller},
year = {2017},
note = {R package version 0.1},
url = {https://CRAN.R-project.org/package=here},
}
@Manual{kableExtra,
title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
author = {Hao Zhu},
year = {2019},
note = {R package version 1.0.1},
url = {https://CRAN.R-project.org/package=kableExtra},
}
@Manual{knitr,
title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
author = {Yihui Xie},
year = {2016},
url = {https://CRAN.R-project.org/package=knitr},
}
@Article{lubridate,
title = {Dates and Times Made Easy with {lubridate}},
author = {Garrett Grolemund and Hadley Wickham},
journal = {Journal of Statistical Software},
year = {2011},
volume = {40},
number = {3},
pages = {1--25},
url = {http://www.jstatsoft.org/v40/i03/},
}
@Manual{rmarkdown,
title = {rmarkdown: Dynamic Documents for R},
author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},
year = {2020},
note = {R package version 2.1},
url = {https://github.com/rstudio/rmarkdown},
}
@Book{rmarkdownBook,
title = {R Markdown: The Definitive Guide},
author = {Yihui Xie and J.J. Allaire and Garrett Grolemund},
publisher = {Chapman and Hall/CRC},
address = {Boca Raton, Florida},
year = {2018},
note = {ISBN 9781138359338},
url = {https://bookdown.org/yihui/rmarkdown},
}
@Manual{skimr,
title = {skimr: skimr},
author = {Eduardo {Arino de la Rubia} and Hao Zhu and Shannon Ellis and Elin Waring and Michael Quinn},
year = {2017},
note = {R package version 1.0},
url = {https://github.com/ropenscilabs/skimr},
}
@Manual{tidyverse,
title = {tidyverse: Easily Install and Load 'Tidyverse' Packages},
author = {Hadley Wickham},
year = {2017},
note = {R package version 1.1.1},
url = {https://CRAN.R-project.org/package=tidyverse},
}
@Manual{viridis,
title = {viridis: Default Color Maps from 'matplotlib'},
author = {Simon Garnier},
year = {2018},
note = {R package version 0.5.1},
url = {https://CRAN.R-project.org/package=viridis},
}
\ No newline at end of file
This diff is collapsed.