merge a few edits

Merged Ben Anderson requested to merge ba1e12/datacleaning:master into master
5 files changed: +2665 -27
+28 -25
@@ -9,25 +9,24 @@ author: '`r params$authors`'
date: 'Last run at: `r Sys.time()`'
output:
bookdown::html_document2:
code_folding: hide
self_contained: TRUE
fig_caption: yes
code_folding: hide
number_sections: yes
self_contained: yes
toc: yes
toc_depth: 3
toc_float: yes
bookdown::word_document2:
fig_caption: yes
toc: yes
toc_depth: 2
toc_float: TRUE
bookdown::pdf_document2:
fig_caption: yes
keep_tex: yes
number_sections: yes
bookdown::word_document2:
fig_caption: yes
number_sections: yes
toc: yes
toc_depth: 2
fig_width: 5
always_allow_html: yes
bibliography: '`r path.expand("~/bibliography.bib")`'
bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
---
```{r setup}
@@ -73,8 +72,12 @@ Loaded data from `r dFile`... (using drake)
```{r loadData}
origDataDT <- drake::readd(origData) # readd the drake object
head(origDataDT)
uniqDataDT <- drake::readd(uniqData) # readd the drake object
kableExtra::kable(head(origDataDT), digits = 2,
caption = "Counts per feeder (long table)") %>%
kable_styling()
```
Check data prep worked OK.
@@ -82,8 +85,8 @@ Check data prep worked OK.
```{r dataPrep}
# check
t <- origDataDT[, .(nObs = .N,
firstDate = min(rDateTime),
lastDate = max(rDateTime),
firstDate = min(rDateTime, na.rm = TRUE),
lastDate = max(rDateTime, na.rm = TRUE),
meankW = mean(kW, na.rm = TRUE)
), keyby = .(region, feeder_ID)]
@@ -104,7 +107,7 @@ message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicate
pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message("That's ", round(pc,2), "%")
feederDT <- uniqDataDT # use dt with no duplicates
feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT <- NULL # save memory
```
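The two `na.rm = TRUE` changes and the `!is.na(rDateTime)` filter address the same problem: some dateTimes are missing. A minimal sketch of the effect on toy data follows. The values and the `toy*` names are made up for illustration, and it assumes duplicates are exact repeated rows, which may not be how the drake plan builds `uniqData`.

```r
library(data.table)

# Toy data: hypothetical values, not taken from the real feeder dataset
toyDT <- data.table(
  feeder_ID = c("A", "A", "A", "B", "B"),
  rDateTime = as.POSIXct(c("2003-01-01 00:00:00", "2003-01-01 00:00:00",
                           "2003-01-01 00:15:00", NA, "2003-01-01 00:00:00"),
                         tz = "UTC"),
  kW = c(1.2, 1.2, 1.4, 0.9, 1.1)
)

# Without na.rm = TRUE a single missing dateTime makes min()/max() return NA
naiveFirst <- min(toyDT$rDateTime)
safeFirst <- min(toyDT$rDateTime, na.rm = TRUE)

# Assumption: a duplicate is a row identical on every column
uniqToyDT <- unique(toyDT)

# Drop rows whose dateTime is missing, as in the feederDT line above
cleanToyDT <- uniqToyDT[!is.na(rDateTime)]

nDuplicates <- nrow(toyDT) - nrow(uniqToyDT)
pcDuplicates <- 100 * nDuplicates / nrow(toyDT)
message("Duplicates: ", nDuplicates, " (", round(pcDuplicates, 2), "%)")
```

Filtering the NA dateTimes once up front means the later `keyby = .(rDate, ...)` summaries do not pick up an NA group.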
@@ -114,7 +117,7 @@ So we remove the duplicates...
Try aggregated demand profiles of mean kW by season and feeder and day of the week... Remove the legend so we can see the plot.
```{r kwProfiles}
```{r kwProfiles, fig.width=8}
plotDT <- feederDT[, .(meankW = mean(kW),
nObs = .N), keyby = .(rTime, season, feeder_ID, rDoW)]
@@ -131,7 +134,7 @@ Is that what we expect?
Number of observations per feeder per day - gaps will be visible (totally missing days) as will low counts (partially missing days) - we would expect 24 * 4 = 96 per feeder per day... Convert this to a % of expected...
```{r basicCountTile, fig.height=10}
```{r basicCountTile, fig.height=10, fig.width=8}
plotDT <- feederDT[, .(nObs = .N), keyby = .(rDate, feeder_ID)]
plotDT[, propExpected := nObs/(24*4)]
@@ -148,7 +151,7 @@ This is not good. There are both gaps (missing days) and partial days. **Lots**
What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so we should get this many at best. How close do we get?
```{r aggVisN}
```{r aggVisN, fig.width=8}
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
@@ -167,7 +170,7 @@ That really doesn't look too good. There are some very odd fluctuations in there
What do the mean kW patterns look like per feeder per day?
```{r basickWTile, fig.height=10}
```{r basickWTile, fig.height=10, fig.width=8}
plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)), keyby = .(rDate, feeder_ID)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = meankW)) +
@@ -183,7 +186,7 @@ Missing data is even more clearly visible.
What about mean kW across all feeders?
```{r aggViskW}
```{r aggViskW, fig.width=8}
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
@@ -213,7 +216,7 @@ summary(dateTimesDT)
Let's see how many unique feeders we have per dateTime. Surely we have at least one sending data each half-hour?
```{r tileFeeders}
```{r tileFeeders, fig.width=8}
ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
geom_tile() +
scale_fill_viridis_c() +
@@ -224,7 +227,7 @@ No. As we suspected from the previous plots, we clearly have some dateTimes wher
Are there time of day patterns? It looks like it...
```{r missingProfiles}
```{r missingProfiles, fig.width=8}
dateTimesDT[, rYear := lubridate::year(rDateTime)]
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders),
meankW = mean(meankW)), keyby = .(rTime, season, rYear)]
@@ -240,7 +243,7 @@ Oh yes. After 2003. Why?
What about the kW?
```{r kWProfiles}
```{r kWProfiles, fig.width=8}
ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
geom_line() +
@@ -251,7 +254,7 @@ ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up so does the number of observations per hour... although this could just be a correlation with low demand periods (night time?)
```{r compareProfiles}
```{r compareProfiles, fig.width=8}
ggplot2::ggplot(plotDT, aes(y = meankW, x = meanN, colour = season)) +
geom_point() +
facet_wrap(rYear ~ .) +
@@ -278,7 +281,7 @@ The wide dataset has a count of NAs per row (dateTime) from which we infer how m
```{r}
wDT <- drake::readd(wideData) # back from the drake
head(wDT)
names(wDT)
```
If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when _all_ feeders have _all_ data (since it would be the mean of all the '25's).
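As a rough illustration of that logic, here is a sketch on toy data. The column names `nNA` and `nFeedersOK` and the `toy*` objects are hypothetical (the real names in `wDT` may differ), and it assumes 25 feeders and one row per quarter-hour.

```r
library(data.table)

nFeeders <- 25          # number of feeders, as stated in the text
slotsPerDay <- 24 * 4   # assumption: one wide row per quarter-hour

# Toy wide-style table: day 1 is complete, day 2 has 5 feeders missing
# in every second slot (made-up values)
toyWideDT <- data.table(
  rDate = rep(as.Date(c("2003-01-01", "2003-01-02")), each = slotsPerDay),
  nNA = c(rep(0L, slotsPerDay), rep(c(0L, 5L), times = slotsPerDay / 2))
)

# Feeders reporting in each slot = all feeders minus the NA count
toyWideDT[, nFeedersOK := nFeeders - nNA]

# Daily mean: exactly 25 only if every slot on that day had all 25 feeders
toyAggDT <- toyWideDT[, .(meanOK = mean(nFeedersOK)), keyby = rDate]
completeDays <- toyAggDT[meanOK == nFeeders]
print(toyAggDT)
```

Any slot with even one missing feeder pulls the daily mean below 25, so `meanOK == 25` is a strict test for a fully complete day.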
@@ -307,13 +310,13 @@ nrow(aggDT[propExpected == 1])
If we plot the mean then we will see which days get closest to having a full dataset.
```{r bestDaysMean}
```{r bestDaysMean, fig.width=8}
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = meanOK)) + geom_point()
```
Re-plot by the % of expected if we assume we _should_ have 25 feeders * 24 hours * 4 per hour (will be the same shape):
```{r bestDaysProp}
```{r bestDaysProp, fig.width=8}
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = 100*propExpected)) + geom_point() +
labs(y = "%")
```
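Why the two plots should have the same shape: if every day has all 24 * 4 rows in the wide table, the proportion of expected observations is just the daily mean divided by 25. A small sketch with made-up counts follows. `nFeedersOK`, `rescaledMean` and the `toy*` objects are hypothetical names, and the identity only holds under the stated assumption that no dateTime rows are missing entirely.

```r
library(data.table)

nFeeders <- 25
slotsPerDay <- 24 * 4

# Three toy days with a deterministic, made-up count of reporting feeders
toyDT <- data.table(
  rDate = rep(as.Date("2003-01-01") + 0:2, each = slotsPerDay),
  nFeedersOK = rep_len(15:25, 3 * slotsPerDay)
)

toyAggDT <- toyDT[, .(
  meanOK = mean(nFeedersOK),
  propExpected = sum(nFeedersOK) / (nFeeders * slotsPerDay)
), keyby = rDate]

# With all slots present, sum/(25 * 96) equals mean/25
toyAggDT[, rescaledMean := meanOK / nFeeders]
print(toyAggDT)
```

If some days had fewer than 96 rows, `propExpected` would fall below `meanOK / 25` on those days and the two plots would no longer match.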