We have some feeder data. There seem to be NAs and missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame.
```{r}
ggplot2::ggplot(dt, aes(x = rDate, y = rTime, fill = kW)) +
  geom_tile() +
  facet_grid(sub_region ~ .) +
  scale_fill_viridis_c() +
  labs(caption = "All kW values by feeder, time (y) and date (x)")
```
oooh.
Try aggregating...
```{r aggVis}
dt[, rHour := lubridate::hour(rDateTime)]
plotDT <- dt[, .(mean_kW = mean(kW),
                 nObs = .N),
             keyby = .(rHour, rDate, sub_region)]

ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = mean_kW)) +
  geom_tile() +
  scale_fill_viridis_c() +
  facet_grid(sub_region ~ .) +
  labs(caption = "Mean kW per hour")

ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = nObs)) +
  geom_tile() +
  scale_fill_viridis_c() +
  facet_grid(sub_region ~ .) +
  labs(caption = "Number of obs per hour")

ggplot2::ggplot(plotDT, aes(x = nObs, y = mean_kW, group = nObs)) +
  geom_boxplot() +
  facet_grid(. ~ sub_region) +
  labs(caption = "Mean kW per hour by number of observations contributing")
```
# Which days have the 'least' missing?
This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then checking whether the dateTimes are contiguous.
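A minimal sketch of that check, assuming half-hourly data (so 48 dateTimes per complete day - adjust if the metering interval differs):

```{r completenessCheck}
# Count observations per dateTime: a dateTime with no rows at all is completely
# missing, so we also check that the observed dateTimes are contiguous
obsDT <- dt[, .(nObs = .N), keyby = .(rDateTime)]

# Gap (in minutes) to the previous dateTime; anything > 30 implies missing dateTimes
obsDT[, gapMins := as.numeric(difftime(rDateTime, shift(rDateTime), units = "mins"))]
table(obsDT$gapMins)

# Completeness per day: observed half-hourly dateTimes out of an expected 48
# (48 is an assumption - change it if the interval differs)
dailyDT <- dt[, .(nDateTimes = uniqueN(rDateTime)), keyby = .(rDate)]
dailyDT[, pctComplete := 100 * nDateTimes / 48]

# 'Best' (most complete) days within a day-of-the-week frame; season and year
# (or sub_region) could be added to the grouping in the same way
dailyDT[, wkDay := lubridate::wday(rDate, label = TRUE)]
dailyDT[order(-pctComplete), head(.SD, 3), by = wkDay]
```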
* we appear to have the most feeders reporting data at 'peak' times
* we have a lot of missing dateTimes between 00:30 and 05:00
If the monitors were set to only collect data when the power (or Wh in a given time frame) was above a given threshold, then it would look like this...
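A quick way to eyeball that hypothesis is to blank out readings below an assumed threshold and re-draw the tile plot; the 1 kW cut-off used here is purely illustrative, not a known monitor setting:

```{r thresholdSketch}
# Purely illustrative threshold - not a known monitor setting
thresholdKW <- 1
threshDT <- copy(dt)[kW < thresholdKW, kW := NA_real_]

ggplot2::ggplot(threshDT, aes(x = rDate, y = rTime, fill = kW)) +
  geom_tile() +
  facet_grid(sub_region ~ .) +
  scale_fill_viridis_c() +
  labs(caption = "kW values with readings below the assumed threshold blanked out")
```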
# Runtime
```{r checkRuntime, include=FALSE}
t <- proc.time() - startTime
elapsed <- t[[3]]
```
Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` minutes) using [knitr](https://cran.r-project.org/package=knitr) in [RStudio](http://www.rstudio.com) with `r R.version.string` running on `r R.version$platform`.
# R environment
## R packages used
* base R [@baseR]
* bookdown [@bookdown]
* data.table [@data.table]
* ggplot2 [@ggplot2]
* kableExtra [@kableExtra]
* knitr [@knitr]
* lubridate [@lubridate]
* rmarkdown [@rmarkdown]
* skimr [@skimr]
* XML [@XML]
## Session info
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
# The data cleaning code
(c) Mikey Harper :-)
Starts here:
<hr>
Scripts used to clean and merge substation data.
```{r setupMH, include=FALSE, eval=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(here)
library(tidyverse)
```
## Input files
The analysis will first look at the primary data. There are different types of files, which refer to different parameters. Different search terms are used to extract these:
```{r, eval=FALSE}
# Find files with AMPS. Exclude files which contain DI~CO
# Load the CSV. There were some tab-separated files saved with a .csv extension, which confuse the search. Therefore, if the data loads incorrectly (only having a single column), the code will try to load it as a TSV.
library(DBI)

# Write the combined Amps data to a local SQLite database
con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")
dbListTables(con)
dbWriteTable(con, "amps", Amps)
dbListTables(con)
```
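The comments above refer to a file search and a CSV/TSV fallback loader; a minimal sketch of how that might look, with `dataPath` as a hypothetical directory holding the raw monitoring files:

```{r, eval=FALSE}
# Find files with AMPS in the name; exclude files which contain DI~CO
# NB: dataPath is a hypothetical directory of the raw monitoring files
ampsFiles <- list.files(dataPath, pattern = "AMPS",
                        recursive = TRUE, full.names = TRUE)
ampsFiles <- ampsFiles[!grepl("DI~CO", ampsFiles, fixed = TRUE)]

loadAmps <- function(f){
  df <- readr::read_csv(f)
  # Some tab-separated files were saved with a .csv extension: if the comma
  # parse yields only one column, re-read the file as a TSV
  if (ncol(df) == 1) df <- readr::read_tsv(f)
  df
}
```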
## Querying the data
```{r, eval=FALSE}
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")

# Lazy reference to the 'amps' table; summarise the Value column by region
Amps_db <- tbl(con, "amps")

Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = TRUE),
            n = n(),
            sd = sd(Value, na.rm = TRUE),
            var = var(Value, na.rm = TRUE))
```
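Note that `tbl()` only builds a lazy query: the summary above is translated to SQL and run by SQLite when it is printed. To pull the results back into R as an ordinary data frame (e.g. for plotting), `dplyr::collect()` can be used:

```{r, eval=FALSE}
# Execute the query in SQLite and collect the summarised rows into R
regionSummary <- Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = TRUE), n = n()) %>%
  collect()
```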
## Round to Nearest N minutes
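The rounding itself can be done with `lubridate::round_date()`; a minimal sketch (the `Amps` data frame and its `DateTime` column are assumed names) before the full processing function:

```{r, eval=FALSE}
# Round each (assumed) POSIXct DateTime to the nearest 5 minutes
Amps <- Amps %>%
  dplyr::mutate(DateTime_5min = lubridate::round_date(DateTime, unit = "5 minutes"))
```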
```{r, eval=FALSE}
processAMPS_5mins <- function(filePath){

  message("Processing ", filePath)

  # 1st level: the file's parent directory name
  dirName_1 <- filePath %>%
    dirname() %>%
    basename()

  # 2nd level: the grandparent directory name
  dirName_2 <- filePath %>%
    dirname() %>%
    dirname() %>%
    basename()

  if (dirName_2 == "Primary"){
    dirName_2 <- dirName_1
    dirName_1 <- ""
  }

  # Load the CSV. There were some tab-separated files saved with a .csv extension, which confuse the search. Therefore, if the data loads incorrectly (only having a single column), the code will try to load it as a TSV.