Commit b29c4a30 authored by Ben Anderson

Merge branch 'reRunEllisFullData' into 'master'

Re-run ellis full data

See merge request !2
parents 1bec94da a65bba21
@@ -27,7 +27,6 @@ output:
toc: yes
toc_depth: 2
fig_width: 5
always_allow_html: yes
bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
---
@@ -64,7 +63,7 @@ We have some electricity substation feeder data that has been cleaned to give me
There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour & day of the week...
Code used to generate this report: https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd
The code used to generate this report is in: https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/
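To make the 'best day' selection concrete, here is a minimal sketch (an editor's illustration, not the repo's code) of how the most complete day in each feeder/season/day-of-week frame could be picked with data.table. `pickBestDays()`, the 96 expected quarter-hours per day and the simple month-to-season lookup are all assumptions for illustration:

```r
# sketch only: pick the most complete day per feeder/season/day-of-week frame
library(data.table)
library(lubridate)

pickBestDays <- function(dt){ # dt: feeder_ID, rDateTime, kW (15 min means)
  dt[, rDate := as.Date(rDateTime)]
  # observed quarter-hours per feeder per day vs the 96 we expect (24 h x 4)
  agg <- dt[!is.na(kW), .(nObs = .N), by = .(feeder_ID, rDate)]
  agg[, propExpected := nObs / 96]
  agg[, rDoW := lubridate::wday(rDate, label = TRUE)]
  # simple northern-hemisphere month -> season lookup (assumed)
  agg[, season := c("Winter", "Winter", "Spring", "Spring", "Spring",
                    "Summer", "Summer", "Summer", "Autumn", "Autumn",
                    "Autumn", "Winter")[lubridate::month(rDate)]]
  # order by completeness, keep the top day in each frame
  agg[order(-propExpected), .SD[1], by = .(feeder_ID, season, rDoW)]
}
```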
# Data prep
@@ -78,8 +77,7 @@ origDataDT <- drake::readd(origData) # readd the drake object
uniqDataDT <- drake::readd(uniqData) # readd the drake object
kableExtra::kable(head(origDataDT), digits = 2,
caption = "Counts per feeder (long table)") %>%
kable_styling()
caption = "First 6 rows of data")
```
Do a duplicate check by feeder_ID, dateTime & kW. In theory there should not be any.
@@ -89,16 +87,19 @@ message("Original data nrows: ", tidyNum(nrow(origDataDT)))
message("Unique data nrows: ", tidyNum(nrow(uniqDataDT)))
message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicates...")
nDups <- tidyNum(nrow(origDataDT) - nrow(uniqDataDT))
message("So we have ", tidyNum(nDups), " duplicates...")
pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message("That's ", round(pc,2), "%")
feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT <- NULL # save memory
```
There were `r tidyNum(nrow(origDataDT) - nrow(uniqDataDT))` duplicates - that's `r round(pc,2)` % of the observations loaded.
There were `r tidyNum(nDups)` duplicates - that's ~ `r round(pc,2)` % of the observations loaded.
So we remove the duplicates...
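A toy, self-contained version of the check above (made-up data, not the feeder data): only rows matching exactly on all three keys count as duplicates.

```r
library(data.table)
dt <- data.table(rDateTime = as.POSIXct("2020-01-01 00:15", tz = "UTC"),
                 feeder_ID = "f1",
                 kW = c(1.5, 1.5, 2.0))
uniqDT <- unique(dt, by = c("rDateTime", "feeder_ID", "kW"))
nrow(dt) - nrow(uniqDT)                    # 1 duplicate
100 * (nrow(dt) - nrow(uniqDT)) / nrow(dt) # ~33.3 %
```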
@@ -321,23 +322,19 @@ ggplot2::ggplot(aggDT, aes(x = rDate, colour = season,
aggDT[, rDoW := lubridate::wday(rDate, lab = TRUE)]
h <- head(aggDT[season == "Spring"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Spring days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Summer"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Summer days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Autumn"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Autumn days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Winter"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Winter days overall",
digits = 3) %>%
kable_styling()
digits = 3)
```
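The four per-season blocks above could also be collapsed into a loop; a sketch (not the committed code, and note that printing kables in a loop needs a `results='asis'` chunk):

```r
for (s in c("Spring", "Summer", "Autumn", "Winter")) {
  h <- head(aggDT[season == s][order(-propExpected)])
  print(kableExtra::kable(h, caption = paste("Best", s, "days overall"),
                          digits = 3))
}
```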
# Summary
# basic _drake.R style file
# but adapted for use in a project with multiple plans
# but adapted for use in a project where there might be multiple plans in the same folder
# called using r_make() from make_cleanFeeders.R
# see https://books.ropensci.org/drake/projects.html#usage for explanation
# Libraries/Packages ----
# the drake book suggests putting this in packages.R but...
# Libraries ----
library(dataCleaning) # remember to build it first :-)
dataCleaning::setup() # load env.R set up the default paths etc
makeLibs <- c("data.table", # data munching
"drake", # for plans
"here", # here
"lubridate", # dates and times
"hms", # times
@@ -17,7 +20,7 @@ makeLibs <- c("data.table", # data munching
dataCleaning::loadLibraries(makeLibs)
# Parameters ----
updateData <- "yep" # edit this in any way (at all) to get drake to re-load the data
updateData <- "rerun" # edit this in any way (at all) to get drake to re-load the data
updateReport <- "yes" # edit this in any way (at all) to get drake to re-build the report
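# Editor's sketch (not part of this commit): the strings above work as manual
# triggers because drake treats any object a command references as a
# dependency - edit the value, the hash changes, the target re-runs, e.g.:
#
#   trigger <- "v1"                 # bump to "v2" to force a rebuild
#   plan <- drake::drake_plan(
#     dat = {
#       trigger                     # referenced, so tracked as a dependency
#       data.frame(x = 1:3)
#     }
#   )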
# Some data to play with:
@@ -32,6 +35,7 @@ authors <- "Ben Anderson & Ellis Ridett"
# Functions ----
# for use in drake
# the drake book suggests putting this in functions.R but...
addSeason <- function(dt,dateVar,h){
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
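# Editor's note: the diff truncates addSeason() here; a plausible completion
# (an assumption, not the repo's exact code) is a month -> season lookup,
# with h = "N"/"S" selecting the hemisphere:
#
#   lookup <- if (h == "S") {
#     c("Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter",
#       "Winter", "Winter", "Spring", "Spring", "Spring", "Summer")
#   } else {
#     c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
#       "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
#   }
#   dt <- dt[, season := lookup[tmpM]][, tmpM := NULL]
#   return(dt)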
@@ -79,7 +83,10 @@ getData <- function(f,updateData){
makeUniq <- function(dt){
# we suspect there may be duplicates by feeder_ID, dateTime & kW
# remove them (report this in the .Rmd)
uniq <- unique(dt, by = c("rDateTime", "feeder_ID", "kW"))
uniq <- unique(dt, by = c("rDateTime", # dateTime
"feeder_ID", # our constructed unique feeded ID
"kW") # kW
)
return(uniq)
}
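# Usage sketch (editor's illustration, toy data): rows sharing feeder_ID &
# rDateTime but with *different* kW values are deliberately kept - only
# exact matches on all three keys are dropped:
#
#   dt <- data.table::data.table(
#     rDateTime = as.POSIXct("2020-01-01 00:15", tz = "UTC"),
#     feeder_ID = "f1",
#     kW = c(2.0, 2.0, 3.1))
#   makeUniq(dt)   # 2 rows: the exact kW = 2.0 duplicate is removed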
@@ -150,6 +157,8 @@ makeReport <- function(f,version, type = "html", updateReport){
# Set the drake plan ----
# the drake book suggests putting this in plan.R but...
# I had expected r_make() to load drake() in the new clean R session but it doesn't
my_plan <- drake::drake_plan(
origData = getData(dFile, updateData), # returns data as data.table. If you edit 'updateData' in any way it will reload - drake is watching you!
uniqData = makeUniq(origData), # remove duplicates
@@ -162,4 +171,5 @@ my_plan <- drake::drake_plan(
)
# see https://books.ropensci.org/drake/projects.html#usage
drake_config(my_plan, verbose = 2)
\ No newline at end of file
# I had expected r_make() to load drake() in the new clean R session but it doesn't
drake::drake_config(my_plan, verbose = 2)
\ No newline at end of file
@@ -181,7 +181,7 @@ summary {
<h1 class="title toc-ignore">Testing electricity substation/feeder data</h1>
<h3 class="subtitle">Outliers and missing data...</h3>
<h4 class="author">Ben Anderson &amp; Ellis Ridett</h4>
<h4 class="date">Last run at: 2020-07-09 00:56:06</h4>
<h4 class="date">Last run at: 2020-07-09 09:48:01</h4>
 
</div>
 
@@ -214,7 +214,7 @@ dataCleaning::loadLibraries(rmdLibs)</code></pre>
<h1>Intro</h1>
<p>We have some electricity substation feeder data that has been cleaned to give mean kW per 15 minutes.</p>
<p>There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour &amp; day of the week...</p>
<p>Code used to generate this report: <a href="https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd" class="uri">https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd</a></p>
<p>The code used to generate this report is in: <a href="https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/" class="uri">https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/</a></p>
</div>
<div id="data-prep" class="section level1">
<h1>Data prep</h1>
@@ -226,11 +226,11 @@ dataCleaning::loadLibraries(rmdLibs)</code></pre>
uniqDataDT &lt;- drake::readd(uniqData) # readd the drake object
 
kableExtra::kable(head(origDataDT), digits = 2,
caption = &quot;Counts per feeder (long table)&quot;) %&gt;%
caption = &quot;First 6 rows of data&quot;) %&gt;%
kable_styling()</code></pre>
<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>
Counts per feeder (long table)
First 6 rows of data
</caption>
<thead>
<tr>
@@ -487,14 +487,16 @@ Winter
 
message(&quot;Unique data nrows: &quot;, tidyNum(nrow(uniqDataDT)))
 
message(&quot;So we have &quot;, tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), &quot; duplicates...&quot;)
nDups &lt;- tidyNum(nrow(origDataDT) - nrow(uniqDataDT))
message(&quot;So we have &quot;, tidyNum(nDups), &quot; duplicates...&quot;)
 
pc &lt;- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message(&quot;That's &quot;, round(pc,2), &quot;%&quot;)
 
feederDT &lt;- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT &lt;- NULL # save memory</code></pre>
<p>There were duplicates - that's 0.38 % of the observations loaded.</p>
<p>There were 83,606 duplicates - that's ~ 0.38 % of the observations loaded.</p>
<p>So we remove the duplicates...</p>
</div>
</div>
@@ -1500,7 +1502,7 @@ Fri
</div>
<div id="runtime" class="section level1">
<h1>Runtime</h1>
<p>Analysis completed in 196.02 seconds ( 3.27 minutes) using <a href="https://cran.r-project.org/package=knitr">knitr</a> in <a href="http://www.rstudio.com">RStudio</a> with R version 3.6.0 (2019-04-26) running on x86_64-redhat-linux-gnu.</p>
<p>Analysis completed in 218.48 seconds ( 3.64 minutes) using <a href="https://cran.r-project.org/package=knitr">knitr</a> in <a href="http://www.rstudio.com">RStudio</a> with R version 3.6.0 (2019-04-26) running on x86_64-redhat-linux-gnu.</p>
</div>
<div id="r-environment" class="section level1">
<h1>R environment</h1>
@@ -1539,8 +1541,8 @@ Fri
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.1.0 skimr_2.1.2 ggplot2_3.3.2 hms_0.5.3
## [5] lubridate_1.7.9 here_0.1 drake_7.12.4 data.table_1.12.0
## [1] kableExtra_1.1.0 drake_7.12.4 skimr_2.1.2 ggplot2_3.3.2
## [5] hms_0.5.3 lubridate_1.7.9 here_0.1 data.table_1.12.0
## [9] dataCleaning_0.1.0
##
## loaded via a namespace (and not attached):
@@ -4,9 +4,9 @@
# Set up ----
startTime <- proc.time()
library(drake)
# use r_make to run the plan inside a clean R session so nothing gets contaminated
drake::r_make(source = "_drakeCleanFeeders.R") # where we keep the drake plan etc
# we don't keep this in /R because that's where the package functions live
# we don't use "_drake.R" because we have lots of different plans