Commit b29c4a30 authored by Ben Anderson

Merge branch 'reRunEllisFullData' into 'master'

Re-run ellis full data

See merge request ba1e12/datacleaning!2
parents 1bec94da a65bba21
@@ -27,7 +27,6 @@ output:
toc: yes
toc_depth: 2
fig_width: 5
always_allow_html: yes
bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
---
@@ -64,7 +63,7 @@ We have some electricity substation feeder data that has been cleaned to give me
There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour & day of the week...
Code used to generate this report: https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd
The code used to generate this report is in: https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/
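For orientation, a minimal sketch of that 'best day' selection idea, assuming a deduplicated data.table `feederDT` with `feeder_ID`, `rDateTime`, `kW` and a `season` column (as added by `addSeason()` in the plan file below); the chunks later in the report do this with more care:

```r
library(data.table)
library(lubridate)

# count non-NA readings per feeder per day; at mean kW per 15 minutes
# a complete day should have 24 * 4 = 96 observations
aggDT <- feederDT[!is.na(kW),
                  .(nObs = .N),
                  by = .(feeder_ID, season, rDate = as.Date(rDateTime))]
aggDT[, propExpected := nObs/96]
aggDT[, rDoW := lubridate::wday(rDate, label = TRUE)]
aggDT[, rYear := lubridate::year(rDate)]

# keep the most complete day in each day-of-week/season/year cell
bestDT <- aggDT[order(-propExpected),
                .SD[1],
                by = .(feeder_ID, rDoW, season, rYear)]
```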
# Data prep
@@ -78,8 +77,7 @@ origDataDT <- drake::readd(origData) # readd the drake object
uniqDataDT <- drake::readd(uniqData) # readd the drake object
kableExtra::kable(head(origDataDT), digits = 2,
caption = "Counts per feeder (long table)") %>%
kable_styling()
caption = "First 6 rows of data")
```
Do a duplicate check by feeder_ID, dateTime & kW. In theory there should not be any.
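A sketch of what such a check can look like with data.table, assuming the column names used elsewhere in this file (`feeder_ID`, `rDateTime`, `kW`); `makeUniq()` in the plan file below does the actual removal with `unique()` over the same keys:

```r
library(data.table)

# rows sharing feeder_ID, dateTime & kW are duplicated readings
dupsDT <- origDataDT[, .N, by = .(feeder_ID, rDateTime, kW)][N > 1]
nrow(dupsDT) # 0 in theory...

# unique() keeps the first row per key combination
uniqDataDT <- unique(origDataDT, by = c("rDateTime", "feeder_ID", "kW"))
```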
@@ -89,16 +87,19 @@ message("Original data nrows: ", tidyNum(nrow(origDataDT)))
message("Unique data nrows: ", tidyNum(nrow(uniqDataDT)))
message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicates...")
nDups <- tidyNum(nrow(origDataDT) - nrow(uniqDataDT))
message("So we have ", tidyNum(nDups), " duplicates...")
pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message("That's ", round(pc,2), "%")
feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT <- NULL # save memory
```
There were `r tidyNum(nrow(origDataDT) - nrow(uniqDataDT))` duplicates - that's `r round(pc,2)` % of the observations loaded.
There were `r tidyNum(nDups)` duplicates - that's ~ `r round(pc,2)` % of the observations loaded.
So we remove the duplicates...
@@ -321,23 +322,19 @@ ggplot2::ggplot(aggDT, aes(x = rDate, colour = season,
aggDT[, rDoW := lubridate::wday(rDate, lab = TRUE)]
h <- head(aggDT[season == "Spring"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Spring days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Summer"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Summer days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Autumn"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Autumn days overall",
digits = 3) %>%
kable_styling()
digits = 3)
h <- head(aggDT[season == "Winter"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Winter days overall",
digits = 3) %>%
kable_styling()
digits = 3)
```
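The four per-season blocks above are identical apart from the filter, so they could be collapsed into a loop; a sketch, noting that kable output generated inside a loop needs an explicit `print()` in a `results='asis'` chunk:

```r
for (s in c("Spring", "Summer", "Autumn", "Winter")) {
  h <- head(aggDT[season == s][order(-propExpected)])
  print(kableExtra::kable(h,
                          caption = paste0("Best ", s, " days overall"),
                          digits = 3))
}
```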
# Summary
......
# basic _drake.R style file
# but adapted for use in a project with multiple plans
# but adapted for use in a project where there might be multiple plans in the same folder
# called using r_make() from make_cleanFeeders.R
# see https://books.ropensci.org/drake/projects.html#usage for explanation
# Libraries/Packages ----
# the drake book suggests putting this in packages.R but...
# Libraries ----
library(dataCleaning) # remember to build it first :-)
dataCleaning::setup() # load env.R to set up the default paths etc
makeLibs <- c("data.table", # data munching
"drake", # for plans
"here", # here
"lubridate", # dates and times
"hms", # times
@@ -17,7 +20,7 @@ makeLibs <- c("data.table", # data munching
dataCleaning::loadLibraries(makeLibs)
# Parameters ----
updateData <- "yep" # edit this in any way (at all) to get drake to re-load the data
updateData <- "rerun" # edit this in any way (at all) to get drake to re-load the data
updateReport <- "yes" # edit this in any way (at all) to get drake to re-run the report
# Some data to play with:
@@ -32,6 +35,7 @@ authors <- "Ben Anderson & Ellis Ridett"
# Functions ----
# for use in drake
# the drake book suggests putting this in functions.R but...
addSeason <- function(dt,dateVar,h){
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
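The diff truncates `addSeason()` here. A hypothetical completion for reference, assuming `h` is a hemisphere flag ("N"/"S") and a meteorological season mapping; the real body may differ:

```r
addSeason <- function(dt, dateVar, h){
  dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec)
  if (h == "N") { # northern hemisphere
    dt[tmpM %in% c(12, 1, 2), season := "Winter"]
    dt[tmpM %in% 3:5,  season := "Spring"]
    dt[tmpM %in% 6:8,  season := "Summer"]
    dt[tmpM %in% 9:11, season := "Autumn"]
  } else { # southern hemisphere: seasons shifted by six months
    dt[tmpM %in% c(12, 1, 2), season := "Summer"]
    dt[tmpM %in% 3:5,  season := "Autumn"]
    dt[tmpM %in% 6:8,  season := "Winter"]
    dt[tmpM %in% 9:11, season := "Spring"]
  }
  dt[, tmpM := NULL] # drop the temporary month column
  return(dt)
}
```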
@@ -79,7 +83,10 @@ getData <- function(f,updateData){
makeUniq <- function(dt){
# we suspect there may be duplicates by feeder_ID, dateTime & kW
# remove them (report this in the .Rmd)
uniq <- unique(dt, by = c("rDateTime", "feeder_ID", "kW"))
uniq <- unique(dt, by = c("rDateTime", # dateTime
"feeder_ID", # our constructed unique feeded ID
"kW") # kW
)
return(uniq)
}
@@ -150,6 +157,8 @@ makeReport <- function(f,version, type = "html", updateReport){
# Set the drake plan ----
# the drake book suggests putting this in plan.R but...
# I had expected r_make() to load drake in the new clean R session but it doesn't
my_plan <- drake::drake_plan(
origData = getData(dFile, updateData), # returns data as data.table. If you edit 'update' in any way it will reload - drake is watching you!
uniqData = makeUniq(origData), # remove duplicates
@@ -162,4 +171,5 @@ my_plan <- drake::drake_plan(
)
# see https://books.ropensci.org/drake/projects.html#usage
drake_config(my_plan, verbose = 2)
\ No newline at end of file
# I had expected r_make() to load drake in the new clean R session but it doesn't
drake::drake_config(my_plan, verbose = 2)
\ No newline at end of file
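Pulling the pieces together, a minimal sketch of the two-file r_make() pattern this commit uses; the launcher name and plan file name come from the comments above, the rest of the contents are assumed:

```r
# make_cleanFeeders.R - the launcher, run interactively
library(drake)
drake::r_make(source = "_drakeCleanFeeders.R") # plan runs in a clean session

# _drakeCleanFeeders.R - the plan file diffed above (sketched)
# updateData <- "rerun" # drake tracks this value; any edit
#                       # invalidates targets that take it as an argument
# my_plan <- drake::drake_plan(
#   origData = getData(dFile, updateData),
#   uniqData = makeUniq(origData)
# )
# drake::drake_config(my_plan, verbose = 2) # r_make() picks this config up
```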
@@ -181,7 +181,7 @@ summary {
<h1 class="title toc-ignore">Testing electricity substation/feeder data</h1>
<h3 class="subtitle">Outliers and missing data...</h3>
<h4 class="author">Ben Anderson &amp; Ellis Ridett</h4>
<h4 class="date">Last run at: 2020-07-09 00:56:06</h4>
<h4 class="date">Last run at: 2020-07-09 09:48:01</h4>
 
</div>
 
@@ -214,7 +214,7 @@ dataCleaning::loadLibraries(rmdLibs)</code></pre>
<h1>Intro</h1>
<p>We have some electricity substation feeder data that has been cleaned to give mean kW per 15 minutes.</p>
<p>There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour &amp; day of the week...</p>
<p>Code used to generate this report: <a href="https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd" class="uri">https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd</a></p>
<p>The code used to generate this report is in: <a href="https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/" class="uri">https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/</a></p>
</div>
<div id="data-prep" class="section level1">
<h1>Data prep</h1>
@@ -226,11 +226,11 @@ dataCleaning::loadLibraries(rmdLibs)</code></pre>
uniqDataDT &lt;- drake::readd(uniqData) # readd the drake object
 
kableExtra::kable(head(origDataDT), digits = 2,
caption = &quot;Counts per feeder (long table)&quot;) %&gt;%
caption = &quot;First 6 rows of data&quot;) %&gt;%
kable_styling()</code></pre>
<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>
Counts per feeder (long table)
First 6 rows of data
</caption>
<thead>
<tr>
@@ -487,14 +487,16 @@ Winter
 
message(&quot;Unique data nrows: &quot;, tidyNum(nrow(uniqDataDT)))
 
message(&quot;So we have &quot;, tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), &quot; duplicates...&quot;)
nDups &lt;- tidyNum(nrow(origDataDT) - nrow(uniqDataDT))
message(&quot;So we have &quot;, tidyNum(nDups), &quot; duplicates...&quot;)
 
pc &lt;- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message(&quot;That's &quot;, round(pc,2), &quot;%&quot;)
 
feederDT &lt;- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT &lt;- NULL # save memory</code></pre>
<p>There were duplicates - that's 0.38 % of the observations loaded.</p>
<p>There were 83,606 duplicates - that's ~ 0.38 % of the observations loaded.</p>
<p>So we remove the duplicates...</p>
</div>
</div>
@@ -1500,7 +1502,7 @@ Fri
</div>
<div id="runtime" class="section level1">
<h1>Runtime</h1>
<p>Analysis completed in 196.02 seconds ( 3.27 minutes) using <a href="https://cran.r-project.org/package=knitr">knitr</a> in <a href="http://www.rstudio.com">RStudio</a> with R version 3.6.0 (2019-04-26) running on x86_64-redhat-linux-gnu.</p>
<p>Analysis completed in 218.48 seconds ( 3.64 minutes) using <a href="https://cran.r-project.org/package=knitr">knitr</a> in <a href="http://www.rstudio.com">RStudio</a> with R version 3.6.0 (2019-04-26) running on x86_64-redhat-linux-gnu.</p>
</div>
<div id="r-environment" class="section level1">
<h1>R environment</h1>
@@ -1539,8 +1541,8 @@ Fri
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.1.0 skimr_2.1.2 ggplot2_3.3.2 hms_0.5.3
## [5] lubridate_1.7.9 here_0.1 drake_7.12.4 data.table_1.12.0
## [1] kableExtra_1.1.0 drake_7.12.4 skimr_2.1.2 ggplot2_3.3.2
## [5] hms_0.5.3 lubridate_1.7.9 here_0.1 data.table_1.12.0
## [9] dataCleaning_0.1.0
##
## loaded via a namespace (and not attached):
@@ -4,9 +4,9 @@
# Set up ----
startTime <- proc.time()
library(drake)
# use r_make to run the plan inside a clean R session so nothing gets contaminated
drake::r_make(source = "_drakeCleanFeeders.R") # where we keep the drake plan etc
# we don't keep this in /R because that's where the package functions live
# we don't use "_drake.R" because we have lots of different plans
......