
Commit c11387e4 authored by B.Anderson


amended to reflect slight restructure: only Rmd in /Rmd; drake make & plans at top level (not in /R as drake recommends); outputs into /docs
parent af57f9dc
......@@ -12,6 +12,8 @@ This repo is an R package. This means:
* package functions are kept in /R
* help files auto-created by roxygen are in /man
* if you clone it you can build it and use the functions
* Rmd scripts for reporting the results of drake plans are in /Rmd
* outputs are kept in /docs (reports, plots etc)
* you (and we) keep your data out of it!
We'd love your contributions - feel free to:
......
---
title: 'Cleaning Electricity Substation Feeder Data'
subtitle: 'Code and notes'
author: "Ben Anderson (b.anderson@soton.ac.uk), [SERG, Energy & Climate Change](http://www.energy.soton.ac.uk/), University of Southampton"
params:
subtitle: ""
title: ""
authors: ""
title: '`r params$title`'
subtitle: '`r params$subtitle`'
author: '`r params$authors`'
date: 'Last run at: `r Sys.time()`'
output:
bookdown::pdf_document2:
bookdown::html_document2:
code_folding: hide
fig_caption: yes
keep_tex: yes
number_sections: yes
self_contained: yes
toc: yes
toc_depth: 2
toc_depth: 3
toc_float: yes
bookdown::word_document2:
fig_caption: yes
toc: yes
toc_depth: 2
bookdown::html_document2:
code_folding: hide
bookdown::pdf_document2:
fig_caption: yes
keep_tex: yes
number_sections: yes
self_contained: yes
toc: yes
toc_depth: 3
toc_float: yes
toc_depth: 2
always_allow_html: yes
bibliography: '`r path.expand("~/bibliography.bib")`'
---
......@@ -36,49 +40,20 @@ knitr::opts_chunk$set(message = FALSE) # for final tidy run
startTime <- proc.time()
# Libraries ----
library(dataCleaning)
dataCleaning::setup()
reqLibs <- c("data.table", # cos we like data.table (you may not in which case dplyr is fine :-)
"lubridate", # for data & time manip
"hms", # for hh:mm:ss if we need it
"ggplot2", # fancy plots
"kableExtra", # for extra kable
"skimr")# for skim (data description)
dataCleaning::loadLibraries(reqLibs) # Use Luke's function to load them
rmdLibs <- c("kableExtra" # tables
)
# load them
dataCleaning::loadLibraries(rmdLibs)
# Parameters ----
dFile <- "~/Dropbox/Ben_IOW_SS.csv" # edit for your set up
#dFile <- "~/Dropbox/Ben_IOW_SS.csv" # edit for your set up
# Functions ----
# put more general ones that could be useful to everyone in /R so they are built into the package.
# put functions relevant to this analysis here
# add a season - could become a package function?
addSeason <- function(dt, dateVar, h){ # h = hemisphere: "S" (southern) or "N" (northern)
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
if(h == "S"){
dt <- dt[, season := "Summer"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Autumn"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Winter"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Spring"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
if(h == "N"){
dt <- dt[, season := "Winter"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Spring"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Summer"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Autumn"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
dt$tmpM <- NULL
return(dt)
}
```
......@@ -94,154 +69,142 @@ Code used to generate this report: https://git.soton.ac.uk/ba1e12/spatialec/-/bl
## Load data
Loading from `r dFile`...
Loaded data from `r dFile`... (using drake)
```{r loadData}
feederDT <- data.table::fread(dFile) # fast
feederDT[, V1 := NULL] #drop
head(feederDT)
origDataDT <- drake::readd(origData) # readd the drake object
head(origDataDT)
uniqDataDT <- drake::readd(uniqData) # readd the drake object
```
Do some data prep
Check data prep worked OK.
```{r dataPrep}
feederDT[, rDateTime := lubridate::as_datetime(DateTime)]
feederDT[, rTime := hms::as_hms(rDateTime)]
feederDT[, rDate := as.Date(rDateTime)]
feederDT[, rYear := lubridate::year(rDate)]
feederDT[, rDoW := lubridate::wday(rDate, label = TRUE)]
feederDT[, kW := Value] # for clarity
feederDT[, Value := NULL] #drop
feederDT <- addSeason(feederDT,
dateVar = "rDateTime",
h = "N")
#skimr::skim(feederDT) doesn't play nicely with latex
st <- summary(feederDT[,.(rYear, rDateTime,
sub_region, kW, rDoW, season)])
kableExtra::kable(st,
caption = "Summary of loaded and processed data") %>%
# check
t <- origDataDT[, .(nObs = .N,
firstDate = min(rDateTime),
lastDate = max(rDateTime),
meankW = mean(kW, na.rm = TRUE)
), keyby = .(region, feeder_ID)]
kableExtra::kable(t, digits = 2,
caption = "Counts per feeder (long table)") %>%
kable_styling()
```
Do a sense check by season & feeder.
Do a duplicate check by feeder_ID, dateTime & kW. In theory there should not be any.
```{r senseCheck}
t <- feederDT[, .(mean_kW = mean(kW),
nObs = .N), keyby = .(rYear, sub_region,season)]
```{r checkForDuplicates}
message("Original data nrows: ", tidyNum(nrow(origDataDT)))
NAt <- feederDT[is.na(kW), .(nObs = .N), keyby = .(rYear, sub_region,season)]
message("Unique data nrows: ", tidyNum(nrow(uniqDataDT)))
kableExtra::kable(t,
caption = "Sense check on raw data") %>%
kable_styling()
```
message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicates...")
Histograms of kW by season and feeder...
pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message("That's ", round(pc,2), "%")
```{r histo}
ggplot2::ggplot(feederDT, aes(x = kW)) +
geom_histogram(binwidth = 5) +
facet_grid(sub_region ~ season)
feederDT <- uniqDataDT # use dt with no duplicates
origDataDT <- NULL # save memory
```
Is that what we expect to see?
Try looking at the distribution of kW by time of day, day of the week, feeder and season. This plots all of the data, so it may take some time if you have a lot of it. The smoothed lines are per season/feeder/day, generated by ggplot using mgcv::gam.
So we remove the duplicates...
```{r smoothedProfiles}
ggplot2::ggplot(feederDT, aes(x = rTime, y = kW,
colour = season)) +
geom_point(size = 0.2, stroke = 1, shape = 16) +
geom_smooth() + # uses mgcv::gam unless n < 1000
facet_grid(sub_region ~ rDoW)
```
# Basic patterns
There are clearly some outliers but the shapes look right... **What is happening with E2L5??**
Try aggregated demand profiles of mean kW by season and feeder and day of the week...
Try aggregated demand profiles of mean kW by season and feeder and day of the week... Remove the legend so we can see the plot.
```{r kwProfiles}
plotDT <- feederDT[, .(meankW = mean(kW)), keyby = .(rTime, season, sub_region, rDoW)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = season)) +
plotDT <- feederDT[, .(meankW = mean(kW),
nObs = .N), keyby = .(rTime, season, feeder_ID, rDoW)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = feeder_ID)) +
geom_line() +
facet_grid(sub_region ~ rDoW)
theme(legend.position="none") + # remove legend so we can see the plot
facet_grid(season ~ rDoW)
```
Is that what we expect?
# Test for missing
Can we see missing data?
Number of observations per feeder per day - completely missing days will show as gaps and partially missing days as low counts. We would expect 24 * 4 = 96 observations per day, so we convert the counts to a % of expected...
```{r basicCountTile, fig.height=10}
plotDT <- feederDT[, .(nObs = .N), keyby = .(rDate, feeder_ID)]
plotDT[, propExpected := nObs/(24*4)]
```{r missingVis}
ggplot2::ggplot(feederDT, aes(x = rDate, y = rTime, fill = kW)) +
ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = 100*propExpected)) +
geom_tile() +
facet_grid(sub_region ~ .) +
scale_fill_viridis_c() +
labs(caption = paste0("All kW values by feeder, time (y) and date (x)",
"\nGrey (blank) regions = completely missing dateTimes")
)
scale_x_date(date_breaks = "3 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="% expected")
```
oooh. That's not good. There shouldn't be any blank dateTimes.
This is not good. There are both gaps (missing days) and partial days. **Lots** of partial days. Why is the data relatively good up to the end of 2003?
What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so we should get this many at best. How close do we get?
Try aggregating to hours...
```{r aggVisN}
First the mean kW:
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
```{r aggVis}
feederDT[, rHour := lubridate::hour(rDateTime)] # add hour to the data
plotDT <- feederDT[, .(mean_kW = mean(kW), # aggregate
nObs = .N),
keyby = .(rHour, rDate, sub_region)]
plotDT[, propExpected := nObs/uniqueN(feederDT$feeder_ID)] # we now have all feeders per time so...
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour,
fill = mean_kW)) + #plot mean kW
ggplot2::ggplot(plotDT, aes(x = rDate, y = rTime, fill = 100*propExpected)) +
geom_tile() +
scale_fill_viridis_c() +
facet_grid(sub_region ~ .) +
labs(caption = "Mean kW per hour")
scale_x_date(date_breaks = "6 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="% expected")
```
Now the number of observations per hour. The cleaned data is at 15 minutes so we should expect up to 4 observations per hour. How close do we get?
That really doesn't look too good. There are some very odd fluctuations in there. And something changed after 2003...
```{r aggVisN}
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = nObs)) +
What do the mean kw patterns look like per feeder per day?
```{r basickWTile, fig.height=10}
plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)), keyby = .(rDate, feeder_ID)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = meankW)) +
geom_tile() +
scale_fill_viridis_c() +
facet_grid(sub_region ~ .) +
labs(caption = paste0("Number of obs per hour (should be 4)",
"\nGrey (blank) regions = completely missing dateTimes")
)
scale_x_date(date_breaks = "3 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="Mean kW")
```
Well one of them hits the 4 per hour most of the time. But what about the evenings? And what about the other feeders???
Missing data is even more clearly visible.
There is a suspicion that as mean kW goes up, so does the number of observations per hour...
What about mean kW across all feeders?
```{r aggVisBox}
ggplot2::ggplot(plotDT, aes(x = nObs, y = mean_kW, group = nObs)) +
geom_boxplot() +
facet_grid(.~sub_region) +
labs(caption = "mean kW per hour by number of observations contributing")
```{r aggViskW}
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = rTime, fill = meankW)) +
geom_tile() +
scale_x_date(date_breaks = "6 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="kW")
```
Hmm. Maybe...
# Which days have the 'least' missing?
This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then seeing if the dateTimes are contiguous.
```{r testDateTimes}
dateTimesDT <- feederDT[, .(nFeeders = uniqueN(sub_region)),
keyby = .(rDateTime, rTime, rDate, season)] # keep season
dateTimesDT <- feederDT[, .(nFeeders = uniqueN(feeder_ID),
meankW = mean(kW, na.rm = TRUE)),
keyby = .(rDateTime, rTime, rDate, season)] # keep season
dateTimesDT[, dtDiff := rDateTime - shift(rDateTime)] # should be 15 mins
......@@ -257,31 +220,110 @@ ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
labs(caption = "Number of unique feeders in each dateTime")
```
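One way to summarise the contiguity test is to tabulate the gaps between successive dateTimes: in a contiguous quarter-hourly series every gap should be exactly 15 minutes. A quick sketch (assuming `dateTimesDT` keeps the `dtDiff` column created above):

```{r gapCheck}
# tabulate the gap lengths - anything other than 15 minutes means at least one completely missing dateTime
dateTimesDT[, .N, keyby = .(gapMinutes = as.numeric(dtDiff, units = "mins"))]
# list the largest gaps so we can see when whole periods go missing
head(dateTimesDT[order(-dtDiff), .(rDateTime, dtDiff, nFeeders)], 10)
```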
No. We can clearly have some dateTimes where we have no data _at all_!
No. As we suspected from the previous plots, we clearly have some dateTimes where we have no data _at all_!
Are there time of day patterns? It looks like it...
```{r missingProfiles}
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders)), keyby = .(rTime, season)]
dateTimesDT[, rYear := lubridate::year(rDateTime)]
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders),
meankW = mean(meankW)), keyby = .(rTime, season, rYear)]
ggplot2::ggplot(plotDT, aes(y = meanN, x = rTime, colour = season)) +
geom_line() +
geom_line() +
facet_wrap(rYear ~ .) +
labs(y = "Mean n feeders reporting",
caption = "Mean n feeders by time of day")
```
Oh yes. Why?
Oh yes. After 2003. Why?
# Summary
What about the kW?
```{r kWProfiles}
So...
ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
geom_line() +
facet_wrap(rYear ~ .) +
labs(y = "Mean kw reporting",
caption = "Mean kw by time of day")
```
Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up, so does the number of observations per hour... although this could just be a correlation with low-demand periods (night time?)
```{r compareProfiles}
ggplot2::ggplot(plotDT, aes(y = meankW, x = meanN, colour = season)) +
geom_point() +
facet_wrap(rYear ~ .) +
labs(y = "Mean kw per quarter hour",
x = "Mean number feeders reporting")
```
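We can also put a number on that suspicion by computing the correlation between the mean number of feeders reporting and the mean kW within each year (a quick sketch, assuming `plotDT` as built above):

```{r corCheck}
# Pearson correlation between mean n feeders reporting and mean kW, by year
plotDT[, .(pearsonR = cor(meanN, meankW, use = "complete.obs")), keyby = rYear]
```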
I find it distinctly odd that:
Yes. The higher the kW, the more observations we get from 2004 onwards. Why?
* we appear to have the most feeders reporting data at 'peak' times
* we have a lot of missing dateTimes between 00:30 and 05:00
It is distinctly odd that after 2003:
* we appear to have the most feeders reporting data at 'peak' times
* we have a lot of missing dateTimes between 00:30 and 05:00
If the monitors were set to only collect data when the power (or Wh in a given time frame) was above a given threshold then it would look like this... That wouldn't happen... would it?
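As a purely illustrative toy example (none of these numbers come from the real data), a smooth daily demand profile that is only 'logged' above a hypothetical threshold reproduces exactly this kind of overnight gap:

```{r thresholdToy}
# toy simulation: one day of quarter-hourly demand with a crude sinusoidal shape peaking around midday
toyDT <- data.table::data.table(rTime = seq(0, 23.75, by = 0.25))
toyDT[, kW := 50 + 30 * sin((rTime - 6) * pi / 12)]
# pretend the monitor only records a value when demand exceeds a (hypothetical) 40 kW threshold
toyDT[, logged := kW > 40]
# most of the 'missing' readings fall overnight, when demand is lowest
toyDT[, .(nExpected = .N, nLogged = sum(logged))]
```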
# Selecting the 'best' days
Here we use a wide form of the feeder data which has each feeder as a column.
We should have `r uniqueN(feederDT$feeder_ID)` feeders. We want to find days when all of these feeders have complete data.
The wide dataset has a count of NAs per row (dateTime) from which we infer how many feeders are reporting:
```{r}
wDT <- drake::readd(wideData) # back from the drake
head(wDT)
```
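Each feeder is a column, alongside `rDateTime`, `nNA` and `nFeedersReporting` (see `toWide()` in the cleaning code below). A quick sanity check on the inferred count (a sketch, assuming those columns are present):

```{r checkWide}
# nFeedersReporting counts the non-NA feeder columns per dateTime (see toWide() below)
summary(wDT$nFeedersReporting)
# how many dateTimes have every feeder reporting?
nrow(wDT[nNA == 0])
```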
If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when _all_ feeders have _all_ data (since it would be the mean of all the '25's).
```{r testWide}
wDT <- addSeason(wDT, dateVar = "rDateTime", h = "N")
wDT[, rDoW := lubridate::wday(rDateTime)]
wDT[, rDate := lubridate::date(rDateTime)]
# how many days have all feeders sending data in all dateTimes?
aggDT <- wDT[, .(meanOK = mean(nFeedersReporting),
minOk = min(nFeedersReporting),
maxOk = max(nFeedersReporting),
sumOK = sum(nFeedersReporting) # will have a max of n feeders * 24 hours * 4 quarter hours
),
keyby = .(rDate, season)]
aggDT[, propExpected := sumOK/(uniqueN(feederDT$feeder_ID)*24*4)] # we expect 25*24*4
summary(aggDT)
message("How many days have 100%?")
nrow(aggDT[propExpected == 1])
```
If we plot the mean then we will see which days get closest to having a full dataset.
```{r bestDaysMean}
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = meanOK)) + geom_point()
```
Re-plot by the % of expected if we assume we _should_ have 25 feeders * 24 hours * 4 per hour = 2,400 observations per day (the shape will be the same):
```{r bestDaysProp}
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = 100*propExpected)) + geom_point() +
labs(y = "%")
```
This also tells us that something is driving the fluctuations in the number of data points per hour after 2003.
# Summary
So there are no days with 100% data. We need a different approach.
# Runtime
```{r check runtime, include=FALSE}
......@@ -295,15 +337,15 @@ Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` min
## R packages used
* base R [@baseR]
* bookdown [@bookdown]
* data.table [@data.table]
* ggplot2 [@ggplot2]
* kableExtra [@kableExtra]
* knitr [@knitr]
* lubridate [@lubridate]
* rmarkdown [@rmarkdown]
* skimr [@skimr]
## Session info
......@@ -311,7 +353,7 @@ Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` min
sessionInfo()
```
# The data cleaning code
# The raw data cleaning code
(c) Mikey Harper :-)
......@@ -516,7 +558,7 @@ processAMPS_5mins <- function(filePath){
mutate(Time = lubridate::floor_date(Time, unit = "5 minutes")) %>%
group_by(Time, region, code) %>%
summarise(Value = mean(Value)) %>%
arrange(Time)
# There are some datasets which contain no values, which can cause errors when run
# If this happens, return NULL
......
# basic _drake.R style file
# but adapted for use in a project with multiple plans
# Libraries ----
library(dataCleaning) # remember to build it first :-)
dataCleaning::setup() # load env.R set up the default paths etc
makeLibs <- c("data.table", # data munching
"drake", # for plans
"here", # here
"lubridate", # dates and times
"hms", # times
"ggplot2", # plots
"skimr" # for skim
)
# load them
dataCleaning::loadLibraries(makeLibs)
# Parameters ----
# Some data to play with:
dFile <- "/mnt/SERG_data/Ellis_IOW/Cleaned_SS_Amps/amps_all_substations.csv.gz" # edit for your set up
rmdFile <- "cleaningFeederData" # <- name of the .Rmd file to run at the end
version <- "_allData"
title <- "Testing electricity substation/feeder data"
subtitle <- "Outliers and missing data..."
authors <- "Ben Anderson & Ellis Ridett"
# Functions ----
# for use in drake
addSeason <- function(dt, dateVar, h){ # h = hemisphere: "S" (southern) or "N" (northern)
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
if(h == "S"){
dt <- dt[, season := "Summer"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Autumn"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Winter"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Spring"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
if(h == "N"){
dt <- dt[, season := "Winter"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Spring"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Summer"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Autumn"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
dt$tmpM <- NULL
return(dt)
}
getData <- function(f,update){
# gets the data
dt <- data.table::fread(f)
dt[, rDateTime := lubridate::as_datetime(Time)] # the dateTime is now called Time!!!
dt[, rTime := hms::as_hms(rDateTime)]
dt[, rDate := as.Date(rDateTime)]
dt[, rYear := lubridate::year(rDate)]
dt[, rDoW := lubridate::wday(rDate, label = TRUE)]
dt[, kW := Value] # for clarity
dt[, Value := NULL] #drop
# set sub_region to region when sub_region not set (only 1 feeder)
dt[, sub_region := ifelse(sub_region == "", region, # set to region if empty
sub_region)]
# some of the feeder IDs are re-used across the substations
dt[, feeder_ID := paste0(region,"_", sub_region)] # create a unique feeder ID Ellis!!!!!
dt <- addSeason(dt, dateVar = "rDateTime", h = "N") # do this here so it's done
return(dt)
}
makeUniq <- function(dt){
# we suspect there may be duplicates by feeder_ID, dateTime & kW
# remove them (report this in the .Rmd)
uniq <- unique(dt, by = c("rDateTime", "feeder_ID", "kW")) # keeps the first occurrence of each combination
return(uniq)
}
toWide <- function(dt){
# converts to wide form so each feeder is in a column - so we can check across feeders for dateTimes with no missing data
# dt <- feederDT
wDT <- dcast(dt, rDateTime ~ feeder_ID,
fun.aggregate = "mean", # irrelevant, no aggregation happening
fill = NA, # missing dateTimes will be NA
#drop = FALSE, # keep all missing combinations - don't do this, bloats the data with useless NAs
value.var = "kW")
wDT$nNA <- rowSums(is.na(wDT))
colCount <- ncol(wDT)
wDT[, nFeedersReporting := colCount - nNA - 2] # 2 as rDateTime and nNA = 2 cols
return(wDT)
}
saveData <- function(dt, which){
# Save the (newly) cleaned data
if(which == "L"){
# long data
of <- "/mnt/SERG_data/Ellis_IOW/Cleaned_SS_Amps/uniq/uniq_amps_all_substations.csv"
data.table::fwrite(dt[, .(rDateTime, region, sub_region, feeder_ID, kW)], # only the vars we can't recreate
of)
cmd <- paste0("gzip -f ", of)
message("Gzip file: ", of)
try(system(cmd)) # seems to throw an error on the RStudio server but it still works
message("Done ")
}
if(which == "W"){
# wide data (feeders as columns)
of <- "/mnt/SERG_data/Ellis_IOW/Cleaned_SS_Amps/uniq/uniq_amps_all_substations_wide.csv"
# remove the vars we don't need?
dt$nNA <- NULL
data.table::fwrite(dt, of)
cmd <- paste0("gzip -f ", of)
message("Gzip file: ", of)
try(system(cmd)) # seems to throw an error on the RStudio server but it still works
message("Done ")
}
}
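# A plan wiring these functions together might look roughly like the sketch below.
# NB: illustrative only - the actual plan for this repo is not shown in this commit, and the
# 'update' argument to getData() (unused inside the function) is assumed to act as a manual
# trigger to force a re-load.
sketchPlan <- drake::drake_plan(
  origData = getData(dFile, update = TRUE),    # load & prep the raw feeder data
  uniqData = makeUniq(origData),               # drop duplicate feeder_ID/dateTime/kW rows
  wideData = toWide(uniqData),                 # one column per feeder, for missing-data checks
  saveLong = saveData(uniqData, which = "L"),  # write the cleaned long data
  saveWide = saveData(wideData, which = "W"),  # write the wide data
  report = makeReport(rmdFile, version)        # render the .Rmd report at the end
)
# drake::make(sketchPlan)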
makeReport <- function(f,version, type = "html"){
# default = html
message("Rendering ", f, ".Rmd (version: "<