---
params:
  subtitle: ""
  title: ""
  authors: ""
title: '`r params$title`'
subtitle: '`r params$subtitle`'
author: '`r params$authors`'
date: 'Last run at: `r Sys.time()`'
output:
  bookdown::html_document2:
    self_contained: no
    fig_caption: yes
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_depth: 2
    toc_float: TRUE
  bookdown::pdf_document2:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_depth: 2
  bookdown::word_document2:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_depth: 2
    fig_width: 5
bibliography: '`r paste0(here::here(), "/bibliography.bib")`'
---
# Knitr setup ----
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE) # for final tidy run
knitr::opts_chunk$set(message = FALSE) # for final tidy run
# Set start time ----
startTime <- proc.time()
# Libraries ----
rmdLibs <- c("data.table", # uniqueN(), shift() etc. are used unqualified below
             "ggplot2",    # aes(), geom_line() etc. are used unqualified below
             "kableExtra"  # tables
)
# load them
dataCleaning::loadLibraries(rmdLibs)
# Parameters ----
#dFile <- "~/Dropbox/Ben_IOW_SS.csv" # edit for your set up
# Functions ----
# put more general ones that could be useful to everyone in /R so they are built into the package.
# put functions relevant to this analysis here
Intro
We have some electricity substation feeder data that has been cleaned to give mean kW per 15 minutes.
There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour & day of the week...
The code used to generate this report is in: https://git.soton.ac.uk/ba1e12/dataCleaning/Rmd/
Data prep
Load data
Loaded data from `r dFile` ... (using drake)
origDataDT <- drake::readd(origData) # readd the drake object
uniqDataDT <- drake::readd(uniqData) # readd the drake object
kableExtra::kable(head(origDataDT), digits = 2,
caption = "First 6 rows of data")
Do a duplicate check by feeder_ID, dateTime & kW. In theory there should not be any.
message("Original data nrows: ", tidyNum(nrow(origDataDT)))
message("Unique data nrows: ", tidyNum(nrow(uniqDataDT)))
nDups <- nrow(origDataDT) - nrow(uniqDataDT)
message("So we have ", tidyNum(nDups), " duplicates...")
pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))
message("That's ", round(pc,2), "%")
feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates
origDataDT <- NULL # save memory
There were `r tidyNum(nDups)` duplicates - that's ~ `r round(pc,2)`% of the observations loaded. So we remove the duplicates...
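The de-duplication itself is done upstream in the drake plan. For reference, here is a minimal, self-contained data.table sketch of the same idea using toy data (not the real feeder table):
# toy illustration of duplicate removal by key columns (not the real feeder data)
toyDT <- data.table::data.table(feeder_ID = c("A", "A", "B"),
                                rDateTime = as.POSIXct("2003-01-01 00:15", tz = "UTC"),
                                kW = c(10, 10, 12)) # first two rows are exact duplicates
uniqToyDT <- unique(toyDT, by = c("feeder_ID", "rDateTime", "kW")) # keeps the first of each duplicate set
nrow(toyDT) - nrow(uniqToyDT) # 1 duplicate removed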
Basic patterns
Try aggregated demand profiles of mean kW by season, feeder and day of the week... Remove the legend so we can see the plot.
plotDT <- feederDT[, .(meankW = mean(kW),
nObs = .N), keyby = .(rTime, season, feeder_ID, rDoW)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = feeder_ID)) +
geom_line() +
theme(legend.position="none") + # remove legend so we can see the plot
facet_grid(season ~ rDoW)
Is that what we expect?
Test for missing
Number of observations per feeder per day - gaps (totally missing days) will be visible, as will low counts (partially missing days) - we would expect 24 * 4 = 96 per day... Convert this to a % of expected...
plotDT <- feederDT[, .(nObs = .N), keyby = .(rDate, feeder_ID)]
plotDT[, propExpected := nObs/(24*4)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = 100*propExpected)) +
geom_tile() +
scale_x_date(date_breaks = "3 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="% expected")
This is not good. There are both gaps (missing days) and partial days. Lots of partial days. Why is the data relatively good up to the end of 2003?
What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so at best we should get this many per dateTime. How close do we get?
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
plotDT[, propExpected := nObs/uniqueN(feederDT$feeder_ID)] # we now have all feeders per time so...
ggplot2::ggplot(plotDT, aes(x = rDate, y = rTime, fill = 100*propExpected)) +
geom_tile() +
scale_x_date(date_breaks = "6 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="% expected")
That really doesn't look too good. There are some very odd fluctuations in there. And something changed after 2003...
What do the mean kW patterns look like per feeder per day?
plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)), keyby = .(rDate, feeder_ID)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = feeder_ID, fill = meankW)) +
geom_tile() +
scale_x_date(date_breaks = "3 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="Mean kW")
Missing data is even more clearly visible.
What about mean kW across all feeders?
plotDT <- feederDT[, .(nObs = .N,
meankW = mean(kW)), keyby = .(rTime, rDate, season)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = rTime, fill = meankW)) +
geom_tile() +
scale_x_date(date_breaks = "6 months", date_labels = "%B %Y") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="bottom") +
scale_fill_viridis_c(name="kW")
Which days have the 'least' missing?
This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then seeing if the dateTimes are contiguous.
dateTimesDT <- feederDT[, .(nFeeders = uniqueN(feeder_ID),
meankW = mean(kW, na.rm = TRUE)),
keyby = .(rDateTime, rTime, rDate, season)] # keep season
dateTimesDT[, dtDiff := rDateTime - shift(rDateTime)] # should be 15 mins
summary(dateTimesDT)
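To make the contiguity test explicit we can tabulate the gaps between successive dateTimes; a minimal sketch (coercing dtDiff to minutes first, since difftime units can vary):
# tabulate gap sizes between successive dateTimes - anything > 15 minutes is a missing period
dateTimesDT[, dtDiffMins := as.numeric(dtDiff, units = "mins")] # force consistent units
table(dateTimesDT[dtDiffMins > 15, dtDiffMins]) # how big are the gaps, and how often do they occur?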
Let's see how many unique feeders we have per dateTime. Surely we have at least one sending data in each 15 minute period?
ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
geom_tile() +
scale_fill_viridis_c() +
labs(caption = "Number of unique feeders in each dateTime")
No. As we suspected from the previous plots, we clearly have some dateTimes where we have no data at all!
Are there time of day patterns? It looks like it...
dateTimesDT[, rYear := lubridate::year(rDateTime)]
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders),
meankW = mean(meankW)), keyby = .(rTime, season, rYear)]
ggplot2::ggplot(plotDT, aes(y = meanN, x = rTime, colour = season)) +
geom_line() +
facet_wrap(rYear ~ .) +
labs(y = "Mean n feeders reporting",
caption = "Mean n feeders by time of day")
Oh yes. After 2003. Why?
What about the kW?
ggplot2::ggplot(plotDT, aes(y = meankW, x = rTime, colour = season)) +
geom_line() +
facet_wrap(rYear ~ .) +
labs(y = "Mean kw reporting",
caption = "Mean kw by time of day")
Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up, so does the number of observations per hour... although this could just be a correlation with low demand periods (night time?)
ggplot2::ggplot(plotDT, aes(y = meankW, x = meanN, colour = season)) +
geom_point() +
facet_wrap(rYear ~ .) +
labs(y = "Mean kw per quarter hour",
x = "Mean number feeders reporting")
Yes. The higher the kW, the more observations we get from 2004 onwards. Why?
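As a quick numeric check we can compute the correlation between the mean number of feeders reporting and mean kW by year, using the plotDT built above:
# correlation between n feeders reporting and mean kW, by year
plotDT[, .(r = cor(meanN, meankW, use = "complete.obs")), keyby = rYear]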
It is distinctly odd that after 2003:
- we appear to have the most feeders reporting data at 'peak' times
- we have a lot of missing dateTimes between 00:30 and 05:00
If the monitors were set to only collect data when the power (or Wh in a given time frame) was above a given threshold then it would look like this... That wouldn't happen... would it?
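As a thought experiment, a minimal simulation (with purely hypothetical numbers) shows that threshold-triggered logging would produce exactly this pattern - more readings surviving at high-demand times of day:
# hypothetical: a feeder that only logs when demand exceeds a threshold
simDT <- data.table::data.table(hour = rep(0:23, times = 365)) # a year of hourly slots
simDT[, kW := 40 + 30 * sin((hour - 6) * pi / 12) + rnorm(.N, sd = 10)] # crude daily demand curve
threshold <- 45 # hypothetical logging threshold
simDT[, logged := kW > threshold]
simDT[, .(propLogged = mean(logged)), keyby = hour] # proportion of readings kept, by time of day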
Selecting the 'best' days
Here we use a wide form of the feeder data which has each feeder as a column.
We should have `r uniqueN(feederDT$feeder_ID)` feeders. We want to find days when all of these feeders have complete data.
The wide dataset has a count of NAs per row (dateTime) from which we infer how many feeders are reporting:
wDT <- drake::readd(wideData) # back from the drake
names(wDT)
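For reference, a minimal sketch of how such a per-row count could be derived, assuming the feeder readings are every column other than rDateTime and the pre-built count (the real nFeedersReporting is created upstream in the drake plan):
# illustrative cross-check of the per-dateTime feeder count
feederCols <- setdiff(names(wDT), c("rDateTime", "nFeedersReporting")) # assumed layout
wDT[, nFeedersCheck := rowSums(!is.na(.SD)), .SDcols = feederCols]
summary(wDT$nFeedersCheck)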
If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when all feeders have all data (since it would be the mean of all the '25's).
wDT <- addSeason(wDT, dateVar = "rDateTime", h = "N")
wDT[, rDoW := lubridate::wday(rDateTime)]
wDT[, rDate := lubridate::date(rDateTime)]
# how many days have all feeders sending data in all dateTimes?
aggDT <- wDT[, .(meanOK = mean(nFeedersReporting),
minOk = min(nFeedersReporting),
maxOk = max(nFeedersReporting),
sumOK = sum(nFeedersReporting) # will have a max of n feeders * 24 hours * 4 quarter hours
),
keyby = .(rDate, season)]
aggDT[, propExpected := sumOK/(uniqueN(feederDT$feeder_ID)*24*4)] # we expect 25*24*4
summary(aggDT)
message("How many days have 100%?")
n <- nrow(aggDT[propExpected == 1])
n
So, there are `r n` days with 100% data...
If we plot the mean then we will see which days get closest to having a full dataset.
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season, y = meanOK)) +
geom_point()
Re-plot as the % of expected observations, assuming we should have n feeders * 24 hours * 4 readings per hour (the shape will be the same). This also tells us that there is some reason why we get fluctuations in the number of data points per hour after 2003.
For fun we then print 4 tables of the 'best' days per season.
ggplot2::ggplot(aggDT, aes(x = rDate, colour = season,
y = 100*propExpected)) +
geom_point() +
labs(y = "%")
aggDT[, rDoW := lubridate::wday(rDate, label = TRUE)]
h <- head(aggDT[season == "Spring"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Spring days overall",
digits = 3)
h <- head(aggDT[season == "Summer"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Summer days overall",
digits = 3)
h <- head(aggDT[season == "Autumn"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Autumn days overall",
digits = 3)
h <- head(aggDT[season == "Winter"][order(-propExpected)])
kableExtra::kable(h, caption = "Best Winter days overall",
digits = 3)
Summary
So there are no days with 100% data. We need a different approach.
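One option, sketched here rather than implemented, is to return to the day-of-the-week/season sampling frame from the intro and take the most complete day in each cell instead of requiring 100%:
# illustrative: pick the single most complete day within each season x day-of-week cell
bestDaysDT <- aggDT[order(-propExpected), .SD[1], keyby = .(season, rDoW)]
kableExtra::kable(bestDaysDT, caption = "Most complete day per season & day of the week",
                  digits = 3)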
Runtime
t <- proc.time() - startTime
elapsed <- t[[3]]
Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` minutes) using knitr in RStudio with `r R.version.string` running on `r R.version$platform`.
R environment
R packages used
- base R [@baseR]
- bookdown [@bookdown]
- data.table [@data.table]
- ggplot2 [@ggplot2]
- kableExtra [@kableExtra]
- knitr [@knitr]
- lubridate [@lubridate]
- rmarkdown [@rmarkdown]
- skimr [@skimr]
Session info
sessionInfo()
The raw data cleaning code
(c) Mikey Harper :-)
Starts here:
Scripts used to clean and merge substation data.
knitr::opts_chunk$set(echo = TRUE)
library(here)
library(tidyverse)
Input files
Analysis will first look at the primary data. There are different types of files which refer to different parameters. Different search terms are used to extract these:
# Find files with AMPS. Exclude files which contain DI~CO
files_AMPS <- list.files("../Primary", recursive = T, pattern = "~AMPS", full.names = T) %>%
.[!stringr::str_detect (., "DI~CO")]
files_AMPS
Process Amps
# Show a sample
fileSelect <- files_AMPS[4]
head(read_csv(fileSelect, skip = 3))
Again a function is used to do all the processing on the input CSVs. This is slightly amended from the `processkV` function.
processAMPS <- function(filePath, databaseCon = con){
message("Processing ", filePath)
# 1st Level
dirName_1 <- filePath %>%
dirname() %>%
basename
# 2nd Level
dirName_2 <- filePath %>%
dirname() %>%
dirname() %>%
basename
if (dirName_2 == "Primary"){
dirName_2 <- dirName_1
dirName_1 <- ""
}
# Load the CSV. Some tab-separated files were saved with a .csv extension, which confuses the search. Therefore, if the data is loaded incorrectly (i.e. it only has a single column), the code will try to load it as a TSV.
dataLoaded <- suppressWarnings(read_csv(filePath, skip = 3, col_types = cols(Value = col_number())))
if(ncol(dataLoaded) == 1){
dataLoaded <- suppressWarnings(read_tsv(filePath, skip = 3, col_types = cols()))
}
# Reformat data
dataLoaded <-
dataLoaded %>%
mutate_at(vars(Time), function(x){gsub('[^ -~]', '', x)}) %>% # Remove invalid UTF characters
mutate(Time = lubridate::dmy_hms(Time),
Time = lubridate::floor_date(Time, unit = "15 minutes")) %>%
group_by(Time) %>%
summarise(Value = mean(Value, na.rm = T)) %>%
mutate(region = dirName_2,
sub_region = dirName_1
)
# There are some datasets which contain no values, which can cause errors when running.
# If this happens, return NULL
if(is.character(dataLoaded$Value)) return(NULL)
return(dataLoaded)
}
Run the function over all of the AMPS files:
Amps <- purrr::map_df(files_AMPS, processAMPS)
Amps_stats <- Amps %>%
group_by(region) %>%
summarise(mean = (mean(Value, na.rm = T)),
n = n(),
sd = sd(Value, na.rm = T),
var = var(Value, na.rm = T))
Amps_stats
readr::write_csv(Amps_stats, path = "../Amps_stats.csv")
ggplot(Amps) +
geom_point(aes(x = Time, y = Value, colour = region)) +
facet_grid(region~., scales = "free_y") +
labs(title = "Cleaned data for Amps")
Processing data
readr::write_csv(Amps, path = "amps_all_substations.csv")
library(odbc)
library(DBI)
# Connect to an RSQLite database stored in the file amps.sqlite
con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")
dbListTables(con)
dbWriteTable(con, "amps", Amps, overwrite = TRUE) # overwrite any existing table so the chunk can be re-run
dbListTables(con)
Querying the data
con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")
library(dbplyr)
Amps_db <- tbl(con, "amps")
Amps_db %>%
group_by(region) %>%
summarise(mean = (mean(Value, na.rm = T)),
n = n(),
sd = sd(Value, na.rm = T),
var = var(Value, na.rm = T))
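Note that dbplyr builds this query lazily. Assuming the Amps_db table created above, show_query() reveals the generated SQL and collect() executes it and pulls the results back into R:
# inspect the SQL that dbplyr generates, then bring the summary back into R
Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T), n = n()) %>%
  show_query()
Amps_summary <- Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T), n = n()) %>%
  collect() # runs the query and returns a tibble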
Round to Nearest N minutes
processAMPS_5mins <- function(filePath){
message("Processing ", filePath)
# 1st Level
dirName_1 <- filePath %>%
dirname() %>%
basename
# 2nd Level
dirName_2 <- filePath %>%
dirname() %>%
dirname() %>%
basename
if (dirName_2 == "Primary"){
dirName_2 <- dirName_1
dirName_1 <- ""
}
# Load the CSV. Some tab-separated files were saved with a .csv extension, which confuses the search. Therefore, if the data is loaded incorrectly (i.e. it only has a single column), the code will try to load it as a TSV.
dataLoaded <- suppressWarnings(read_csv(filePath, skip = 3, col_types = cols()))
if(ncol(dataLoaded) == 1){
dataLoaded <- suppressWarnings(read_tsv(filePath, skip = 3, col_types = cols()))
}
# Reformat data
dataLoaded <-
dataLoaded %>%
mutate_at(vars(Time), function(x){gsub('[^ -~]', '', x)}) %>% # Remove invalid UTF characters
mutate(Time = lubridate::dmy_hms(Time),
region = dirName_2,
sub_region = dirName_1,
code = paste(region, sub_region, sep = "_")
) %>%
mutate(Time = lubridate::floor_date(Time, unit = "5 minutes")) %>%
group_by(Time, region, code) %>%
summarise(Value = mean(Value)) %>%
arrange(Time)
# There are some datasets which contain no values, which can cause errors when running.
# If this happens, return NULL
if(is.character(dataLoaded$Value)) return(NULL)
# Returns the loaded and cleaned dataframe
return(dataLoaded)
}
Nearest 5 minutes:
Amps_5mins <- purrr::map_df(files_AMPS[1:4], processAMPS_5mins)
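Note that the function above floors timestamps with lubridate::floor_date. If rounding to the nearest 5 minutes is really what is wanted, lubridate::round_date behaves slightly differently, as this small example shows:
# floor_date vs round_date for a timestamp 3 minutes past a 5 minute boundary
x <- lubridate::ymd_hms("2003-06-01 10:03:00")
lubridate::floor_date(x, unit = "5 minutes") # 10:00:00
lubridate::round_date(x, unit = "5 minutes") # 10:05:00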