Commit b02acfb1 authored by Ben Anderson

initial commit with Ellis' feeder cleaning data example in /rmd

parent 55451527
^.*\.Rproj$
^\.Rproj\.user$
.Rproj.user
.Rhistory
.RData
.Ruserdata
Be nice:
* Please **do not** make changes directly to the repo master. Instead:
  * fork the repo
  * work in a branch of your fork
  * make a pull request to merge your branch to the master
Don't know how to do this? [Read our guide](https://git.soton.ac.uk/SERG/workflow/howTo/gitBranches.md).
Need more info? Read the excellent [contributing to opensource code](https://opensource.guide/how-to-contribute/) guidance.
Package: dataCleaning
Type: Package
Title: A place to store hints, tips & examples of data cleaning with R #ymmv
Version: 0.1.0
Authors@R:
c(person(given = "Ben",
family = "Anderson",
role = c("aut", "cre"),
email = "b.anderson@soton.ac.uk")
)
Description: A repo that collects various guidance and resources for people trying to clean data with R, especially large energy monitoring datasets with dateTimes.
License: MIT + file LICENSE
Imports:
here
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.0
MIT License
Copyright (c) 2020 University of Southampton
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Generated by roxygen2: do not edit by hand
export(addSeason)
export(loadLibraries)
export(setup)
export(tidyNum)
importFrom(utils,install.packages)
#' Add season to a data.table depending on hemisphere
#'
#' \code{addSeason} returns a dt with hemisphere season (Winter, Spring, Summer, Autumn) added - assumes temperate latitudes
#'
#' @param dt the data table
#' @param dateVar the column in the dt which is a date that lubridate::month() will work on
#' @param h hemisphere: North (N) or South (S)?
#'
#' @author Ben Anderson, \email{b.anderson@@soton.ac.uk}
#' @export
#' @family utils
#'
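#' @examples
#' \dontrun{
#' # illustrative sketch, assuming the data.table package is installed:
#' dt <- data.table::data.table(obsDate = as.Date(c("2020-01-15", "2020-07-15")))
#' addSeason(dt, dateVar = "obsDate", h = "N") # Jan -> Winter, Jul -> Summer
#' }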
addSeason <- function(dt, dateVar, h){
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
if(h == "S"){
dt <- dt[, season := "Summer"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Autumn"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Winter"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Spring"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
if(h == "N"){
dt <- dt[, season := "Winter"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Spring"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Summer"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Autumn"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
dt$tmpM <- NULL
return(dt)
}
#' Installs and loads packages
#'
#' \code{loadLibraries} checks whether each package is already installed,
#' installing any which are missing. All the packages are then loaded.
#'
#' Especially useful when running on virtual machines where package installation
#' is not persistent (like the UoS sve). It will fail if the packages need to be
#' installed but there is no internet access.
#'
#' NB: in R 'require' tries to load a package but throws a warning & moves on if it's not there,
#' whereas 'library' throws an error if it can't load the package. Hence 'loadLibraries' installs
#' anything missing before loading. See
#' https://stackoverflow.com/questions/5595512/what-is-the-difference-between-require-and-library
#' @param ... One or more package names (as character strings)
#' @param repo The repository to install missing packages from. Defaults to "https://cran.rstudio.com"
#' @importFrom utils install.packages
#'
#' @author Luke Blunden, \email{lsb@@soton.ac.uk} (original)
#' @author Michael Harper \email{m.harper@@soton.ac.uk} (revised version)
#' @export
#'
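#' @examples
#' \dontrun{
#' # illustrative: installs any of these that are missing, then loads them
#' loadLibraries("data.table", "ggplot2")
#' }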
loadLibraries <- function(..., repo = "https://cran.rstudio.com"){
packages <- c(...)
# Identify any packages which are not yet installed
newPackages <- packages[!(packages %in% utils::installed.packages()[,1])]
# Install them if required, using the given repo
if (length(newPackages)) {utils::install.packages(newPackages, dependencies = TRUE, repos = repo)}
# Load packages
sapply(packages, require, character.only = TRUE)
}
#' Sets package parameters etc
#'
#' \code{setup} loads global package parameters and settings via env.R
#'
#'
#' @author Ben Anderson, \email{b.anderson@@soton.ac.uk}
#' @export
#'
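#' @examples
#' \dontrun{
#' # assumes an env.R file exists in the project root (as in this repo)
#' setup()
#' }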
setup <- function() {
source(paste0(here::here(), "/env.R"))
}
#' Tidy long numbers
#'
#' \code{tidyNum} reformats long numbers to include commas and prevents scientific notation
#'
#' @param number an input number or list
#'
#' @author Ben Anderson, \email{b.anderson@@soton.ac.uk}
#' @export
#'
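#' @examples
#' tidyNum(1234567) # "1,234,567"
#' tidyNum(1e6) # "1,000,000" rather than "1e+06"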
tidyNum <- function(number) {
format(number, big.mark=",", scientific=FALSE)
}
# dataCleaning
A place to store hints, tips and examples for data cleaning. We use a lot of very dirty data.
This repo is an R package. If you clone it you can build it and use the functions.
We'd love your contributions - feel free to:
* fork & go
* make a new branch in your fork
* make some improvements
* send us a pull request
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace,vignette
# parameters widely re-used across the repo
# here's why a project level .Rprofile should not be used for this job https://support.rstudio.com/hc/en-us/articles/360047157094-Managing-R-with-Rprofile-Renviron-Rprofile-site-Renviron-site-rsession-conf-and-repos-conf
repoParams <- list() # params holder as a list. Sooooo much easier with tab complete
listFiles <- FALSE
# Location ----
# attempt to guess the platform & user
repoParams$info <- Sys.info()
repoParams$sysname <- repoParams$info[["sysname"]]
repoParams$nodename <- repoParams$info[["nodename"]]
repoParams$login <- repoParams$info[["login"]]
repoParams$user <- repoParams$info[["user"]]
# use this guess to set a data path
# default
dPath <- "** unkown - find this line in env.R and fix things! **"
if (repoParams$nodename == "srv02405") {
# we're on the UoS RStudio server so we can set the data path to J: (as mounted)
dPath <- path.expand("/mnt/SERG_data/")
listFiles <- TRUE # we know where the data is
}
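# add your own machine by copying the pattern above, e.g. (hypothetical values):
# if (repoParams$nodename == "my-laptop") {
#   dPath <- path.expand("~/data/")
#   listFiles <- TRUE
# }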
message("You're ", repoParams$user, " using " , repoParams$sysname, " on ", repoParams$nodename)
message("Default dPath has been set to: ", dPath)
if (listFiles) { # optional
message("and these are the files/folders in that dPath:")
try(print(list.files(dPath))) # print() because source() does not auto-print; try() in case the path breaks
}
message("Hopefully that's what you expected...")
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/addSeason.R
\name{addSeason}
\alias{addSeason}
\title{Add season to a data.table depending on hemisphere}
\usage{
addSeason(dt, dateVar, h)
}
\arguments{
\item{dt}{the data table}
\item{dateVar}{the column in the dt which is a date that lubridate::month() will work on}
\item{h}{hemisphere: North (N) or South (S)?}
}
\description{
\code{addSeason} returns a dt with hemisphere season (Winter, Spring, Summer, Autumn) added - assumes temperate latitudes
}
\author{
Ben Anderson, \email{b.anderson@soton.ac.uk}
}
\concept{utils}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/loadLibraries.R
\name{loadLibraries}
\alias{loadLibraries}
\title{Installs and loads packages}
\usage{
loadLibraries(..., repo = "https://cran.rstudio.com")
}
\arguments{
\item{...}{One or more package names (as character strings)}
\item{repo}{The repository to install missing packages from. Defaults to "https://cran.rstudio.com"}
}
\description{
\code{loadLibraries} checks whether each package is already installed,
installing any which are missing. All the packages are then loaded.
}
\details{
Especially useful when running on virtual machines where package installation
is not persistent (like the UoS sve). It will fail if the packages need to be
installed but there is no internet access.
NB: in R 'require' tries to load a package but throws a warning & moves on if it's not there,
whereas 'library' throws an error if it can't load the package. Hence 'loadLibraries' installs
anything missing before loading. See
https://stackoverflow.com/questions/5595512/what-is-the-difference-between-require-and-library
}
\author{
Luke Blunden, \email{lsb@soton.ac.uk} (original)
Michael Harper \email{m.harper@soton.ac.uk} (revised version)
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/setup.R
\name{setup}
\alias{setup}
\title{Sets package parameters etc}
\usage{
setup()
}
\description{
\code{setup} loads global package parameters and settings via env.R
}
\author{
Ben Anderson, \email{b.anderson@soton.ac.uk}
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tidyNum.R
\name{tidyNum}
\alias{tidyNum}
\title{Tidy long numbers}
\usage{
tidyNum(number)
}
\arguments{
\item{number}{an input number or list}
}
\description{
\code{tidyNum} reformats long numbers to include commas and prevents scientific notation
}
\author{
Ben Anderson, \email{b.anderson@soton.ac.uk}
}
---
title: 'Cleaning Electricity Substation Feeder Data'
subtitle: 'Code and notes'
author: "Ben Anderson (b.anderson@soton.ac.uk), [SERG, Energy & Climate Change](http://www.energy.soton.ac.uk/), University of Southampton"
date: 'Last run at: `r Sys.time()`'
output:
bookdown::pdf_document2:
fig_caption: yes
keep_tex: yes
number_sections: yes
toc: yes
toc_depth: 2
bookdown::word_document2:
fig_caption: yes
toc: yes
toc_depth: 2
bookdown::html_document2:
code_folding: hide
fig_caption: yes
number_sections: yes
self_contained: yes
toc: yes
toc_depth: 3
toc_float: yes
always_allow_html: yes
bibliography: '`r path.expand("~/bibliography.bib")`'
---
```{r setup}
# Knitr setup ----
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE) # for final tidy run
knitr::opts_chunk$set(message = FALSE) # for final tidy run
# Set start time ----
startTime <- proc.time()
# Libraries ----
library(dataCleaning)
dataCleaning::setup()
reqLibs <- c("data.table", # cos we like data.table (you may not in which case dplyr is fine :-)
"lubridate", # for data & time manip
"hms", # for hh:mm:ss if we need it
"ggplot2", # fancy plots
"kableExtra", # for extra kable
"skimr")# for skim (data description)
dataCleaning::loadLibraries(reqLibs) # Use Luke's function to load them
# Parameters ----
dFile <- "~/Dropbox/Ben_IOW_SS.csv" # edit for your setup
# Functions ----
# put more general ones that could be useful to everyone in /R so they are built into the package.
# put functions relevant to this analysis here
# add a season - also provided by the package as dataCleaning::addSeason; repeated here so this .Rmd is self-contained
addSeason <- function(dt, dateVar, h){
dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
if(h == "S"){
dt <- dt[, season := "Summer"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Autumn"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Winter"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Spring"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
if(h == "N"){
dt <- dt[, season := "Winter"] # easiest to set the default to be the one that bridges years
dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Spring"]
dt <- dt[tmpM >= 6 & tmpM <= 8 , season := "Summer"]
dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Autumn"]
# re-order to make sense
dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
}
dt$tmpM <- NULL
return(dt)
}
```
# Intro
We have some electricity substation feeder data that has been cleaned to give mean kW per 15 minutes.
There seem to be some NA kW values and a lot of missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame. If we can't do that we may have to resort to seasonal mean kW profiles by hour & day of the week...
Code used to generate this report: https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd
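As a minimal sketch of that selection (not run here - it uses the `feederDT` columns created in the data prep below), we could count observations per feeder-day and keep the most complete day within each frame:
```{r bestDaySketch, eval=FALSE}
# Illustrative only (eval = FALSE): pick the most complete day per
# feeder/year/season/day-of-week frame. Column names assume the data
# prep below; 96 = 4 obs/hour x 24 hours at 15-minute resolution.
obsPerDay <- feederDT[!is.na(kW),
                      .(nObs = .N),
                      keyby = .(sub_region, rYear, season, rDoW, rDate)]
bestDays <- obsPerDay[obsPerDay[, .I[which.max(nObs)],
                                by = .(sub_region, rYear, season, rDoW)]$V1]
head(bestDays) # nObs close to 96 indicates a near-complete day
```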
# Data prep
## Load data
Loading from `r dFile`...
```{r loadData}
feederDT <- data.table::fread(dFile) # fast
feederDT[, V1 := NULL] # drop
head(feederDT)
```
Do some data prep
```{r dataPrep}
feederDT[, rDateTime := lubridate::as_datetime(DateTime)]
feederDT[, rTime := hms::as_hms(rDateTime)]
feederDT[, rDate := as.Date(rDateTime)]
feederDT[, rYear := lubridate::year(rDate)]
feederDT[, rDoW := lubridate::wday(rDate, label = TRUE)]
feederDT[, kW := Value] # for clarity
feederDT[, Value := NULL] # drop
feederDT <- addSeason(feederDT,
dateVar = "rDateTime",
h = "N")
#skimr::skim(feederDT) doesn't play nicely with latex
st <- summary(feederDT[,.(rYear, rDateTime,
sub_region, kW, rDoW, season)])
kableExtra::kable(st,
caption = "Summary of loaded and processed data") %>%
kable_styling()
```
Do a sense check by season & feeder.
```{r senseCheck}
t <- feederDT[, .(mean_kW = mean(kW),
nObs = .N), keyby = .(rYear, sub_region,season)]
NAt <- feederDT[is.na(kW), .(nObs = .N), keyby = .(rYear, sub_region,season)]
kableExtra::kable(t,
caption = "Sense check on raw data") %>%
kable_styling()
```
Histograms of kW by season and feeder...
```{r histo}
ggplot2::ggplot(feederDT, aes(x = kW)) +
geom_histogram(binwidth = 5) +
facet_grid(sub_region ~ season)
```
Is that what we expect to see?
Try looking at the distribution of kW by day of the week and time by feeder and season. This is all the data - so it may take some time to plot if you have a lot of it. The smoothed lines are per season/feeder/day, generated by ggplot2 using mgcv::gam.
```{r smoothedProfiles}
ggplot2::ggplot(feederDT, aes(x = rTime, y = kW,
colour = season)) +
geom_point(size = 0.2, stroke = 1, shape = 16) +
geom_smooth() + # uses mgcv::gam unless n < 1000
facet_grid(sub_region ~ rDoW)
```
There are clearly some outliers but the shapes look right... **What is happening with E2L5??**
Try aggregated demand profiles of mean kW by season and feeder and day of the week...
```{r kwProfiles}
plotDT <- feederDT[, .(meankW = mean(kW)), keyby = .(rTime, season, sub_region, rDoW)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = season)) +
geom_line() +
facet_grid(sub_region ~ rDoW)
```
Is that what we expect?
# Test for missing
Can we see missing data?
```{r missingVis}
ggplot2::ggplot(feederDT, aes(x = rDate, y = rTime, fill = kW)) +
geom_tile() +
facet_grid(sub_region ~ .) +
scale_fill_viridis_c() +
labs(caption = paste0("All kW values by feeder, time (y) and date (x)",
"\nGrey (blank) regions = completely missing dateTimes")
)
```
oooh. That's not good. There shouldn't be any blank dateTimes.
Try aggregating to hours...
First the mean kw:
```{r aggVis}
feederDT[, rHour := lubridate::hour(rDateTime)] # add hour to the data
plotDT <- feederDT[, .(mean_kW = mean(kW), # aggregate
nObs = .N),
keyby = .(rHour, rDate, sub_region)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour,
fill = mean_kW)) + #plot mean kW
geom_tile() +
scale_fill_viridis_c() +
facet_grid(sub_region ~ .) +
labs(caption = "Mean kW per hour")
```
Now the number of observations per hour. The cleaned data is at 15 minutes so we should expect up to 4 observations per hour. How close do we get?
```{r aggVisN}
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = nObs)) +
geom_tile() +
scale_fill_viridis_c() +
facet_grid(sub_region ~ .) +
labs(caption = paste0("Number of obs per hour (should be 4)",
"\nGrey (blank) regions = completely missing dateTimes")
)
```
Well one of them hits the 4 per hour most of the time. But what about the evenings? And what about the other feeders???
There is a suspicion that as mean kW goes up, so does the number of observations per hour...
```{r aggVisBox}
ggplot2::ggplot(plotDT, aes(x = nObs, y = mean_kW, group = nObs)) +
geom_boxplot() +
facet_grid(.~sub_region) +
labs(caption = "mean kW per hour by number of observations contributing")
```
Hmm. Maybe...
# Which days have the 'least' missing?
This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then seeing if the dateTimes are contiguous.
```{r testDateTimes}
dateTimesDT <- feederDT[, .(nFeeders = uniqueN(sub_region)),
keyby = .(rDateTime, rTime, rDate, season)] # keep season
dateTimesDT[, dtDiff := rDateTime - shift(rDateTime)] # should be 15 mins
summary(dateTimesDT)
```
Let's see how many unique feeders we have per dateTime. Surely we have at least one feeder sending data in each 15-minute period?
```{r tileFeeders}
ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
geom_tile() +
scale_fill_viridis_c() +
labs(caption = "Number of unique feeders in each dateTime")
```
No. There are clearly some dateTimes where we have no data _at all_!
Are there time of day patterns? It looks like it...
```{r missingProfiles}
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders)), keyby = .(rTime, season)]
ggplot2::ggplot(plotDT, aes(y = meanN, x = rTime, colour = season)) +
geom_line() +
labs(y = "Mean n feeders reporting",
caption = "Mean n feeders by time of day")
```
Oh yes. Why?
# Summary
So...
I find it distinctly odd that:
* we appear to have the most feeders reporting data at 'peak' times
* we have a lot of missing dateTimes between 00:30 and 05:00
If the monitors were set to only collect data when the power (or Wh in a given time frame) was above a given threshold then it would look like this... That wouldn't happen... would it?
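As a purely hypothetical check (not run here), we could mimic threshold-triggered monitoring by dropping observed readings below an arbitrary cut-off and re-drawing the feeder-count profile; a similar shape to the plot above would support the suspicion. The threshold value below is illustrative, not taken from the data.
```{r thresholdSketch, eval=FALSE}
# Hypothetical (eval = FALSE): simulate monitors that only report when
# kW exceeds a threshold, then rebuild the mean-feeders-by-time plot.
threshold <- 5 # kW - an arbitrary illustrative value
simDT <- feederDT[kW >= threshold,
                  .(nFeeders = uniqueN(sub_region)),
                  keyby = .(rDateTime, rTime, season)]
plotSimDT <- simDT[, .(meanN = mean(nFeeders)), keyby = .(rTime, season)]
ggplot2::ggplot(plotSimDT, aes(y = meanN, x = rTime, colour = season)) +
  geom_line() +
  labs(y = "Mean n feeders reporting (simulated)",
       caption = "Feeder counts after simulated threshold-triggered reporting")
```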
# Runtime
```{r checkRuntime, include=FALSE}
t <- proc.time() - startTime
elapsed <- t[[3]]
```
Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` minutes) using [knitr](https://cran.r-project.org/package=knitr) in [RStudio](http://www.rstudio.com) with `r R.version.string` running on `r R.version$platform`.
# R environment
## R packages used
* base R [@baseR]
* bookdown [@bookdown]
* data.table [@data.table]
* ggplot2 [@ggplot2]
* kableExtra [@kableExtra]
* knitr [@knitr]
* lubridate [@lubridate]
* rmarkdown [@rmarkdown]
* skimr [@skimr]
## Session info
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
# The data cleaning code
(c) Mikey Harper :-)