Commit 2aec6236 authored by Ben Anderson

moved the data cleaning stuff to dataCleaning

parent b2721a7f
---
title: 'Cleaning Electricity Substation Feeder Data'
subtitle: 'Code and notes'
author: "Ben Anderson (b.anderson@soton.ac.uk), [SERG, Energy & Climate Change](http://www.energy.soton.ac.uk/), University of Southampton"
date: 'Last run at: `r Sys.time()`'
output:
  bookdown::pdf_document2:
    fig_caption: yes
    keep_tex: yes
    number_sections: yes
    toc: yes
    toc_depth: 2
  bookdown::word_document2:
    fig_caption: yes
    toc: yes
    toc_depth: 2
  bookdown::html_document2:
    code_folding: hide
    fig_caption: yes
    number_sections: yes
    self_contained: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
always_allow_html: yes
bibliography: '`r path.expand("~/bibliography.bib")`'
---
```{r setup}
# Knitr setup ----
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE) # for final tidy run
knitr::opts_chunk$set(message = FALSE) # for final tidy run
# Set start time ----
startTime <- proc.time()
# Libraries ----
library(data.table) # cos we like data.table (you may not in which case dplyr is fine :-)
library(lubridate) # for date & time manipulation
library(hms) # for hh:mm:ss if we need it
library(ggplot2) # fancy plots
library(kableExtra) # for extra kable
library(skimr) # for skim (data description)
# Parameters ----
dFile <- "~/Dropbox/Ben_IOW_SS.csv" # edit for your set up
# Functions ----
# set season
addSeason <- function(dt, dateVar, h){
  # h = hemisphere (N or S)
  # dateVar is a date that lubridate can convert to months
  dt <- dt[, tmpM := lubridate::month(get(dateVar))] # sets 1 (Jan) - 12 (Dec). May already exist but we can't rely on it
  if(h == "S"){
    dt <- dt[, season := "Summer"] # easiest to set the default to be the one that bridges years
    dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Autumn"]
    dt <- dt[tmpM >= 6 & tmpM <= 8, season := "Winter"]
    dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Spring"]
    # re-order to make sense
    dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
  }
  if(h == "N"){
    dt <- dt[, season := "Winter"] # easiest to set the default to be the one that bridges years
    dt <- dt[tmpM >= 3 & tmpM <= 5, season := "Spring"]
    dt <- dt[tmpM >= 6 & tmpM <= 8, season := "Summer"]
    dt <- dt[tmpM >= 9 & tmpM <= 11, season := "Autumn"]
    # re-order to make sense
    dt <- dt[, season := factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))]
  }
  dt$tmpM <- NULL
  return(dt)
}
```
# Intro
We have some electricity substation feeder data. There seem to be NAs and missing time stamps. We want to select the 'best' (i.e. most complete) days within a day-of-the-week/season/year sampling frame.
Code used to generate this report: https://git.soton.ac.uk/ba1e12/spatialec/-/blob/master/isleOfWight/cleaningFeederData.Rmd
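As a rough sketch of that selection logic (not run here; it assumes the processed 15-minute data.table `dt` built in the next section, and the names `dayScores`, `completeness` and `bestDays` are illustrative, not part of the original pipeline):

```{r bestDaysSketch, eval=FALSE}
# Sketch: 15-minute data implies 96 expected observations per feeder-day.
# Score each feeder-day by completeness...
dayScores <- dt[!is.na(kW),
                .(nObs = .N),
                keyby = .(sub_region, rYear, season, rDoW, rDate)]
dayScores[, completeness := nObs / 96]
# ...then keep the most complete day within each year/season/day-of-week frame
bestDays <- dayScores[order(-completeness),
                      .SD[1],
                      keyby = .(sub_region, rYear, season, rDoW)]
```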
# Data prep
## Load data
Loading from `r dFile`...
```{r loadData}
dt <- data.table::fread(dFile)
dt[, V1 := NULL] # drop the unnamed index column
head(dt)
```
Do some data prep
```{r dataSummary}
dt[, rDateTime := lubridate::as_datetime(DateTime)]
dt[, rTime := hms::as_hms(rDateTime)]
dt[, rDate := as.Date(rDateTime)]
dt[, rYear := lubridate::year(rDate)]
dt[, rDoW := lubridate::wday(rDate, label = TRUE)]
dt[, kW := Value] # rename for clarity
dt[, Value := NULL] # drop the original
dt <- addSeason(dt, dateVar = "rDate", h = "N")
# skimr::skim(dt) doesn't play nicely with latex so use summary()
st <- summary(dt[, .(rYear, rDateTime, sub_region, kW, rDoW, season)])
kableExtra::kable(st,
                  caption = "Summary of loaded and processed data") %>%
  kable_styling()
```
Do a sense check by season & feeder.
```{r senseCheck}
t <- dt[, .(mean_kW = mean(kW, na.rm = TRUE), # kW includes NAs so na.rm is needed
            nObs = .N), keyby = .(rYear, sub_region, season)]
NAt <- dt[is.na(kW), .(nObs = .N), keyby = .(rYear, sub_region, season)]
kableExtra::kable(t,
                  caption = "Sense check on raw data") %>%
  kable_styling()
kableExtra::kable(NAt,
                  caption = "Count of NA kW observations by year, feeder & season") %>%
  kable_styling()
```
Histograms of kW by season and feeder...
```{r histo}
ggplot2::ggplot(dt, aes(x = kW)) +
  geom_histogram(binwidth = 5) +
  facet_grid(sub_region ~ season)
```
Is that what we expect to see?
Demand profiles of kW by season and feeder and day of the week...
```{r kwProfiles}
plotDT <- dt[, .(meankW = mean(kW, na.rm = TRUE)), # NAs would otherwise blank whole groups
             keyby = .(rTime, season, sub_region, rDoW)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = season)) +
  geom_line() +
  facet_grid(sub_region ~ rDoW)
```
Is that what we expect?
# Test for missing
Can we see missing data?
```{r missingVis}
ggplot2::ggplot(dt, aes(x = rDate, y = rTime, fill = kW)) +
  geom_tile() +
  facet_grid(sub_region ~ .) +
  scale_fill_viridis_c() +
  labs(caption = paste0("All kW values by feeder, time (y) and date (x)",
                        "\nGrey (blank) regions = completely missing dateTimes")
  )
```
oooh. That's not good.
Try aggregating to hours...
First the mean kW:
```{r aggVis}
dt[, rHour := lubridate::hour(rDateTime)]
plotDT <- dt[, .(mean_kW = mean(kW, na.rm = TRUE),
                 nObs = .N), keyby = .(rHour, rDate, sub_region)]
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = mean_kW)) +
  geom_tile() +
  scale_fill_viridis_c() +
  facet_grid(sub_region ~ .) +
  labs(caption = "Mean kW per hour")
```
Now the number of observations per hour. The cleaned data is at 15 minutes so we should expect up to 4 observations per hour. How close do we get?
```{r aggVisN}
ggplot2::ggplot(plotDT, aes(x = rDate, y = rHour, fill = nObs)) +
  geom_tile() +
  scale_fill_viridis_c() +
  facet_grid(sub_region ~ .) +
  labs(caption = paste0("Number of obs per hour (should be 4)",
                        "\nGrey (blank) regions = completely missing dateTimes")
  )
```
Well one of them hits the 4 per hour most of the time. But what about the evenings? And what about the other feeders???
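To put a number on that (a quick sketch, not run; it re-uses the hourly `plotDT` built above):

```{r completeHoursSketch, eval=FALSE}
# Sketch: proportion of feeder-hours containing the full 4 x 15-minute observations
plotDT[, .(pComplete = mean(nObs == 4),
           nHours = .N), keyby = sub_region]
```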
There is a suspicion that as mean kW goes up, so does the number of observations per hour...
```{r aggVisBox}
ggplot2::ggplot(plotDT, aes(x = nObs, y = mean_kW, group = nObs)) +
  geom_boxplot() +
  facet_grid(. ~ sub_region) +
  labs(caption = "mean kW per hour by number of observations contributing")
```
Hmm. Maybe...
# Which days have the 'least' missing?
This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then seeing if the dateTimes are contiguous.
```{r testDateTimes}
dateTimesDT <- dt[, .(nFeeders = uniqueN(sub_region)),
                  keyby = .(rDateTime, rTime, rDate, season)] # keep season
dateTimesDT[, dtDiff := rDateTime - shift(rDateTime)] # should be 15 mins
summary(dateTimesDT)
```
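If the dateTimes were contiguous, `dtDiff` would be 15 minutes everywhere, so tabulating the other values locates the gaps (a sketch, not run):

```{r gapTableSketch, eval=FALSE}
# Sketch: tabulate every step that is not the expected 15 minutes
dateTimesDT[!is.na(dtDiff) &
              as.numeric(dtDiff, units = "mins") != 15,
            .N, keyby = .(gapMins = as.numeric(dtDiff, units = "mins"))]
```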
Let's see how many unique feeders we have per dateTime. Surely we have at least one sending data in each 15-minute period?
```{r tileFeeders}
ggplot2::ggplot(dateTimesDT, aes(x = rDate, y = rTime, fill = nFeeders)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(caption = "Number of unique feeders in each dateTime")
```
No. We can clearly have some dateTimes where we have no data _at all_!
Are there time of day patterns? It looks like it...
```{r missingProfiles}
plotDT <- dateTimesDT[, .(meanN = mean(nFeeders)), keyby = .(rTime, season)]
ggplot2::ggplot(plotDT, aes(y = meanN, x = rTime, colour = season)) +
  geom_line() +
  labs(y = "Mean n feeders reporting",
       caption = "Mean n feeders by time of day")
```
Oh yes. Why?
# Summary
So...
I find it distinctly odd that:
* we appear to have the most feeders reporting data at 'peak' times
* we have a lot of missing dateTimes between 00:30 and 05:00
If the monitors were set to collect data only when the power (or Wh in a given time frame) was above a given threshold then it would look like this... That wouldn't happen... would it?
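One crude way to test that hypothesis (a sketch, not run): if reporting only happened above some cut-off we might expect a hard floor in the non-missing kW values for each feeder.

```{r thresholdSketch, eval=FALSE}
# Sketch: look for a suspicious floor in the reported (non-NA) kW values
dt[!is.na(kW), .(min_kW = min(kW),
                 p01_kW = quantile(kW, 0.01), # 1st percentile
                 nObs = .N), keyby = sub_region]
```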
# Runtime
```{r check runtime, include=FALSE}
t <- proc.time() - startTime
elapsed <- t[[3]]
```
Analysis completed in `r round(elapsed,2)` seconds ( `r round(elapsed/60,2)` minutes) using [knitr](https://cran.r-project.org/package=knitr) in [RStudio](http://www.rstudio.com) with `r R.version.string` running on `r R.version$platform`.
# R environment
## R packages used
* base R [@baseR]
* bookdown [@bookdown]
* data.table [@data.table]
* ggplot2 [@ggplot2]
* kableExtra [@kableExtra]
* knitr [@knitr]
* lubridate [@lubridate]
* rmarkdown [@rmarkdown]
* skimr [@skimr]
## Session info
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
# The data cleaning code
(c) Mikey Harper :-)
Starts here:
<hr>
Scripts used to clean and merge substation data.
```{r setupMH, include=FALSE, eval=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(here)
library(tidyverse)
```
## Input files
Analysis will first look at the primary data. There are different types of files which refer to different parameters. Different search terms are used to extract these:
```{r, eval=FALSE}
# Find files with AMPS. Exclude files which contain DI~CO
files_AMPS <- list.files("../Primary", recursive = T, pattern = "~AMPS", full.names = T) %>%
  .[!stringr::str_detect(., "DI~CO")]
files_AMPS
```
## Process Amps
```{r, eval=FALSE}
# Show a sample
fileSelect <- files_AMPS[4]
head(read_csv(fileSelect, skip = 3))
```
Again a function is used to do all the processing on the input CSVs. This is slightly amended from the `processkV` function.
```{r, eval=FALSE}
processAMPS <- function(filePath, databaseCon = con){
  message("Processing ", filePath)
  # 1st level directory name
  dirName_1 <- filePath %>%
    dirname() %>%
    basename()
  # 2nd level directory name
  dirName_2 <- filePath %>%
    dirname() %>%
    dirname() %>%
    basename()
  if (dirName_2 == "Primary"){
    dirName_2 <- dirName_1
    dirName_1 <- ""
  }
  # Load the CSV. There were some tab-separated files saved as CSVs, which
  # confuse the search. Therefore, if the data is loaded incorrectly (only
  # having a single column), the code will try to load it as a TSV instead.
  dataLoaded <- suppressWarnings(read_csv(filePath, skip = 3, col_types = cols(Value = col_number())))
  if(ncol(dataLoaded) == 1){
    dataLoaded <- suppressWarnings(read_tsv(filePath, skip = 3, col_types = cols()))
  }
  # Reformat data
  dataLoaded <-
    dataLoaded %>%
    mutate_at(vars(Time), function(x){gsub('[^ -~]', '', x)}) %>% # remove invalid UTF characters
    mutate(Time = lubridate::dmy_hms(Time),
           Time = lubridate::floor_date(Time, unit = "15 minutes")) %>%
    group_by(Time) %>%
    summarise(Value = mean(Value, na.rm = T)) %>%
    mutate(region = dirName_2,
           sub_region = dirName_1)
  # Some datasets contain no values, which can cause errors in running.
  # If this happens, return NULL
  if(is.character(dataLoaded$Value)) return(NULL)
  return(dataLoaded)
}
```
Now run the function over all the AMPS files:
```{r, eval=FALSE}
Amps <- purrr::map_df(files_AMPS, processAMPS)
```
```{r, eval=FALSE}
Amps_stats <- Amps %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T),
            n = n(),
            sd = sd(Value, na.rm = T),
            var = var(Value, na.rm = T))
Amps_stats
readr::write_csv(Amps_stats, path = "../Amps_stats.csv")
```
```{r, eval=FALSE}
ggplot(Amps) +
  geom_point(aes(x = Time, y = Value, colour = region)) +
  facet_grid(region ~ ., scales = "free_y") +
  labs(title = "Cleaned data for Amps")
```
## Processing data
```{r, eval=FALSE}
readr::write_csv(Amps, path = "amps_all_substations.csv")
```
```{r, eval=FALSE}
library(odbc)
library(DBI)
# Create a file-backed RSQLite database (not in-memory: it persists as amps.sqlite)
con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")
dbListTables(con)
dbWriteTable(con, "amps", Amps)
dbListTables(con)
```
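Since the queries below group by `region`, indexing that column should speed them up (a sketch; `DBI::dbExecute()` is the standard DBI call for statements that return no rows):

```{r, eval=FALSE}
# Index the region column to speed up the grouped queries below
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_amps_region ON amps (region)")
```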
## Querying the data
```{r, eval=FALSE}
con <- dbConnect(RSQLite::SQLite(), "amps.sqlite")
library(dbplyr)
Amps_db <- tbl(con, "amps")
Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T),
            n = n(),
            sd = sd(Value, na.rm = T),
            var = var(Value, na.rm = T))
```
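Note that dbplyr builds the query lazily: nothing runs in SQLite until the result is needed. `show_query()` shows the generated SQL and `collect()` pulls the result into R as a regular tibble (a sketch; `Amps_means` is just an illustrative name):

```{r, eval=FALSE}
Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T)) %>%
  show_query() # inspect the SQL translation
Amps_means <- Amps_db %>%
  group_by(region) %>%
  summarise(mean = mean(Value, na.rm = T)) %>%
  collect() # execute and bring the result into R
```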
## Round to Nearest N minutes
```{r, eval=FALSE}
processAMPS_5mins <- function(filePath){
  message("Processing ", filePath)
  # 1st level directory name
  dirName_1 <- filePath %>%
    dirname() %>%
    basename()
  # 2nd level directory name
  dirName_2 <- filePath %>%
    dirname() %>%
    dirname() %>%
    basename()
  if (dirName_2 == "Primary"){
    dirName_2 <- dirName_1
    dirName_1 <- ""
  }
  # Load the CSV. There were some tab-separated files saved as CSVs, which
  # confuse the search. Therefore, if the data is loaded incorrectly (only
  # having a single column), the code will try to load it as a TSV instead.
  dataLoaded <- suppressWarnings(read_csv(filePath, skip = 3, col_types = cols()))
  if(ncol(dataLoaded) == 1){
    dataLoaded <- suppressWarnings(read_tsv(filePath, skip = 3, col_types = cols()))
  }
  # Reformat data
  dataLoaded <-
    dataLoaded %>%
    mutate_at(vars(Time), function(x){gsub('[^ -~]', '', x)}) %>% # remove invalid UTF characters
    mutate(Time = lubridate::dmy_hms(Time),
           region = dirName_2,
           sub_region = dirName_1,
           code = paste(region, sub_region, sep = "_")) %>%
    mutate(Time = lubridate::floor_date(Time, unit = "5 minutes")) %>% # note: floors (rounds down)
    group_by(Time, region, code) %>%
    summarise(Value = mean(Value)) %>%
    arrange(Time)
  # Some datasets contain no values, which can cause errors in running.
  # If this happens, return NULL
  if(is.character(dataLoaded$Value)) return(NULL)
  # Return the loaded and cleaned dataframe
  return(dataLoaded)
}
```
Nearest 5 minutes (strictly, floored to 5-minute boundaries):
```{r, eval=FALSE}
Amps_5mins <- purrr::map_df(files_AMPS[1:4], processAMPS_5mins) # plain <- suffices here; <<- is not needed
```
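The two process functions differ only in the flooring interval, so the interval could be parameterised rather than hard-coded (a hypothetical refactor, not part of the original scripts; `floorToNMins` is an illustrative name):

```{r, eval=FALSE}
# Hypothetical: floor Time to any n-minute boundary and re-aggregate
floorToNMins <- function(df, n = 15){
  df %>%
    mutate(Time = lubridate::floor_date(Time, unit = paste(n, "minutes"))) %>%
    group_by(Time, region, sub_region) %>%
    summarise(Value = mean(Value, na.rm = TRUE))
}
```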
# References