
message("Unique data nrows: ", tidyNum(nrow(uniqDataDT)))

kableExtra::kable(t,
                  caption = "Sense check on raw data") %>%
  kable_styling()

```

```{r duplicateCheck}
message("So we have ", tidyNum(nrow(origDataDT) - nrow(uniqDataDT)), " duplicates...")

pc <- 100*((nrow(origDataDT) - nrow(uniqDataDT))/nrow(origDataDT))

message("That's ", round(pc,2), "%")
```

There were `r tidyNum(nrow(origDataDT) - nrow(uniqDataDT))` duplicates - that's `r round(pc,2)` % of the observations loaded.

So we remove the duplicates...

```{r removeDuplicates}
feederDT <- uniqDataDT[!is.na(rDateTime)] # use dt with no duplicates

origDataDT <- NULL # save memory
```

Histograms of kW by season and feeder...

```{r histo}
ggplot2::ggplot(feederDT, aes(x = kW)) +
  geom_histogram(binwidth = 5) +
  facet_grid(sub_region ~ season)
```

Is that what we expect to see?

Try looking at the distribution of kW by day of the week and time by feeder and season. This is all data - so it may take some time to plot if you have a lot of data. The smoothed lines are per season/feeder/day generated by ggplot using mgcv::gam.

```{r smoothedProfiles}

ggplot2::ggplot(feederDT, aes(x = rTime, y = kW,
                              colour = season)) +
  geom_point(size = 0.2, stroke = 1, shape = 16) +
  geom_smooth() + # uses mgcv::gam unless n < 1000
  facet_grid(sub_region ~ rDoW)

```

# Basic patterns

There are clearly some outliers but the shapes look right... **What is happening with E2L5??**

Try aggregated demand profiles of mean kW by season and feeder and day of the week... Remove the legend so we can see the plot.

```{r aggProfiles}
plotDT <- feederDT[, .(meankW = mean(kW, na.rm = TRUE)),
                   keyby = .(rTime, rDoW, season, feeder_ID)]
ggplot2::ggplot(plotDT, aes(x = rTime, y = meankW, colour = feeder_ID)) +
  geom_line() +
  theme(legend.position="none") + # remove legend so we can see the plot
  facet_grid(season ~ rDoW)
```

Is that what we expect?

# Test for missing

Can we see missing data?

Number of observations per feeder per day - gaps will be visible (totally missing days) as will low counts (partially missing days) - we would expect 24 * 4... Convert this to a % of expected...
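The counting step can be sketched with data.table on toy data (the `feeder_ID`, `rDateTime` and `rDate` names follow those used above; 24 * 4 = 96 expected observations per feeder per day at 15-minute resolution):

```r
library(data.table)

# toy data: 2 feeders, 2 days at 15-minute resolution,
# F2 missing 00:00 - 04:45 on day 1 (a partial day)
dts <- seq(as.POSIXct("2003-06-01 00:00", tz = "UTC"),
           by = "15 min", length.out = 96 * 2)
toyDT <- data.table(feeder_ID = rep(c("F1", "F2"), each = 96 * 2),
                    rDateTime = rep(dts, 2))
toyDT <- toyDT[!(feeder_ID == "F2" &
                   rDateTime < as.POSIXct("2003-06-01 05:00", tz = "UTC"))]

toyDT[, rDate := as.Date(rDateTime)]
obsDT <- toyDT[, .(nObs = .N), keyby = .(feeder_ID, rDate)]
obsDT[, pcExpected := 100 * nObs / (24 * 4)] # % of the expected 96 obs/day
obsDT
```

Low `pcExpected` values flag the partial days; days absent from `obsDT` entirely are the totally missing ones.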

```{r missingVis}
ggplot2::ggplot(feederDT, aes(x = rDate, y = rTime, fill = kW)) +
  geom_tile() +
  facet_wrap(~feeder_ID)
```

oooh. That's not good. There shouldn't be any blank dateTimes.

This is not good. There are both gaps (missing days) and partial days. **Lots** of partial days. Why is the data relatively good up to the end of 2003?

What does it look like if we aggregate across all feeders by time? There are `r uniqueN(feederDT$feeder_ID)` feeders so we should get this many at best. How close do we get?

This is quite tricky as we may have completely missing dateTimes. But we can test for this by counting the number of observations per dateTime and then seeing if the dateTimes are contiguous.
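The contiguity test can be sketched on toy data (`rDateTime` as above): count observations per dateTime, then compare the observed dateTimes against the full 15-minute sequence they should span.

```r
library(data.table)

# toy data: 2 feeders over one day, with 4 dateTimes completely missing
dts <- seq(as.POSIXct("2003-06-01 00:00", tz = "UTC"),
           by = "15 min", length.out = 96)
toyDT <- data.table(rDateTime = rep(dts[-(10:13)], times = 2)) # drop 4 slots entirely

# observations per dateTime
countDT <- toyDT[, .(nObs = .N), keyby = rDateTime]

# the dateTimes we *should* have if the series were contiguous
fullDts <- seq(min(countDT$rDateTime), max(countDT$rDateTime), by = "15 min")
missingDts <- fullDts[!fullDts %in% countDT$rDateTime]
length(missingDts) # completely missing dateTimes
```

`countDT` catches partially missing dateTimes (low `nObs`); `missingDts` catches the ones that never appear at all.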

Those look as we'd expect. But do we see a correlation between the number of observations per hour and the mean kW after 2003? There is a suspicion that as mean kW goes up, so does the number of observations per hour... although this could just be a correlation with low demand periods (night time?)

Yes. The higher the kW, the more observations we get from 2004 onwards. Why?


It is distinctly odd that after 2003:

* we appear to have the most feeders reporting data at 'peak' times

* we have a lot of missing dateTimes between 00:30 and 05:00

If the monitors were set to only collect data when the power (or Wh in a given time frame) was above a given threshold then it would look like this... That wouldn't happen... would it?
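The threshold hypothesis is easy to check with a toy simulation (not the project's code): generate a smooth daily demand curve, drop every reading below a threshold, and the observation count per hour ends up tracking mean kW, just as we see above.

```r
library(data.table)

# toy: 7 days of 15-minute readings following a smooth daily demand curve
dts <- seq(as.POSIXct("2004-01-01 00:00", tz = "UTC"),
           by = "15 min", length.out = 96 * 7)
t <- as.numeric(difftime(dts, dts[1], units = "hours"))
toyDT <- data.table(rDateTime = dts,
                    kW = 50 + 40 * sin(2 * pi * (t - 6) / 24))

# a monitor that only logs readings above a 40 kW threshold
loggedDT <- toyDT[kW > 40]
loggedDT[, rHour := as.numeric(format(rDateTime, "%H"))]

# hourly observation counts vs mean kW of the logged readings
hourlyDT <- loggedDT[, .(nObs = .N, meankW = mean(kW)), keyby = rHour]
cor(hourlyDT$nObs, hourlyDT$meankW) # positive: more obs in higher-kW hours
```

Night-time hours vanish entirely and shoulder hours are only partially logged, so low counts and low mean kW go together - the same signature as the post-2003 data.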

# Selecting the 'best' days

Here we use a wide form of the feeder data which has each feeder as a column.

We should have `r uniqueN(feederDT$feeder_ID)` feeders. We want to find days when all of these feeders have complete data.

The wide dataset has a count of NAs per row (dateTime) from which we infer how many feeders are reporting:

```{r}

wDT <- drake::readd(wideData) # back from the drake

names(wDT)

```

If we take the mean of the number of feeders reporting per day (date) then a value of 25 will indicate a day when _all_ feeders have _all_ data (since it would be the mean of all the '25's).
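The day-selection logic can be sketched on a toy wide table (3 feeder columns instead of 25; the `F1`-`F3` and `nFeedersReporting` names are illustrative, not the real column names):

```r
library(data.table)

# toy wide data: 3 feeders over 2 days; F3 misses 4 dateTimes on day 2
dts <- seq(as.POSIXct("2003-06-01 00:00", tz = "UTC"),
           by = "15 min", length.out = 96 * 2)
wToyDT <- data.table(rDateTime = dts, F1 = 1, F2 = 1, F3 = 1)
wToyDT[97:100, F3 := NA_real_]

feederCols <- c("F1", "F2", "F3")
wToyDT[, nFeedersReporting := rowSums(!is.na(.SD)), .SDcols = feederCols]
wToyDT[, rDate := as.Date(rDateTime)]

# a day is 'best' only if the daily mean equals the number of feeders
dailyDT <- wToyDT[, .(meanReporting = mean(nFeedersReporting)), keyby = rDate]
bestDays <- dailyDT[meanReporting == length(feederCols)]
bestDays
```

Any missing dateTime drags the daily mean below the feeder count, so only fully complete days survive the filter.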

```{r testWide}
wDT <- addSeason(wDT, dateVar = "rDateTime", h = "N")
wDT[, rDoW := lubridate::wday(rDateTime)]
wDT[, rDate := lubridate::date(rDateTime)]
# how many days have all feeders sending data in all dateTimes?
```

Re-plot by the % of expected if we assume we _should_ have n feeders * 24 hours * 4 per hour (it will be the same shape). This also tells us that there is some reason why we get fluctuations in the number of data points per hour after 2003.

For fun we then print 4 tables of the 'best' days per season.