Similar to [Dark Data](https://press.princeton.edu/books/hardcover/9780691182377/dark-data) but possibly even nastier...
# dataCleaning
A place to store hints, tips and examples for data cleaning. We use a lot of very dirty data which often has outliers and missing observations. Since most of this data is large-scale 'sensor' data with timestamps, we make a lot of use of the following R packages to process and visualise the data so we can see what is odd and what is missing:
* [data.table](https://rdatatable.gitlab.io/data.table/) - very fast data loading and wrangling
* [lubridate](https://lubridate.tidyverse.org/) - _the_ way to do dates and dateTimes in R
* [hms](https://hms.tidyverse.org/) - deals with time (HH:MM:SS)
* [ggplot2](https://ggplot2.tidyverse.org/) - plots, especially using [geom_tile()](https://ggplot2.tidyverse.org/reference/geom_tile.html) with date on the x axis, time of day on the y axis and 'fill' set to the sensor value that _should_ be there. This shows up non-random (and random) data holes like [these](https://git.soton.ac.uk/SERG/datacleaning/-/blob/master/docs/report_cleanFeeders_allData.pdf) very nicely.
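
For example, a minimal sketch of the kind of tile plot described above, assuming a data.table `dt` with a POSIXct `dateTime` column and a numeric sensor reading called `kW` (both names are illustrative, not from this repo):

```r
library(data.table)
library(lubridate)
library(hms)
library(ggplot2)

# assumes dt is a data.table with a POSIXct dateTime column and a numeric
# sensor reading called kW - both names are illustrative
dt[, obsDate := as_date(dateTime)]                     # date part for the x axis
dt[, obsTime := as_hms(format(dateTime, "%H:%M:%S"))]  # time of day for the y axis

# missing tiles = observations that _should_ be there but aren't
ggplot(dt, aes(x = obsDate, y = obsTime, fill = kW)) +
  geom_tile() +
  scale_fill_viridis_c(name = "kW") +
  labs(x = "Date", y = "Time of day")
```

Plotting presence/absence of data this way is usually much quicker at revealing gaps than scanning summary tables.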
# useIt
This repo is an R package. This means:
* package functions are kept in /R
...
...
* if you can, **run `Rscript ./make_cleanFeeders.R` in a terminal, not at the RStudio console** - this stops RStudio from locking up
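
A minimal sketch of getting the package functions loaded for interactive use, assuming you have cloned the repo and are working from its root folder (the use of `devtools`/`remotes` and the `.git` URL are assumptions, not requirements):

```r
# either: load everything in /R straight from the source tree (run from the repo root)
devtools::load_all(".")

# or: install from the GitLab repo (the .git URL is an assumption)
# remotes::install_git("https://git.soton.ac.uk/SERG/datacleaning.git")
```

The heavier processing scripts (such as make_cleanFeeders.R) are still best run with Rscript in a terminal, as noted above.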