# dataCleaning

A place to store hints, tips and examples for data cleaning. We use a lot of very dirty data, which often has outliers and missing observations. Since most of this data is large-scale 'sensor' data with timestamps, we make heavy use of these R packages to process and visualise it so we can see what is odd and what is missing:

 * [data.table](https://rdatatable.gitlab.io/data.table/) - very fast data loading and wrangling
 * [lubridate](https://lubridate.tidyverse.org/) - _the_ way to do dates and dateTimes in R
 * [hms](https://hms.tidyverse.org/) - deals with time (HH:MM:SS)
 * [ggplot2](https://ggplot2.tidyverse.org/) - plots, especially using [geom_tile()](https://ggplot2.tidyverse.org/reference/geom_tile.html) with date on the x axis, time of day on the y and 'fill' set to the sensor value that _should_ be there. This shows up non-random (and random) data holes like [these](https://git.soton.ac.uk/SERG/datacleaning/-/blob/master/docs/report_cleanFeeders_allData.pdf) very nicely.
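
As an illustration of the `geom_tile()` approach, here is a minimal sketch using simulated half-hourly data. The column names (`rDateTime`, `rDate`, `rTime`, `kW`) and the simulated gaps are assumptions made up for this example, not taken from the repo's functions.

```r
library(data.table)
library(lubridate)
library(hms)
library(ggplot2)

# Simulated half-hourly 'sensor' data for 14 days (hypothetical values)
set.seed(42)
dt <- data.table(
  rDateTime = seq(
    from = ymd_hms("2021-01-01 00:00:00", tz = "UTC"),
    by = "30 min",
    length.out = 48 * 14
  )
)
dt[, kW := rnorm(.N, mean = 10, sd = 2)]

# Create data holes: one whole missing day (non-random) plus 30 random gaps
dt <- dt[as_date(rDateTime) != as_date("2021-01-05")]
dt <- dt[-sample(nrow(dt), 30)]

# Split the dateTime into a date (x axis) and a time of day (y axis)
dt[, rDate := as_date(rDateTime)]
dt[, rTime := as_hms(rDateTime)]

# Tiles are only drawn where an observation exists, so holes show as gaps
p <- ggplot(dt, aes(x = rDate, y = rTime, fill = kW)) +
  geom_tile() +
  labs(x = "Date", y = "Time of day", fill = "kW")
```

Printing `p` should show the missing day as an empty vertical stripe and the random gaps as scattered empty tiles.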

This repo is an R package. This means:

 * package functions are kept in /R
 * help files auto-created by roxygen are in /man
 * if you clone it you can build it and use the functions
 * we use drake like this:
     * `make_XX.R` contains a call to `drake::r_make(source = "_drake_XX.R")`
     * `_drake_XX.R` contains the drake plan and the functions & package loading. This is not quite what the [drake book](https://books.ropensci.org/drake/projects.html#usage) recommends but it works for us
     * Rmd scripts called by the drake plan to report results are kept in /Rmd
     * outputs are kept in /docs (reports, plots etc)
     * if you can, **run `Rscript ./make_cleanFeeders.R` in a terminal, not at the RStudio console**: this stops RStudio from locking up
 * you (and we) keep your data out of it!
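
The drake layout described above might look roughly like the sketch below, which shows what a `_drake_XX.R` file could contain. The function `cleanSensorData()`, the target names and the simulated data are hypothetical placeholders, not the repo's actual plan. A real plan would load data from outside the repo and would usually include a target that renders a report from /Rmd.

```r
# _drake_XX.R (sketch): package loading, functions and the plan in one file.
# The matching make_XX.R would contain just:
#   drake::r_make(source = "_drake_XX.R")
library(drake)
library(data.table)

# Hypothetical cleaning function: drop missing and negative sensor values
cleanSensorData <- function(dt) {
  dt[!is.na(value) & value >= 0]
}

plan <- drake::drake_plan(
  # Simulated stand-in for real data, which is kept outside the repo
  rawData = data.table::data.table(value = c(1.2, NA, -5, 3.4)),
  cleanData = cleanSensorData(rawData),
  nDropped = nrow(rawData) - nrow(cleanData)
)

# r_make() expects the sourced file to end with a call to drake_config()
drake::drake_config(plan)
```

Keeping everything in one file per plan means each `make_XX.R` stays a one-liner. The trade-off is that functions are not shared between plans unless they also live in /R.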

We'd love your contributions - feel free to:

 * [fork & go](https://happygitwithr.com/fork-and-clone.html)
 * make a [new branch](https://git.soton.ac.uk/SERG/workflow/-/blob/master/howTo/gitBranches.md) in your fork
 * make some improvements
 * send us a pull request (just code, no data please, keep your data [elsewhere](https://git.soton.ac.uk/SERG/workflow/-/blob/master/howTo/keepingData.md)!)