# dataCleaning

A place to store hints, tips and examples for data cleaning. We use a lot of very dirty data which often has outliers and missing observations. Since most of this data is large-scale 'sensor' data with timestamps, we rely heavily on these R packages to process and visualise it so we can see what is odd and what is missing:

 * [data.table](https://rdatatable.gitlab.io/data.table/) - very fast data loading and wrangling
 * [lubridate](https://lubridate.tidyverse.org/) - _the_ way to do dates and date-times in R
 * [hms](https://hms.tidyverse.org/) - handles time of day (HH:MM:SS)
 * [ggplot2](https://ggplot2.tidyverse.org/) - plots, especially using [geom_tile()](https://ggplot2.tidyverse.org/reference/geom_tile.html) with date on the x axis, time of day on the y and 'fill' set to the sensor value that _should_ be there. This shows up non-random (and random) data holes like [these](https://git.soton.ac.uk/SERG/datacleaning/-/blob/master/rmd/cleaningFeederData_files/figure-latex/missingVis-1.pdf) very nicely.
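
As a rough illustration of how these four packages fit together, the sketch below builds a week of synthetic half-hourly readings, knocks out a block of them, and draws the date-by-time-of-day tile plot described above. The data and the column names (`rDateTime`, `rDate`, `rTime`, `value`) are made up for the example and are not taken from the package code.

```r
library(data.table)
library(lubridate)
library(hms)
library(ggplot2)

# Synthetic half-hourly 'sensor' data for one week (illustrative only)
set.seed(42)
dt <- data.table(
  rDateTime = seq(
    from = ymd_hms("2020-01-01 00:00:00", tz = "UTC"),
    by = "30 min",
    length.out = 48 * 7
  )
)
dt[, value := rnorm(.N, mean = 100, sd = 10)]

# Mimic a non-random data hole: 3 January, 06:00 to 11:30 inclusive
dt[as_date(rDateTime) == ymd("2020-01-03") & hour(rDateTime) %in% 6:11,
   value := NA_real_]

# Split the timestamp into a date (x axis) and a time of day (y axis)
dt[, rDate := as_date(rDateTime)]
dt[, rTime := as_hms(rDateTime)]

# Count the missing observations before plotting
nMissing <- dt[is.na(value), .N]

# One tile per half-hour; missing values get their own colour so holes stand out
p <- ggplot(dt, aes(x = rDate, y = rTime, fill = value)) +
  geom_tile() +
  scale_fill_viridis_c(na.value = "red") +
  labs(x = "Date", y = "Time of day", fill = "Sensor value")

print(p)
```

Setting `na.value` to a loud colour is just one option; with real data where whole rows are absent rather than `NA`, the holes show up as blank gaps in the tiles instead.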

This repo is an R package. This means:

 * package functions are kept in /R
 * help files auto-created by roxygen are in /man
 * if you clone it you can build it and use the functions
 * we use drake like this:
     * `make_XX.R` contains a call to `drake::r_make(source = "_drake_XX.R")`
     * `_drake_XX.R` contains the drake plan, the functions and the package loading. This is not quite what the [drake book](https://books.ropensci.org/drake/projects.html#usage) recommends but it works for us
     * Rmd scripts called by the drake plan to report results are kept in /Rmd
     * outputs are kept in /docs (reports, plots etc)
     * if you can, **run `Rscript ./make_cleanFeeders.R` in a terminal, not at the RStudio console**; this stops RStudio from locking up
 * you (and we) keep your data out of it!
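
To make the drake layout above a little more concrete, here is a minimal sketch of what a `_drake_XX.R` file could look like. Everything in it is hypothetical: the `flagOutliers()` function, the target names and the toy data are invented for illustration and are not the contents of any file in this repo. The matching `make_XX.R` would then hold just the `drake::r_make()` call shown in the first comment.

```r
# make_XX.R would contain only this line:
#   drake::r_make(source = "_drake_XX.R")

# _drake_XX.R (sketch): package loading, functions and the plan in one file
library(drake)
library(data.table)

# Hypothetical cleaning function; flags values more than `threshold`
# standard deviations from the mean
flagOutliers <- function(dt, threshold = 3) {
  dt <- copy(dt)  # avoid modifying the caller's table by reference
  dt[, isOutlier := abs(value - mean(value, na.rm = TRUE)) >
       threshold * sd(value, na.rm = TRUE)]
  dt
}

# The plan: each target is built from the command on its right-hand side
plan <- drake_plan(
  rawData = data.table(value = c(rnorm(100), 50)),  # toy data, not real
  flagged = flagOutliers(rawData),
  nOutliers = flagged[isOutlier == TRUE, .N]
)

# r_make() expects the sourced file to end with a drake_config() call
drake_config(plan)
```

Because `r_make()` runs the plan in a fresh R process, calling it via `Rscript` from a terminal keeps the work out of the interactive session.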

We'd love your contributions - feel free to:

 * [fork & go](https://happygitwithr.com/fork-and-clone.html)
 * make a [new branch](https://git.soton.ac.uk/SERG/workflow/-/blob/master/howTo/gitBranches.md) in your fork
 * make some improvements
 * send us a pull request (just code, no data please, keep your data [elsewhere](https://git.soton.ac.uk/SERG/workflow/-/blob/master/howTo/otherResources.md)!)