dataCleaning
A place to store hints, tips and examples for data cleaning. We work with a lot of very dirty data, which often has outliers and missing observations. Most of it is large-scale 'sensor' data with timestamps, so we rely heavily on the following R packages to process and visualise the data and see what is odd and what is missing:
- data.table - very fast data loading and wrangling
- lubridate - the way to do dates and dateTimes in R
- hms - deals with time (HH:MM:SS)
- ggplot2 - plots, especially using geom_tile() with date on the x axis, time of day on the y axis and 'fill' set to the sensor value that should be there. This shows up non-random (and random) data holes very nicely.
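As a rough illustration of how these packages fit together, here is a minimal sketch using made-up half-hourly data. The column names (dateTime, value, obsDate, obsTime) and the simulated gap are assumptions for the example only, not anything taken from this repo's functions.

```r
library(data.table)
library(ggplot2)

# Synthetic half-hourly 'sensor' data for one week (illustrative only)
set.seed(42)
dt <- data.table(
  dateTime = seq(lubridate::ymd_hms("2020-01-01 00:00:00"),
                 by = "30 min", length.out = 48 * 7)
)
dt[, value := rnorm(.N, mean = 100, sd = 10)]

# Drop a block of observations to mimic a non-random data hole
dt <- dt[!(lubridate::as_date(dateTime) == as.Date("2020-01-03") &
             lubridate::hour(dateTime) %in% 8:15)]

# Split the timestamp into a date (x axis) and a time of day (y axis)
dt[, obsDate := lubridate::as_date(dateTime)]
dt[, obsTime := hms::as_hms(dateTime)]

# Count observations per day: days with fewer than 48 have gaps
perDay <- dt[, .(nObs = .N), by = obsDate]
perDay[nObs < 48]

# Tile plot: missing observations show up as blank tiles
p <- ggplot(dt, aes(x = obsDate, y = obsTime, fill = value)) +
  geom_tile() +
  labs(x = "Date", y = "Time of day", fill = "Sensor value")
print(p)
```

With real data you would replace the synthetic table with something like data.table::fread() on your own file, kept outside the repo.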
This repo is an R package. This means:
- package functions are kept in /R
- help files auto-created by roxygen are in /man
- if you clone it you can build it and use the functions
- Rmd scripts for reporting the results of drake plans are in /Rmd
- outputs are kept in /docs (reports, plots etc)
- you (and we) keep your data out of it!
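One way to build the package from a local clone is sketched below. It assumes devtools is installed, that your working directory is the root of the clone, and that the package name in DESCRIPTION matches the repo name dataCleaning (an assumption, so check DESCRIPTION first).

```r
# Only proceed if we appear to be in a package root and devtools is available
isPkgRoot <- file.exists("DESCRIPTION")
hasDevtools <- requireNamespace("devtools", quietly = TRUE)

if (isPkgRoot && hasDevtools) {
  devtools::document()  # regenerate the /man help files from roxygen comments
  devtools::install()   # build and install the package from the current directory
  library(dataCleaning) # assumed package name, check DESCRIPTION
} else {
  message("Run this from the root of your clone with devtools installed.")
}
```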
We'd love your contributions - feel free to:
- fork the repo
- make a new branch in your fork
- make some improvements
- send us a pull request (just code, no data please; keep your data elsewhere!)