datacleaning

    B.Anderson authored

    dataCleaning

    A place to store hints, tips and examples for data cleaning. We work with a lot of very dirty data, which often has outliers and missing observations. Most of it is large-scale, time-stamped 'sensor' data, so we rely heavily on these R packages to process and visualise it and to see what is odd and what is missing:

    • data.table - very fast data loading and wrangling
    • lubridate - the way to do dates and dateTimes in R
    • hms - deals with time (HH:MM:SS)
    • ggplot2 - plots, especially geom_tile() with date on the x axis, time of day on the y axis and 'fill' set to the sensor value that should be there. This shows up non-random (and random) data holes very nicely.
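
As a rough illustration of how the four packages fit together, here is a minimal sketch that uses simulated half-hourly data, not real sensor data. The column names (dateTime, value, obsDate, obsTime), the dates and the size of the holes are all assumptions made up for this example and are not taken from the package functions in /R.

```r
library(data.table)
library(lubridate)
library(hms)
library(ggplot2)

# Simulated (hypothetical) sensor data: 14 days of half-hourly readings
set.seed(42)
dt <- data.table(
  dateTime = seq(ymd_hms("2020-01-01 00:00:00"), by = "30 min", length.out = 48 * 14)
)
dt[, value := rnorm(.N, mean = 10)]

# Create holes: one whole day missing (non-random) plus 30 random observations
dt <- dt[as_date(dateTime) != ymd("2020-01-05")]
dt <- dt[-sample(nrow(dt), 30)]

# Split the timestamp into a date (x axis) and a time of day (y axis)
dt[, obsDate := as_date(dateTime)]
dt[, obsTime := as_hms(dateTime)]

# Count observations per day: short or absent days stand out in this table
obsPerDay <- dt[, .(nObs = .N), keyby = obsDate]

# Tile plot: missing observations show up as gaps in the grid
p <- ggplot(dt, aes(x = obsDate, y = obsTime, fill = value)) +
  geom_tile() +
  labs(x = "Date", y = "Time of day", fill = "Sensor value")
print(p)
```

In the plot the missing day appears as an empty vertical strip and the randomly dropped observations as isolated blank tiles. The obsPerDay table gives the same information as counts.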

    This repo is an R package. This means:

    • package functions are kept in /R
    • help files auto-created by roxygen are in /man
    • if you clone it you can build it and use the functions
    • Rmd scripts for reporting the results of drake plans are in /Rmd
    • outputs are kept in /docs (reports, plots etc)
    • you (and we) keep your data out of it!

    We'd love your contributions - feel free to:

    • fork & go
    • make a new branch in your fork
    • make some improvements
    • send us a pull request (just code, no data please, keep your data elsewhere!)