1 About

1.1 License

This work is (c) the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License unless otherwise marked.

For the avoidance of doubt and explanation of terms please refer to the full license notice and legal code.

1.2 Citation

If you wish to use any of the material from this paper please cite as:

  • Ben Anderson and Tom Rushby. (2018) Statistical Power, Statistical Significance, Study Design and Decision Making: A Worked Example (Sizing Demand Response Trials in New Zealand), Southampton: University of Southampton.

This work is (c) 2018 the authors.

1.3 History

Code & report history:

1.4 Data:

This report uses circuit level extracts for ‘Heat Pumps’ from the NZ GREEN Grid Household Electricity Demand Data (https://dx.doi.org/10.5255/UKDA-SN-853334 (Anderson et al. 2018)). These have been extracted using the code found in https://github.com/CfSOtago/GREENGridData/blob/master/examples/code/extractCleanGridSpy1minCircuit.R

1.5 Acknowledgements

This work was supported by:

2 Introduction

This report contains the analysis for a paper of the same name. The text is stored elsewhere for ease of editing.

3 Error, power, significance and decision making

4 Sample design: statistical power

4.1 Means

Table 4.1: Summary of mean consumption per household by season
season nPeople meanMeanW sdMeanW nHouseholds
Spring NA 595.994212 443.635765 2
Spring 1 92.230234 103.648048 2
Spring 2 89.339624 44.338145 4
Spring 3 207.619377 171.401166 7
Spring 4+ 175.856103 148.738840 11
Summer 1 4.019881 3.746534 2
Summer 2 35.275766 61.099420 3
Summer 3 87.760306 133.023910 7
Summer 4+ 33.637416 74.408925 10
Autumn NA 387.203399 316.302379 2
Autumn 1 70.587984 79.862519 2
Autumn 2 73.233719 56.284769 4
Autumn 3 245.971947 194.352385 8
Autumn 4+ 199.479290 165.371666 13
Winter NA 661.964787 275.647550 2
Winter 1 169.532436 213.880258 2
Winter 2 282.138922 71.265180 4
Winter 3 475.616350 280.427370 8
Winter 4+ 413.121623 279.067726 12
Table 4.1: Summary of mean consumption per household in winter
meanMeanW sdMeanW nHouseholds
412.6407 264.3291 28

Observations are summarised to mean W per household during 16:00 - 20:00 on weekdays for year = 2015.

Figure 4.1 shows the initial p = 0.01 plot.

Figure 4.1: Power analysis results (p = 0.01, power = 0.8)

Effect size at n = 1000 (p = 0.01, power = 0.8): 9.08%.

Figure 4.2 shows the plot for all results.

Figure 4.2: Power analysis results (power = 0.8)

At the same effect size (9.08%, as for n = 1000 at p = 0.01), the sample sizes needed at less stringent thresholds are:

  • p = 0.05, n = 575
  • p = 0.1, n = 425
  • p = 0.2, n = 250

Full table of results:

Table 4.2: Power analysis for means results table (partial)
sampleN p = 0.01 p = 0.05 p = 0.1 p = 0.2
50 41.16 32.08 27.32 21.60
100 28.90 22.60 19.27 15.26
150 23.54 18.43 15.73 12.46
200 20.36 15.95 13.61 10.79
250 18.20 14.27 12.17 9.65
300 16.61 13.02 11.11 8.81
350 15.37 12.05 10.29 8.15
400 14.37 11.27 9.62 7.63
450 13.55 10.63 9.07 7.19
500 12.85 10.08 8.61 6.82
550 12.25 9.61 8.20 6.50
600 11.73 9.20 7.86 6.23
650 11.27 8.84 7.55 5.98
700 10.86 8.52 7.27 5.76
750 10.49 8.23 7.03 5.57
800 10.16 7.97 6.80 5.39
850 9.85 7.73 6.60 5.23
900 9.57 7.51 6.41 5.08
950 9.32 7.31 6.24 4.95
1000 9.08 7.13 6.08 4.82
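
The values in Table 4.2 can be approximated with base R alone. The sketch below is an assumption about the method, not the report's own code (which is not shown here): it treats the design as a one-sided, two-sample t test with sampleN households per group, takes the rounded winter mean and standard deviation from Table 4.1, and uses stats::power.t.test() to solve for the smallest detectable difference, expressed as a percentage of the mean. The function name detectableEffectPct is hypothetical.

```r
# Rounded winter summary statistics from Table 4.1 (W)
winterMeanW <- 412.6407
winterSdW <- 264.3291

# Smallest detectable difference, as a % of the winter mean, for n households
# per group. ASSUMPTION: one-sided, two-sample t test (inferred, not confirmed
# by the report's code).
detectableEffectPct <- function(n, sigLevel, power = 0.8) {
  res <- stats::power.t.test(n = n, sd = winterSdW, sig.level = sigLevel,
                             power = power, type = "two.sample",
                             alternative = "one.sided")
  100 * res$delta / winterMeanW
}

# Detectable effect sizes at n = 1000 for each threshold used in the table
pValues <- c(0.01, 0.05, 0.1, 0.2)
effects <- sapply(pValues, function(p) detectableEffectPct(1000, p))
names(effects) <- paste("p =", pValues)
print(round(effects, 2))
```

If these assumptions hold, the printed values should be close to the final row of Table 4.2; small differences would be expected because the inputs are rounded.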

4.2 Proportions

This calculation does not require sample data. As a relatively simple example, suppose we were interested in the adoption of heat pumps in two equal sized samples. Suppose we thought that uptake might be 40% in one sample (say, home owners) and 25% in the other (rental properties) (ref BRANZ 2015). What sample size would we need to detect a significant difference with power = 0.8 at various p values?

pwr::pwr.2p.test() (ref pwr) can give us the answer…

Table 4.3: Samples required if p1 = 40% and p2 = 25%
n sig.level power props
224.94 0.01 0.8 p1 = 0.4 p2 = 0.25
151.17 0.05 0.8 p1 = 0.4 p2 = 0.25
119.07 0.10 0.8 p1 = 0.4 p2 = 0.25
86.73 0.20 0.8 p1 = 0.4 p2 = 0.25

We can repeat this for other values of p1 and p2. For example, suppose both were much smaller and closer together (e.g. 10% and 15%). Because the difference to be detected is now only 5 percentage points, we need much larger samples.

Table 4.4: Samples required if p1 = 10% and p2 = 15%
n sig.level power props
1012.35 0.01 0.8 p1 = 0.1 p2 = 0.15
680.35 0.05 0.8 p1 = 0.1 p2 = 0.15
535.89 0.10 0.8 p1 = 0.1 p2 = 0.15
390.31 0.20 0.8 p1 = 0.1 p2 = 0.15

Both tables above use an arcsine transformation of the two proportions to define the effect size.
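
For reference, the arcsine-based calculation can be sketched in base R as follows. This is an illustrative approximation, not the pwr source code: it uses Cohen's effect size h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)) and the usual normal approximation for a two-sided test with equal group sizes, ignoring the very small chance of rejecting in the wrong direction. As a result it can differ from the tabled values by about 0.1 at the larger significance levels. The function names arcsineH and nPerGroup are hypothetical.

```r
# Cohen's h: difference between arcsine-transformed proportions
arcsineH <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

# Approximate n per group for a two-sided test of two proportions
# (normal approximation; equal group sizes assumed)
nPerGroup <- function(p1, p2, sigLevel, power = 0.8) {
  h <- abs(arcsineH(p1, p2))
  2 * ((qnorm(1 - sigLevel / 2) + qnorm(power)) / h)^2
}

sigLevels <- c(0.01, 0.05, 0.1, 0.2)
results <- data.frame(
  sig.level = sigLevels,
  n_40_vs_25 = round(sapply(sigLevels, function(a) nPerGroup(0.40, 0.25, a)), 2),
  n_10_vs_15 = round(sapply(sigLevels, function(a) nPerGroup(0.10, 0.15, a)), 2)
)
print(results)
```

With the pwr package installed, the equivalent call would be along the lines of pwr::pwr.2p.test(h = pwr::ES.h(0.4, 0.25), sig.level = 0.05, power = 0.8).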

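As a double check, we can use the following equation to assess the margin of error (me) around a single proportion p estimated from a sample of size n, where z is the standard normal quantile for the chosen confidence level (1.96 for 95%):

\[me = \pm z \sqrt{\frac{p(1-p)}{n-1}}\]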

If:

  • p = 0.4 (40%)
  • n = 151

then the margin of error = +/- 0.078 (7.8 percentage points). So we could quote the Heat Pump uptake for owner-occupiers as 40% (+/- 7.8% [or 32.2 - 47.8] with p = 0.05).

This may be far too wide an error margin for our purposes, so we might instead recruit 500 per sample. The margin of error would then be +/- 0.043 (4.3 percentage points), so we could quote the Heat Pump uptake for owner-occupiers as 40% (+/- 4.3% [or 35.7 - 44.3] with p = 0.05).
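
The margin of error arithmetic above can be reproduced with a few lines of base R. This is a minimal sketch of the equation as given in this section; the function name marginOfError and its confLevel argument are hypothetical.

```r
# Margin of error for a proportion p from a sample of size n,
# using the n - 1 denominator from the equation in this section
marginOfError <- function(p, n, confLevel = 0.95) {
  z <- qnorm(1 - (1 - confLevel) / 2)  # 1.96 for a 95% interval
  z * sqrt(p * (1 - p) / (n - 1))
}

# Compare the two sample sizes discussed in the text
for (n in c(151, 500)) {
  me <- marginOfError(0.4, n)
  cat(sprintf("n = %d: 40%% +/- %.1f%% [%.1f - %.1f]\n",
              n, 100 * me, 100 * (0.4 - me), 100 * (0.4 + me)))
}
```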

5 Testing for differences: effect sizes, confidence intervals and p values

5.1 Getting it ‘wrong’

Here we use the base GREEN Grid sample, grouped by number of people per household, but re-sampled to be slightly larger.

NB: we create a small sample roughly 2 * the size of the GREEN Grid data. Due to small numbers and the random re-sampling with replacement process, there will be random fluctuations in the results with each run. As a result the results in this section will probably not match the results in the paper…
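
The re-sampling step can be illustrated as below. This is a sketch under stated assumptions, not the report's code: the households data frame is a hypothetical stand-in whose group sizes follow the winter rows of Table 4.1 for households with a known number of people (2 + 4 + 8 + 12 = 26), but whose meanW values are made up. The names hhID and resampleHouseholds are also hypothetical. Multipliers of 2 and 40 give 52 and 1040 rows, which match the sample sizes reported in sections 5.1 and 5.2, but whether the report samples in exactly this way is an assumption.

```r
set.seed(2018)  # fix the seed so this illustration is repeatable

# HYPOTHETICAL data: group sizes follow the winter rows of Table 4.1,
# meanW values are invented purely for illustration
households <- data.frame(
  hhID = seq_len(26),
  nPeople = rep(c("1", "2", "3", "4+"), times = c(2, 4, 8, 12)),
  meanW = round(runif(26, min = 0, max = 900), 1)
)

# Draw multiplier * nrow(df) rows at random, with replacement
resampleHouseholds <- function(df, multiplier) {
  rows <- sample(nrow(df), size = multiplier * nrow(df), replace = TRUE)
  df[rows, , drop = FALSE]
}

smallSample <- resampleHouseholds(households, 2)   # small sample, as in 5.1
largeSample <- resampleHouseholds(households, 40)  # large sample, as in 5.2

# Group counts vary from run to run unless the seed is fixed
print(table(smallSample$nPeople))
```

Because rows are drawn with replacement, the number of households per group (and so each group's mean and standard deviation) changes with every unseeded run, which is why the tables in this section will not match the paper exactly.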

Table 5.1: Number of households and summary statistics per group (winter heat pump use)
nPeople mean W sd W n households
1 147.9273 161.6783 7
2 301.9291 76.8570 7
3 429.2748 248.5965 14
4+ 470.3224 297.9899 24

So a sample of 52.

T test 1 <-> 3

Table 5.2: T test results (1 vs 3)
1 person mean 3 persons mean Mean difference statistic p.value conf.low conf.high
147.9273 429.2748 -281.3475 -3.116754 0.0061527 -471.4924 -91.20272

The results show that the mean power demand for 3 person households (the 'control' group in this example) was 429.27 W and for 1 person households ('Intervention 1') was 147.93 W. This is a (very) large difference in the mean of 281.35 W. The results of the t test are:

  • effect size = 281 W or 66% representing a substantial bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -471.49 to -91.20 representing considerable uncertainty/variation;
  • p value of 0.006 representing a relatively low risk of a false positive result and which, in this run, passes the conventional p < 0.05 threshold.
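
The figures in Table 5.2 can be checked directly from the group summaries in Table 5.1. The sketch below assumes Welch's unequal-variance t test, which is the default behaviour of R's t.test(); the report's own test code is not shown here, and the function name welchFromSummary is hypothetical.

```r
# Welch two-sample t test computed from summary statistics only
welchFromSummary <- function(m1, s1, n1, m2, s2, n2, confLevel = 0.95) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  se <- sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  diff <- m1 - m2
  tStat <- diff / se
  tCrit <- qt(1 - (1 - confLevel) / 2, df)
  list(estimate = diff, statistic = tStat, df = df,
       p.value = 2 * pt(-abs(tStat), df),
       conf.low = diff - tCrit * se, conf.high = diff + tCrit * se)
}

# 1 person (n = 7) vs 3 person (n = 14) households, values from Table 5.1
res <- welchFromSummary(147.9273, 161.6783, 7, 429.2748, 248.5965, 14)
print(round(unlist(res), 4))
```

The same function can be applied to the 1 vs 4+ comparison by substituting the 4+ row of Table 5.1.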

T test 1 <-> 4+

Table 5.3: T test results (1 vs 4+)
1 person mean 4+ persons mean Mean difference statistic p.value conf.low conf.high
147.9273 470.3224 -322.3952 -3.739141 0.0013971 -502.9035 -141.8869

Now:

  • effect size = 322 W or 68.55% representing a still substantial bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -502.90 to -141.89 representing similarly considerable uncertainty/variation;
  • p value of 0.001 representing a lower risk of a false positive result which, in this run, passes the conventional p < 0.05 threshold and also the more conservative p < 0.01.

5.2 Getting it ‘right’

NB: we create a larger sample roughly 40 * the size of the GREEN Grid data. Due to the random re-sampling with replacement process, there will be random fluctuations in the results with each run. As a result the results in this section will probably not exactly match the results in the paper but as the sample is large they should be quite close…

Table 5.4: Number of households and summary statistics per group
nPeople mean W sd W n households
1 159.2209 151.7489 88
2 285.1232 63.8895 149
3 511.0046 279.3558 308
4+ 417.3538 267.6910 495

So n = 1040

Figure 5.1: Mean W demand per group for large sample (Error bars = 95% confidence intervals for the sample mean)
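
The error bars can be approximated from the summary statistics in Table 5.4. The sketch below assumes a t-based 95% confidence interval for each group mean (mean +/- t * sd / sqrt(n)); the report's plotting code is not shown here, so this is an assumption rather than a description of how the figure was drawn. The column names halfWidth, ciLow and ciHigh are hypothetical.

```r
# Group summaries copied from Table 5.4
groups <- data.frame(
  nPeople = c("1", "2", "3", "4+"),
  meanW = c(159.2209, 285.1232, 511.0046, 417.3538),
  sdW = c(151.7489, 63.8895, 279.3558, 267.6910),
  n = c(88, 149, 308, 495)
)

# ASSUMPTION: t-based 95% confidence interval for each group mean
tCrit <- qt(0.975, df = groups$n - 1)
groups$halfWidth <- tCrit * groups$sdW / sqrt(groups$n)
groups$ciLow <- groups$meanW - groups$halfWidth
groups$ciHigh <- groups$meanW + groups$halfWidth

print(cbind(groups["nPeople"], round(groups[, c("meanW", "ciLow", "ciHigh")], 1)))
```

Larger groups have narrower intervals for a given standard deviation, because the half-width shrinks with the square root of n.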

Re-run T test 1 <-> 3

Table 5.5: T test results (1 vs 3)
1 person mean 3 persons mean Mean difference statistic p.value conf.low conf.high
159.2209 511.0046 -351.7837 -15.50063 0 -396.4678 -307.0996

In this case:

  • effect size = 351.78 W or 68.84% representing a still reasonable bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -396.47 to -307.10 representing much less uncertainty/variation;
  • p value < 0.001 (shown as 0 in the table after rounding) representing a very low risk of a false positive result as it passes all conventional thresholds.

Re-run T test 1 <-> 4+

Table 5.6: T test results (1 vs 4+)
1 person mean 4+ persons mean Mean difference statistic p.value conf.low conf.high
159.2209 417.3538 -258.1329 -12.80393 0 -297.8882 -218.3776

In this case:

  • effect size = 258.13 W or 61.85% representing a still reasonable bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -297.89 to -218.38 representing much less uncertainty/variation;
  • p value < 0.001 (shown as 0 in the table after rounding) representing a very low risk of a false positive result as it passes all conventional thresholds.

6 Summary and recommendations

6.1 Statistical power and sample design

6.2 Reporting statistical tests of difference (effects)

6.3 Making inferences and taking decisions

7 Acknowledgments

8 Runtime

Analysis completed in 42.5 seconds ( 0.71 minutes) using knitr in RStudio with R version 3.5.1 (2018-07-02) running on x86_64-apple-darwin15.6.0.

9 R environment

R packages used:

  • base R - for the basics (R Core Team 2016)
  • data.table - for fast (big) data handling (Dowle et al. 2015)
  • lubridate - date manipulation (Grolemund and Wickham 2011)
  • ggplot2 - for slick graphics (Wickham 2009)
  • readr - for csv reading/writing (Wickham, Hester, and Francois 2016)
  • dplyr - for select and contains (Wickham and Francois 2016)
  • progress - for progress bars (Csárdi and FitzJohn 2016)
  • knitr - to create this document & neat tables (Xie 2016)
  • pwr - non-base power analysis (Champely 2018)
  • dkUtils - for local dataknut utilities :-) devtools::install_github("dataknut/dkUtils")

Session info:

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.20         GREENGridData_1.0  pwr_1.2-2         
##  [4] forcats_0.3.0      broom_0.5.0        lubridate_1.7.4   
##  [7] readr_1.1.1        ggplot2_3.1.0      dplyr_0.7.7       
## [10] data.table_1.11.8  dkUtils_0.0.0.9000
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.19      highr_0.7         cellranger_1.1.0 
##  [4] pillar_1.3.0      compiler_3.5.1    plyr_1.8.4       
##  [7] bindr_0.1.1       prettyunits_1.0.2 progress_1.2.0   
## [10] tools_3.5.1       digest_0.6.18     lattice_0.20-35  
## [13] nlme_3.1-137      evaluate_0.12     tibble_1.4.2     
## [16] gtable_0.2.0      pkgconfig_2.0.2   rlang_0.3.0.1    
## [19] yaml_2.2.0        xfun_0.4          bindrcpp_0.2.2   
## [22] withr_2.1.2       stringr_1.3.1     hms_0.4.2        
## [25] rprojroot_1.3-2   grid_3.5.1        tidyselect_0.2.5 
## [28] glue_1.3.0        R6_2.3.0          readxl_1.1.0     
## [31] rmarkdown_1.10    bookdown_0.7      weGotThePower_0.1
## [34] reshape2_1.4.3    tidyr_0.8.1       purrr_0.2.5      
## [37] magrittr_1.5      backports_1.1.2   scales_1.0.0     
## [40] htmltools_0.3.6   assertthat_0.2.0  colorspace_1.3-2 
## [43] labeling_0.3      stringi_1.2.4     lazyeval_0.2.1   
## [46] munsell_0.5.0     crayon_1.3.4

References

Anderson, Ben, David Eyers, Rebecca Ford, Diana Giraldo Ocampo, Rana Peniamina, Janet Stephenson, Kiti Suomalainen, Lara Wilcocks, and Michael Jack. 2018. “New Zealand GREEN Grid Household Electricity Demand Study 2014-2018,” September. doi:10.5255/UKDA-SN-853334.

Champely, Stephane. 2018. Pwr: Basic Functions for Power Analysis. https://CRAN.R-project.org/package=pwr.

Csárdi, Gábor, and Rich FitzJohn. 2016. Progress: Terminal Progress Bars. https://CRAN.R-project.org/package=progress.

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Xie, Yihui. 2016. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.