1 About

1.1 License

This work is (c) the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License unless otherwise marked.

For the avoidance of doubt and explanation of terms please refer to the full license notice and legal code.

1.2 Citation

If you wish to use any of the material from this paper please cite as:

  • Ben Anderson, Tom Rushby, Abubakr Bahaj and Patrick James. (2019) Statistical Power, Statistical Significance, Study Design and Decision Making: A Worked Example (Sizing Demand Response Trials in New Zealand), Southampton: University of Southampton.

This work is (c) 2019 the authors.

1.3 History

Code & report history:

1.4 Data:

This report uses circuit level extracts for ‘Heat Pumps’ from the NZ GREEN Grid Household Electricity Demand Data (https://dx.doi.org/10.5255/UKDA-SN-853334 (Anderson et al. 2018)). These have been extracted using the code found in https://github.com/CfSOtago/GREENGridData/blob/master/examples/code/extractCleanGridSpy1minCircuit.R

1.5 Acknowledgements

This work was supported by:

2 Introduction

This report contains the analysis for a paper of the same name. The text is stored elsewhere for ease of editing.

3 Error, power, significance and decision making

4 Sample design: statistical power

4.1 Means

Table 4.1: Summary of loaded grid spy data
hhID linkID r_dateTime circuit powerW
Length:14250284 Length:14250284 Min. :2015-04-01 00:00:00 Length:14250284 Min. : -655.00
Class :character Class :character 1st Qu.:2015-06-22 12:39:00 Class :character 1st Qu.: 0.00
Mode :character Mode :character Median :2015-09-16 13:12:00 Mode :character Median : 0.00
NA NA Mean :2015-09-21 08:00:39 NA Mean : 147.92
NA NA 3rd Qu.:2015-12-17 17:52:00 NA 3rd Qu.: 61.29
NA NA Max. :2016-03-31 23:59:00 NA Max. :27759.00

Notice that there are negative power values (‘negawatts’): the minimum powerW in Table 4.1 is -655 W. We remove rf_46 and all negative values as per https://cfsotago.github.io/GREENGridData/gridSpy1mOutliersReport_v1.0.html
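
A minimal sketch of this cleaning step on a toy data frame. The column names follow Table 4.1; treating rf_46 as a linkID value is an assumption, and the rows are invented for illustration only.

```r
# Toy stand-in for the loaded Grid Spy extract (values are illustrative only)
powerDT <- data.frame(
  linkID = c("rf_01", "rf_01", "rf_46", "rf_02", "rf_02"),
  powerW = c(120.5, -3.2, 800.0, 0.0, 61.3),
  stringsAsFactors = FALSE
)

# Drop rf_46 (assumed here to be a linkID) and any negative readings
cleanDT <- powerDT[powerDT$linkID != "rf_46" & powerDT$powerW >= 0, ]

nrow(cleanDT)        # rows remaining after cleaning
min(cleanDT$powerW)  # minimum is no longer negative
```

Zero readings are kept on purpose: a heat pump that is switched off is a valid observation, whereas a negative reading is not.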

Table 4.2: Summary of cleaned grid spy data
hhID linkID r_dateTime circuit powerW month year tmpM season
Length:13298965 Length:13298965 Min. :2015-04-01 00:00:00 Length:13298965 Min. : 0.0 Min. : 1.000 Min. :2015 Min. : 1.000 Spring:3351249
Class :character Class :character 1st Qu.:2015-06-20 15:32:00 Class :character 1st Qu.: 0.0 1st Qu.: 4.000 1st Qu.:2015 1st Qu.: 4.000 Summer:2875049
Mode :character Mode :character Median :2015-09-14 20:06:00 Mode :character Median : 0.0 Median : 7.000 Median :2015 Median : 7.000 Autumn:3471128
NA NA Mean :2015-09-19 21:24:45 NA Mean : 152.0 Mean : 6.581 Mean :2015 Mean : 6.581 Winter:3601539
NA NA 3rd Qu.:2015-12-16 12:26:00 NA 3rd Qu.: 50.8 3rd Qu.: 9.000 3rd Qu.:2015 3rd Qu.: 9.000 NA
NA NA Max. :2016-03-31 23:59:00 NA Max. :27759.0 Max. :12.000 Max. :2016 Max. :12.000 NA

Number of households in cleaned heatpump data: 28
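
The season column in Table 4.2 could be derived from the month along the following lines. This is a sketch only: the month-to-season mapping shown (the usual Southern Hemisphere convention) and the helper name monthToSeason are assumptions, not taken from the report's code.

```r
# Map calendar month (1-12) to a Southern Hemisphere season.
# Assumed mapping: Dec-Feb = Summer, Mar-May = Autumn,
# Jun-Aug = Winter, Sep-Nov = Spring.
monthToSeason <- function(month) {
  lookup <- c("Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter",
              "Winter", "Winter", "Spring", "Spring", "Spring", "Summer")
  # Level order follows the season column shown in Table 4.2
  factor(lookup[month], levels = c("Spring", "Summer", "Autumn", "Winter"))
}

dates <- as.Date(c("2015-04-01", "2015-07-15", "2015-10-31", "2016-01-20"))
month <- as.integer(format(dates, "%m"))
monthToSeason(month)
```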

## Loading: /Volumes/hum-csafe/Research Projects/GREEN Grid/Packaged Data for Sharing Externally/ReShare/reshare_v1.0/ggHouseholdAttributesSafe.csv.zip
Summary of mean consumption per household by season
season nPeople meanMeanW sdMeanW nHouseholds
Spring NA 595.994212 443.635765 2
Spring 1 92.230234 103.648048 2
Spring 2 89.339624 44.338145 4
Spring 3 210.076391 187.625482 6
Spring 4+ 175.856103 148.738840 11
Summer 1 4.019881 3.746534 2
Summer 2 35.275766 61.099420 3
Summer 3 86.328405 145.661285 6
Summer 4+ 33.637416 74.408925 10
Autumn NA 387.203399 316.302379 2
Autumn 1 70.587984 79.862519 2
Autumn 2 73.233719 56.284769 4
Autumn 3 245.460272 209.918748 7
Autumn 4+ 199.479290 165.371666 13
Winter NA 661.964787 275.647550 2
Winter 1 169.532436 213.880258 2
Winter 2 282.138922 71.265180 4
Winter 3 476.930850 302.869555 7
Winter 4+ 413.121623 279.067726 12
Summary of mean consumption per household in winter
meanMeanW sdMeanW nHouseholds
410.6491 269.1503 27

Observations are summarised to mean W per household during 16:00 - 20:00 on weekdays for year = 2015.
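
The filtering and summarising described above might look like the following sketch. The toy observations are invented, and treating the window as 16:00 to 19:59 (excluding 20:00 itself) is an assumption.

```r
# Toy observations (illustrative values only); column names follow Table 4.1
obs <- data.frame(
  linkID = c("rf_01", "rf_01", "rf_01", "rf_02", "rf_02", "rf_02"),
  r_dateTime = as.POSIXct(c(
    "2015-07-01 17:00:00",  # Wednesday, inside the window
    "2015-07-01 21:00:00",  # Wednesday, outside the window
    "2015-07-04 17:00:00",  # Saturday, excluded
    "2015-07-02 16:00:00",  # Thursday, inside the window
    "2015-07-02 19:59:00",  # Thursday, inside the window
    "2016-01-05 17:00:00"   # 2016, excluded
  ), tz = "UTC"),
  powerW = c(400, 50, 900, 200, 300, 1000)
)

lt <- as.POSIXlt(obs$r_dateTime, tz = "UTC")
isPeak    <- lt$hour >= 16 & lt$hour < 20  # 16:00 to 19:59
isWeekday <- lt$wday %in% 1:5              # 0 = Sunday, 6 = Saturday
is2015    <- (lt$year + 1900) == 2015

# Mean W per household over the retained observations
peak  <- obs[isPeak & isWeekday & is2015, ]
meanW <- aggregate(powerW ~ linkID, data = peak, FUN = mean)
meanW
```

The mean and standard deviation of the per-household means are what feed into the power calculations that follow.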

Figure 4.1 shows the initial p = 0.01 plot.


Figure 4.1: Power analysis results (p = 0.01, power = 0.8)

Effect size detectable at n = 1000 (p = 0.01, power = 0.8): 9.29%.

Figure 4.2 shows the plot for all results.


Figure 4.2: Power analysis results (power = 0.8)

To detect the same effect size (9.29%, which needs n = 1000 at p = 0.01) at less stringent p values, the approximate sample sizes are:

  • p = 0.05, n = 600
  • p = 0.1, n = 425
  • p = 0.2, n = 275

Full table of results:

Table 4.3: Power analysis for means results table (partial): detectable effect size (%) by sample size (sampleN) and p value
sampleN p = 0.01 p = 0.05 p = 0.1 p = 0.2
50 42.11 32.82 27.95 22.10
100 29.57 23.13 19.72 15.62
150 24.09 18.86 16.09 12.75
200 20.83 16.32 13.93 11.04
250 18.62 14.60 12.46 9.87
300 16.99 13.32 11.37 9.01
350 15.73 12.33 10.53 8.34
400 14.71 11.53 9.85 7.80
450 13.86 10.87 9.28 7.36
500 13.15 10.31 8.80 6.98
550 12.54 9.83 8.39 6.65
600 12.00 9.41 8.04 6.37
650 11.53 9.04 7.72 6.12
700 11.11 8.72 7.44 5.90
750 10.73 8.42 7.19 5.70
800 10.39 8.15 6.96 5.52
850 10.08 7.91 6.75 5.35
900 9.80 7.69 6.56 5.20
950 9.53 7.48 6.39 5.06
1000 9.29 7.29 6.22 4.93
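
The values in Table 4.3 appear to be consistent with a one-sided, two-sample t test applied to the winter mean and standard deviation reported above (410.6491 W and 269.1503 W), with sampleN taken as the size of each group. That is an inference from the numbers rather than something the report states, and the sketch below uses base R's stats::power.t.test() rather than the packages the report loads. The helper name detectablePct is hypothetical.

```r
# Winter evening-peak mean and SD across households (winter summary table)
meanW <- 410.6491
sdW   <- 269.1503

# Smallest detectable difference, as a % of the mean, for a given group size.
# Assumption: two-sample, one-sided t test with equal group sizes.
detectablePct <- function(n, sigLevel, power = 0.8) {
  res <- stats::power.t.test(n = n, sd = sdW, sig.level = sigLevel,
                             power = power, type = "two.sample",
                             alternative = "one.sided")
  100 * res$delta / meanW
}

# Detectable effect sizes at n = 1000 for each p value in Table 4.3
sapply(c(0.01, 0.05, 0.1, 0.2), function(p) detectablePct(1000, p))
```

Because the detectable difference shrinks roughly with the square root of n, halving the effect size needs about four times the sample.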

4.2 Proportions

Sample size calculations for proportions do not require a pilot sample, only the expected proportions. As a relatively simple example, suppose we were interested in the adoption of heat pumps in two equal sized samples. Suppose we thought that in one sample (say, home owners) it might be 40% and in the other (rental properties) it would be 25% (ref BRANZ 2015). What sample size would we need to conclude a significant difference with power = 0.8 and at various p values?

pwr::pwr.2p.test() (ref pwr) can give us the answer…

Table 4.4: Samples required if p1 = 40% and p2 = 25%
n sig.level power props
224.94 0.01 0.8 p1 = 0.4 p2 = 0.25
151.17 0.05 0.8 p1 = 0.4 p2 = 0.25
119.07 0.10 0.8 p1 = 0.4 p2 = 0.25
86.73 0.20 0.8 p1 = 0.4 p2 = 0.25

We can repeat this for other values of p1 and p2. For example, suppose both were much smaller and closer together (e.g. 10% and 15%). Because the difference to be detected is now only 5 percentage points, we clearly need much larger samples.

Table 4.5: Samples required if p1 = 10% and p2 = 15%
n sig.level power props
1012.35 0.01 0.8 p1 = 0.1 p2 = 0.15
680.35 0.05 0.8 p1 = 0.1 p2 = 0.15
535.89 0.10 0.8 p1 = 0.1 p2 = 0.15
390.31 0.20 0.8 p1 = 0.1 p2 = 0.15

The above used an arcsine transform.
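
As a rough illustration of what the arcsine transform does here, the sketch below computes Cohen's effect size h and the per-group sample size using a normal approximation in base R. The function name nTwoProps is hypothetical; in the pwr package the equivalent route would be pwr::pwr.2p.test(h = pwr::ES.h(p1, p2), sig.level = ..., power = ...). The approximation ignores the far tail of the two-sided test, so it can differ from Tables 4.4 and 4.5 by about 0.1.

```r
# Sample size per group for comparing two proportions via the arcsine
# transform (Cohen's h), two-sided test, normal approximation
nTwoProps <- function(p1, p2, sigLevel, power = 0.8) {
  h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))  # effect size h
  z <- qnorm(1 - sigLevel / 2) + qnorm(power)   # critical value + power term
  2 * (z / h)^2
}

# sigLevel is vectorised, so each call returns one n per p value
round(nTwoProps(0.40, 0.25, c(0.01, 0.05, 0.1, 0.2)), 2)
round(nTwoProps(0.10, 0.15, c(0.01, 0.05, 0.1, 0.2)), 2)
```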

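As a double check, we can use the following equation to assess the margin of error:

\[me = \pm z \sqrt{\frac{p(1-p)}{n-1}}\]

where z is the standard normal critical value (1.96 for p = 0.05), p is the proportion and n is the sample size.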

If:

  • p = 0.4 (40%)
  • n = 151

then the margin of error = +/- 0.078 (7.8 percentage points). So we could quote the heat pump uptake for owner-occupiers as 40% (+/- 7.8%, or 32.2% to 47.8%, with p = 0.05).

This may be far too wide an error margin for our purposes, so we might instead recruit 500 per sample. The margin of error is then +/- 0.043 (4.3 percentage points), so we could quote the heat pump uptake for owner-occupiers as 40% (+/- 4.3%, or 35.7% to 44.3%, with p = 0.05).
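
The two margins of error quoted above can be checked with a few lines of base R. The function name marginOfError is a hypothetical helper; it simply evaluates the margin of error formula given earlier in this section.

```r
# Margin of error for a sample proportion: z * sqrt(p(1-p) / (n - 1))
marginOfError <- function(p, n, confLevel = 0.95) {
  z <- qnorm(1 - (1 - confLevel) / 2)  # 1.96 for a 95% interval
  z * sqrt(p * (1 - p) / (n - 1))
}

round(marginOfError(0.4, 151), 3)  # 0.078
round(marginOfError(0.4, 500), 3)  # 0.043
```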

5 Testing for differences: effect sizes, confidence intervals and p values

5.1 Getting it ‘wrong’

We use the base GREEN Grid data, grouped by number of people per household, and re-sample it to create a slightly larger sample.

NB 1: we create a small sample roughly 2 * the size of the GREEN Grid data. Due to small number effects and the random re-sampling with replacement process, there will be random fluctuations in the results with each run. As a consequence the results in this section will probably not match the results in the paper…

NB 2: sometimes the small-n random process doesn’t create households of a given type that we then need for analysis. Don’t worry, just re-knit until it does :-)
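
A sketch of the re-sampling idea, with a guard that redraws automatically when a household type is missing instead of re-knitting by hand. The toy base table, the function name resample and the retry limit are all assumptions made for illustration.

```r
set.seed(42)  # fix the seed so a given run is repeatable

# Toy base table: one row per household (illustrative values only)
base <- data.frame(
  linkID  = sprintf("rf_%02d", 1:8),
  nPeople = c("1", "1", "2", "2", "3", "3", "4+", "4+"),
  meanW   = c(150, 190, 260, 300, 450, 520, 380, 440),
  stringsAsFactors = FALSE
)

# Draw n households with replacement, retrying until every group appears
resample <- function(df, n, groups = unique(df$nPeople), maxTries = 100) {
  for (i in seq_len(maxTries)) {
    s <- df[sample(nrow(df), n, replace = TRUE), ]
    if (all(groups %in% s$nPeople)) return(s)
  }
  stop("no sample containing every group after ", maxTries, " tries")
}

small <- resample(base, n = 2 * nrow(base))  # roughly 2 * the base size
table(small$nPeople)
```

Setting a seed also makes the results reproducible from run to run, at the cost of no longer showing the run-to-run fluctuation described in NB 1.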

Table 5.1: Number of households and summary statistics per group (winter heat pump use)
nPeople mean W sd W n households
1 260.2741 135.26975 5
2 243.5189 30.20443 7
3 549.5341 319.97555 9
4+ 432.8072 279.30838 29

So a sample of 50.

T test 1 <-> 3

Table 5.2: T test results (1 vs 3)
1 person mean 3 persons mean Mean difference statistic p.value conf.low conf.high
260.2741 549.5341 -289.2599 -2.358998 0.036821 -557.5082 -21.01171

The results show that the mean power demand for the control group (3-person households) was 549.53 W and for Intervention 1 (1-person households) was 260.27 W. This is a (very) large difference in the mean of 289.26 W. The results of the t test are:

  • effect size = 289 W or 53% representing a substantial bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -557.51 to -21.01 representing considerable uncertainty/variation;
  • p value of 0.037 representing a relatively low risk of a false positive result which passes the conventional p < 0.05 threshold but fails the more conservative p < 0.01.
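
The statistic, p value and confidence interval in Table 5.2 can be recovered from the group summaries in Table 5.1 if we assume an unequal-variance (Welch) two-sample t test, which is the default for R's t.test(). The helper welchFromSummary below is a hypothetical sketch written for this check, not the report's own code.

```r
# Welch two-sample t test from group means, SDs and sizes
welchFromSummary <- function(m1, s1, n1, m2, s2, n2, confLevel = 0.95) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  se <- sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  diff  <- m1 - m2
  tStat <- diff / se
  half  <- qt(1 - (1 - confLevel) / 2, df) * se
  c(diff = diff, statistic = tStat, df = df,
    p.value = 2 * pt(-abs(tStat), df),
    conf.low = diff - half, conf.high = diff + half)
}

# 1-person vs 3-person households, values from Table 5.1
res13 <- welchFromSummary(260.2741, 135.26975, 5, 549.5341, 319.97555, 9)
round(res13, 4)
```

With only 5 and 9 households in the two groups, the standard error is large relative to the difference, which is why the interval is so wide.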

T test 1 <-> 4+

Table 5.3: T test results (1 vs 4+)
1 person mean 4+ persons mean Mean difference statistic p.value conf.low conf.high
260.2741 432.8072 -172.5331 -2.165191 0.0528381 -347.5762 2.509981

Now:

  • effect size = 173 W or 39.86% representing a still reasonable bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -347.58 to 2.51, which is narrower than before but now includes zero, so even the direction of the difference is uncertain;
  • p value of 0.053 representing a higher risk of a false positive result which (just) fails the conventional p < 0.05 threshold but passes the less conservative p < 0.1.

5.2 Getting it ‘right’

NB: we create a larger sample roughly 40 * the size of the GREEN Grid data. Due to the random re-sampling with replacement process, there will be random fluctuations in the results with each run. As a consequence the results in this section will probably not exactly match the results in the paper but as the new sample is large they should be quite close…

Table 5.4: Number of households and summary statistics per group
nPeople mean W sd W n households
1 171.1944 152.06488 91
2 285.4655 60.94574 149
3 474.2346 274.00976 297
4+ 435.3399 270.75666 463

So n = 1000

Figure 5.1: Mean W demand per group for large sample (Error bars = 95% confidence intervals for the sample mean)
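
The error bars in Figure 5.1 could be computed from the group summaries in Table 5.4 as follows. Using a t-based interval is an assumption about how the figure was drawn, and ciMean is a hypothetical helper name.

```r
# 95% confidence interval for a group mean from its summary statistics
# (t-based interval; an assumption about how the error bars were drawn)
ciMean <- function(m, s, n, confLevel = 0.95) {
  half <- qt(1 - (1 - confLevel) / 2, df = n - 1) * s / sqrt(n)
  c(lower = m - half, upper = m + half)
}

# Group summaries from Table 5.4
groups <- data.frame(
  nPeople = c("1", "2", "3", "4+"),
  meanW   = c(171.1944, 285.4655, 474.2346, 435.3399),
  sdW     = c(152.06488, 60.94574, 274.00976, 270.75666),
  n       = c(91, 149, 297, 463),
  stringsAsFactors = FALSE
)

# One row per group with the lower and upper bounds of its interval
cbind(groups["nPeople"],
      t(mapply(ciMean, groups$meanW, groups$sdW, groups$n)))
```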

re-run T tests 1 vs 3

Table 5.5: T test results (1 vs 3)
1 person mean 3 persons mean Mean difference statistic p.value conf.low conf.high
171.1944 474.2346 -303.0403 -13.45974 0 -347.3629 -258.7177

In this case:

  • effect size = 303.04 W or 63.9% representing a still reasonable bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -347.36 to -258.72 representing much less uncertainty/variation;
  • p value of < 0.001 (shown as 0 in the table due to rounding) representing a very low risk of a false positive result as it passes all conventional thresholds.

re-run T tests 1 person vs 4+

Table 5.6: T test results (1 vs 4+)
1 person mean 4+ persons mean Mean difference statistic p.value conf.low conf.high
171.1944 435.3399 -264.1455 -13.00654 0 -304.1695 -224.1215

In this case:

  • effect size = 264.15 W or 60.68% representing a still reasonable bang for buck for whatever caused the difference;
  • 95% confidence interval for the test = -304.17 to -224.12 representing much less uncertainty/variation;
  • p value of < 0.001 (shown as 0 in the table due to rounding) representing a very low risk of a false positive result as it passes all conventional thresholds.

6 Summary and recommendations

6.1 Statistical power and sample design

6.2 Reporting statistical tests of difference (effects)

6.3 Making inferences and taking decisions

7 Acknowledgments

8 Runtime

Analysis completed in 55.18 seconds (0.92 minutes) using knitr in RStudio with R version 3.5.1 (2018-07-02) running on x86_64-apple-darwin15.6.0.

9 R environment

R packages used:

Session info:

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.21         GREENGridData_1.0  pwr_1.2-2         
##  [4] forcats_0.4.0      broom_0.5.1        lubridate_1.7.4   
##  [7] readr_1.3.1        ggplot2_3.1.0      dplyr_0.8.0.1     
## [10] data.table_1.12.0  weGotThePower_0.1  dkUtils_0.0.0.9000
## [13] bookdown_0.9       markdown_0.9      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         cellranger_1.1.0 
##  [4] compiler_3.5.1    pillar_1.3.1      plyr_1.8.4       
##  [7] prettyunits_1.0.2 progress_1.2.0    tools_3.5.1      
## [10] digest_0.6.18     lattice_0.20-38   nlme_3.1-137     
## [13] evaluate_0.13     tibble_2.0.1      gtable_0.2.0     
## [16] pkgconfig_2.0.2   rlang_0.3.1       yaml_2.2.0       
## [19] xfun_0.4          withr_2.1.2       stringr_1.4.0    
## [22] generics_0.0.2    hms_0.4.2         grid_3.5.1       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] readxl_1.3.0      rmarkdown_1.11    tidyr_0.8.2      
## [31] reshape2_1.4.3    purrr_0.3.0       magrittr_1.5     
## [34] ellipsis_0.0.2    backports_1.1.3   scales_1.0.0     
## [37] htmltools_0.3.6   assertthat_0.2.0  colorspace_1.4-0 
## [40] labeling_0.3      stringi_1.3.1     lazyeval_0.2.1   
## [43] munsell_0.5.0     crayon_1.3.4

References

Anderson, Ben, David Eyers, Rebecca Ford, Diana Giraldo Ocampo, Rana Peniamina, Janet Stephenson, Kiti Suomalainen, Lara Wilcocks, and Michael Jack. 2018. “New Zealand GREEN Grid Household Electricity Demand Study 2014-2018,” September. doi:10.5255/UKDA-SN-853334.

Champely, Stephane. 2018. Pwr: Basic Functions for Power Analysis. https://CRAN.R-project.org/package=pwr.

Csárdi, Gábor, and Rich FitzJohn. 2016. Progress: Terminal Progress Bars. https://CRAN.R-project.org/package=progress.

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Xie, Yihui. 2016. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.