
Power Analysis

Introduction to Power Analysis

Power analysis is a set of methods for forecasting and understanding the Type-II error rate of an analysis. Most of the time, when we do inferential statistics on a data set using a traditional test such as a t-test, we care about the \(p\) value, which is compared against a criterion that sets the Type I error, or false alarm, rate. This limits the chance of concluding that we have a significant difference when one really doesn’t exist. Standard inferential tests ignore Type II errors completely: the chance of failing to find a significant result when the effect actually does exist. This often makes sense, because so much of experimental science works in the direction of inflating Type I error. For example, if you run an experiment and it fails to reach a p=.05 criterion and thus fails to support your favorite hypothesis, you might collect more observations or subjects and test again. It is not hard to see how this can lead to mistaken outcomes.

If you have a null result, you might instead want to know how likely this finding was. It could have failed for several reasons:

  1. It might have failed because the effect really exists but something went wrong: a measure was unreliable, the data were coded incorrectly, etc.
  2. It might have failed because the effect does not exist.
  3. It might have failed because the effect really exists but you did not collect enough data.

If possibility 1 is true, then we need to use our knowledge within the discipline to improve the situation. But how can we distinguish between #2 and #3? One of the tools for this is power analysis, which refers to methods to deal with, estimate, or control Type II error. Most resources cite a book by Cohen (1988) as the comprehensive source on this concept:

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Example

Suppose we have a phenomenon with a true but small between-group difference. We collect 100 observations in each of two groups.
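The code that generated the data for this example is not shown. A minimal sketch of how such data could be simulated and tested follows; the seed, the group means (10.0 and 10.5), and the standard deviation (1) are assumptions chosen to resemble the printed output, so the exact numbers will differ from those shown.

```r
# Sketch only: seed, means, and standard deviation are assumed values.
set.seed(100)
groupA <- rnorm(100, mean = 10.0, sd = 1)
groupB <- rnorm(100, mean = 10.5, sd = 1)

# Welch two-sample t-test on all 100 observations per group
full <- t.test(groupA, groupB)
print(full)

# The same test restricted to the first 10 observations per group,
# mimicking a much smaller (cheaper) experiment
small <- t.test(groupA[1:10], groupB[1:10])
print(small)
```

By default, t.test does not assume equal variances, which is why the output is labeled as a Welch test and reports fractional degrees of freedom.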


    Welch Two Sample t-test

data:  groupA and groupB
t = -3.926, df = 186.92, p-value = 0.0001213
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7636017 -0.2528548
sample estimates:
mean of x mean of y
 10.00291  10.51114 

This result is highly significant, so we can detect the effect with a large experiment. But suppose collecting data is very expensive, and you could only collect 10 observations in each group:


    Welch Two Sample t-test

data:  groupA[1:10] and groupB[1:10]
t = -1.3939, df = 14.98, p-value = 0.1837
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1918662  0.2494373
sample estimates:
mean of x mean of y
 9.982043 10.453257 

So, this test was not significant, even though the sample means were actually very close to the true means of the groups. We would like to know how likely we would have been to find a null effect in an experiment this size. We can use the pwr library to do this. All of the pwr functions take at least four arguments; you supply all but one, and the function solves for the one you leave out:

  • sample size
  • effect size
  • significance level (alpha; Type-I error rate)
  • power (1 minus Type-II error rate)

Estimating power

Some of the specific power functions will also take other arguments, such as whether the test is one-sided or two-sided. In this case, since we conducted the experiment already, let’s try to estimate the power. To do so, we need to compute the effect size. In the case of a t-test, we use Cohen’s d, the standardized difference between means, or \(\delta/sd\). The pwr library by Stephane Champely will do many power calculations for you, although there are also many on-line tools and custom functions in other packages.
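The calls behind the two outputs that follow are not shown. One way they could be written is sketched here; cohens_d is a hypothetical helper (it is not part of pwr) that assumes the pooled standard deviation is used as the standardizer, and the value of d is copied from the printed output rather than recomputed.

```r
library(pwr)

# Hypothetical helper: Cohen's d using the pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  abs(mean(x) - mean(y)) / pooled_sd
}

# With the data in hand one would call cohens_d(groupA[1:10], groupB[1:10]).
# Here the effect size reported in the output is entered directly.
d <- 0.6404344

# Leave `power` unspecified so that it is the quantity solved for.
p_unequal <- pwr.t2n.test(n1 = 10, n2 = 10, d = d, sig.level = 0.05)
p_equal   <- pwr.t.test(n = 10, d = d, sig.level = 0.05, type = "two.sample")
print(p_unequal)
print(p_equal)
```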


     t test power calculation

             n1 = 10
             n2 = 10
              d = 0.6404344
      sig.level = 0.05
          power = 0.2734475
    alternative = two.sided

     Two-sample t test power calculation

              n = 10
              d = 0.6404344
      sig.level = 0.05
          power = 0.2734475
    alternative = two.sided

NOTE: n is number in *each* group

We can specify the power analysis with either of these functions, where n is the number in each group. Here, we can see that with a .05 significance level and the effect size we found, we’d expect a power of about .27. This is a 73% chance of making a Type-II error. That is, in an experiment like the one we ran, we would fail to find a true difference of the size we measured about 73% of the time.

Estimating sample size

We might decide that we need to run another experiment based on this one. Supposing we have the same effect size, how many subjects would we need? The rule of thumb for power analysis is typically that we seek to have a test with power of 0.8: we want an 80% chance of finding the effect if it really is there, with a significance level of .05, meaning a 5% chance of finding an effect that is not there. The pwr.t2n.test function won’t work well here, because it lets N1 and N2 differ, so we can only use pwr.t.test:
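A sketch of the call that would solve for sample size: n is left out, and the other three quantities are supplied. The effect size is the one reported earlier; the variable names are illustrative.

```r
library(pwr)

# Solve for n (per group) given effect size, significance level, and power.
res_n <- pwr.t.test(d = 0.6404344, sig.level = 0.05, power = 0.8,
                    type = "two.sample")
print(res_n)

# Sample sizes must be whole numbers, so round up.
n_per_group <- ceiling(res_n$n)
n_per_group
```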


     Two-sample t test power calculation

              n = 39.25641
              d = 0.6404344
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

This suggests that we really need to collect 40 participants in each group (rounding 39.26 up) to stand a good chance of finding the result. If this study cost $200 per subject, we have just determined that it will cost $16,000 to run the study (80 subjects in total), which may be out of our budget and thus not worth doing. Instead, we might take a look at our measures and try to find ways to produce larger effect sizes, perhaps via a within-subject design or with a more reliable set of measures (such as one with double the number of observations or items).

Estimating Effect size

So we have determined that our experiment won’t work as designed. Suppose that we have enough money to run 20 subjects in each group in our new experiment. How large would the effect size have to be to reach a power of .8?
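To solve for the effect size instead, the same function can be called with d left out. This is a sketch of such a call; the variable name is illustrative.

```r
library(pwr)

# Solve for the detectable effect size d with 20 subjects per group.
res_d <- pwr.t.test(n = 20, sig.level = 0.05, power = 0.8,
                    type = "two.sample")
print(res_d)
```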


     Two-sample t test power calculation

              n = 20
              d = 0.9091587
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

This is good news. If we can find a way to cut the variance of our measure roughly in half (that is, reduce the standard deviation by about 30%) and so increase the effect size from .64 to .9, we would be able to find the effect with 20 participants per group.

Estimating p-value

Another way to look at it is to ask what Type-I error rate we would have to accept if the other parameters are fixed. This might be more appropriate in practical non-scientific settings, where you need to conduct a study to make a decision, but your managers have determined that the amount you can spend on the test is limited, because the test has to be paid for out of the benefit you expect from choosing the better design (interface, method, device, product, food, etc.). This would not be acceptable in a scientific context, but it might be acceptable for A/B testing of web sites or products.
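Because sig.level defaults to 0.05 in pwr.t.test, it has to be set to NULL explicitly when it is the quantity to solve for. A sketch of the two calls that would correspond to the outputs in this section (variable names are illustrative):

```r
library(pwr)

# Solve for the significance level with n, d, and power fixed.
res_alpha_80 <- pwr.t.test(n = 20, d = 0.6, power = 0.80,
                           sig.level = NULL, type = "two.sample")
print(res_alpha_80)

# Lowering the target power slightly brings the two error rates together.
res_alpha_76 <- pwr.t.test(n = 20, d = 0.6, power = 0.76,
                           sig.level = NULL, type = "two.sample")
print(res_alpha_76)
```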


     Two-sample t test power calculation

              n = 20
              d = 0.6
      sig.level = 0.2946697
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Here, we can see that the required significance level is .29, which is in the same range as the Type-II error rate \((1-power = .20)\). If we split the difference by lowering the power to .76, we can see that the two error rates become about equal:


     Two-sample t test power calculation

              n = 20
              d = 0.6
      sig.level = 0.2391074
          power = 0.76
    alternative = two.sided

NOTE: n is number in *each* group

So, if we know we have an effect size of .6 and can only afford to test 20 people in each group, then by picking a significance level of .24, we have a power of .76. Here Type I and Type II error rates are essentially equal. This means that if one option is better, we will see it 3/4 of the time and if the options don’t differ, we will see no difference 3/4 of the time. If you could further quantify the costs and benefits of each type of error, you could make other decisions that will optimize the test design.

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 173.0584 ±0%

Against denominator:
  Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.7746119 ±0%

Against denominator:
  Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS

Overview of power calculations

Power can be computed for a number of tests:

pwr-package         Basic Functions for Power Analysis pwr
cohen.ES            Conventional effects size
ES.h                Effect size calculation for proportions
ES.w1               Effect size calculation in the chi-squared test for goodness of fit
ES.w2               Effect size calculation in the chi-squared test for association
plot.power.htest    Plot diagram of sample size vs. test power
pwr                 Basic Functions for Power Analysis pwr
pwr.2p.test         Power calculation for two proportions (same sample sizes)
pwr.2p2n.test       Power calculation for two proportions (different sample sizes)
pwr.anova.test      Power calculations for balanced one-way analysis of variance tests
pwr.chisq.test      power calculations for chi-squared tests
pwr.f2.test         Power calculations for the general linear model
pwr.norm.test       Power calculations for the mean of a normal distribution (known variance)
pwr.p.test          Power calculations for proportion tests (one sample)
pwr.r.test          Power calculations for correlation test
pwr.t.test          Power calculations for t-tests of means (one sample, two samples and paired samples)
pwr.t2n.test        Power calculations for two samples (different sizes) t-tests of means

Power for t-tests

The above examples show how to calculate power for independent-sample t-tests, with either equal or unequal group sizes. If you have a paired-samples or one-sample t-test, you can specify the type argument to make those calculations:
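A sketch of how the type argument could be used to produce the three outputs that follow, using the same n and d in each case (variable names are illustrative):

```r
library(pwr)

# Same n and d, three different designs.
p_paired <- pwr.t.test(n = 50, d = 0.4, sig.level = 0.05, type = "paired")
p_two    <- pwr.t.test(n = 50, d = 0.4, sig.level = 0.05, type = "two.sample")
p_one    <- pwr.t.test(n = 50, d = 0.4, sig.level = 0.05, type = "one.sample")

print(p_paired)
print(p_two)
print(p_one)
```

For the paired and one-sample cases, n is the number of pairs or observations, and d is standardized by the standard deviation of the differences or of the single sample, so the two calculations give the same power.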


     Paired t test power calculation

              n = 50
              d = 0.4
      sig.level = 0.05
          power = 0.7917872
    alternative = two.sided

NOTE: n is number of *pairs*

     Two-sample t test power calculation

              n = 50
              d = 0.4
      sig.level = 0.05
          power = 0.5081857
    alternative = two.sided

NOTE: n is number in *each* group

     One-sample t test power calculation

              n = 50
              d = 0.4
      sig.level = 0.05
          power = 0.7917872
    alternative = two.sided

Power of correlations

For correlations, the r value IS the effect size. In many personality experiments, inter-scale correlations will have a maximum correlation of about 0.3. How many participants would we need to measure if this were true?
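A sketch of the corresponding call, leaving n out so that it is solved for (the variable name is illustrative):

```r
library(pwr)

# Sample size needed to detect r = 0.3 with 80% power at alpha = .05.
res_r30 <- pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)
print(res_r30)
```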


     approximate correlation power calculation (arctangh transformation)

              n = 84.07364
              r = 0.3
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

This is a lot larger than many people’s intuition. The plot shows that if we had collected only 50 participants, the power would dip to close to .5. What if we expect a slightly smaller correlation of 0.25?
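Both questions in the paragraph above can be checked with pwr.r.test, as in this sketch: one call solves for power at n = 50, the other for n at r = 0.25 (variable names are illustrative).

```r
library(pwr)

# Power for r = 0.3 if only 50 participants are collected.
p_n50 <- pwr.r.test(n = 50, r = 0.3, sig.level = 0.05)
p_n50$power

# Sample size needed for a slightly smaller correlation.
res_r25 <- pwr.r.test(r = 0.25, sig.level = 0.05, power = 0.8)
print(res_r25)
```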


     approximate correlation power calculation (arctangh transformation)

              n = 122.4466
              r = 0.25
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Suppose we had a study of 1000 participants and found a correlation of about +.05, which was not significant. What was the power to detect a true correlation of this size in our experiment?
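The power output shown for this example uses sig.level = 0.106, which matches the observed p-value of the correlation test rather than the conventional .05. A sketch of that call, alongside the conventional version for comparison (variable names are illustrative):

```r
library(pwr)

# As in the output shown: significance level set to the observed p-value.
p_obs <- pwr.r.test(n = 1000, r = 0.05, sig.level = 0.106)
print(p_obs)

# For comparison: the same calculation at the conventional .05 level.
p_05 <- pwr.r.test(n = 1000, r = 0.05, sig.level = 0.05)
p_05$power
```

When the significance level is set to the observed p-value and the effect size to the observed effect, the computed power will always land near .5, so the .05 version is usually the more informative of the two.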


    Pearson's product-moment correlation

data:  x and y
t = 1.6173, df = 998, p-value = 0.1061
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01089842  0.11276487
sample estimates:
       cor
0.05112921 

     approximate correlation power calculation (arctangh transformation)

              n = 1000
              r = 0.05
      sig.level = 0.106
          power = 0.48653
    alternative = two.sided

Power in One-way ANOVA and F tests

For general F tests that we might use in regression or ANOVA, the pwr.f2.test or pwr.anova.test functions are used.

The effect size pwr.f2.test uses is \(f^2\), where \(f^2 = \frac{R^2}{1-R^2}\) (not to be confused with the actual \(F\) value); pwr.anova.test takes \(f\), its square root. To use this, we need to know the degrees of freedom associated with the test.

Notice that if \(R^2\) is very high, your \(f^2\) will be very high. If \(R^2=.5\), \(f^2=1.0\). The pwr library includes some lookup functions to help you judge what might be considered a large versus small effect size:
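The lookup function in question is cohen.ES. A sketch of the calls that would produce the three outputs that follow:

```r
library(pwr)

# Conventional small, medium, and large effect sizes for f2.
es_small  <- cohen.ES(test = "f2", size = "small")
es_medium <- cohen.ES(test = "f2", size = "medium")
es_large  <- cohen.ES(test = "f2", size = "large")

print(es_small)
print(es_medium)
print(es_large)
```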


     Conventional effect size from Cohen (1982)

           test = f2
           size = small
    effect.size = 0.02

     Conventional effect size from Cohen (1982)

           test = f2
           size = medium
    effect.size = 0.15

     Conventional effect size from Cohen (1982)

           test = f2
           size = large
    effect.size = 0.35

Here, size is relative, because an f2 of .35 would be an R^2 of around .25, which is a correlation of around .5.

So, suppose we had an ANOVA/F test with 4 conditions (maybe a 2x2), and 100 total participants. This would be reported as F(3,96), which specifies our degrees of freedom.

Call:
   aov(formula = out ~ group, data = dat)

Terms:
                   group Residuals
Sum of Squares    6.1687  102.3679
Deg. of Freedom        3        96

Residual standard error: 1.032634
Estimated effects may be unbalanced
            Df Sum Sq Mean Sq F value Pr(>F)
group        3   6.17   2.056   1.928   0.13
Residuals   96 102.37   1.066               

Results are a bit ambiguous: the p-value is 0.13, which is not really strong evidence for a lack of an effect. To calculate \(f^2\), we could either fit an lm and use its \(R^2\), or do it by hand from the sums of squares:
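The by-hand route can be sketched directly from the sums of squares in the ANOVA table above. The variable names are illustrative, and the lm route is shown only as a comment because it needs the original data frame.

```r
# Sums of squares copied from the ANOVA table.
ss_group <- 6.1687
ss_resid <- 102.3679

# R^2 is the proportion of total variance explained by group.
r2 <- ss_group / (ss_group + ss_resid)

# f2 = R^2 / (1 - R^2), which simplifies to ss_group / ss_resid.
f2 <- r2 / (1 - r2)

r2
f2

# Equivalent route via lm, given the data frame used in the aov call:
# r2 <- summary(lm(out ~ group, data = dat))$r.squared
```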

[1] 0.05683526
[1] 0.06026016

Now, f2 is .06. How many participants would we have to have run to find a significant effect?
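A sketch of the call: u is the numerator degrees of freedom (3 for four groups), and v, the denominator degrees of freedom, is left out so that it is solved for. The conversion from v to a total sample size assumes a one-way design with an intercept, so N = u + v + 1.

```r
library(pwr)

# Solve for the denominator degrees of freedom v.
res_f2 <- pwr.f2.test(u = 3, f2 = 0.0603, sig.level = 0.05, power = 0.8)
print(res_f2)

# Total participants implied by v (assumes N = u + v + 1).
n_total <- ceiling(res_f2$v) + 3 + 1
n_total
```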


     Multiple regression power calculation

              u = 3
              v = 180.7598
             f2 = 0.0603
      sig.level = 0.05
          power = 0.8

v=181 denominator degrees of freedom (about 185 participants in total), which is almost twice as many as we ran!

Chi-squared (\(\chi^2\) ) tests

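A \(\chi^2\) test compares observed proportions against expected proportions, or two sets of proportions against each other. The effect size it uses is w, which can be computed using the ES.w1 function for goodness-of-fit tests or the ES.w2 function for tests of association. Suppose we observed a set of rolls of a die (300 rolls), and we suspect that the die may be biased. How many times would we have to roll it to have a good chance of detecting a bias of the size we observed?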
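The observed rolls are not shown, so the following sketch uses made-up face probabilities (P1 is purely hypothetical) to illustrate how w could be computed for a goodness-of-fit test with ES.w1, and then solves for N using the rounded effect size of 0.15 reported below.

```r
library(pwr)

# Null hypothesis: a fair six-sided die.
P0 <- rep(1 / 6, 6)

# Hypothetical biased die (illustrative values only, not the observed data).
P1 <- c(0.20, 0.15, 0.15, 0.15, 0.15, 0.20)

# Effect size w for a goodness-of-fit test.
w_hyp <- ES.w1(P0, P1)
w_hyp

# Rolls needed to detect w = 0.15 with 80% power; df = 6 faces - 1.
res_chi <- pwr.chisq.test(w = 0.15, df = 5, sig.level = 0.05, power = 0.8)
print(res_chi)
```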

[1] 0.1522035

The effect size is about .15.


     Chi squared power calculation

              w = 0.15
              N = 570.1158
             df = 5
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

This suggests that for a die with the bias that we observed, we’d need at least N=571 rolls to detect the bias 80% of the time, and probably around 1000 rolls to detect it 95% of the time. This might be worthwhile if you were a casino trying to ensure there is no small bias in the dice you purchase that could be used by a player to gain an advantage against the house.

Proportions

If you have accuracy data or some other proportion, you might want to do a power analysis to see how large a sample you need to find a difference. For example, you might have an accuracy of 75% in one condition compared to 80% in another. You use pwr.2p.test for this, but need to use ES.h to calculate the effect size h.
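A sketch of the two steps: ES.h converts the pair of proportions to the effect size h via an arcsine transformation, and pwr.2p.test then solves for the per-group sample size (variable names are illustrative).

```r
library(pwr)

# Effect size h for 80% versus 75% accuracy.
h <- ES.h(0.80, 0.75)
h

# Observations needed per group for 80% power at alpha = .05.
res_2p <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.8)
print(res_2p)
```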

[1] 0.1199023

     Difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.1199023
              n = 1091.896
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: same sample sizes

So, a difference of 5 percentage points might require more than 1000 observations per group to have an 80% chance of being detected. Most two-category data sets will obey this principle. This includes tests of whether gender differs by major, conversion rate in A/B testing, etc.

Other power calculations

The pwr package includes several other power calculation functions that are useful in particular situations, but that we won’t cover here.

Shane T. Mueller shanem@mtu.edu

2019-01-20