301: Type II Error Control
harmonic n (or the harmonic mean of n)
Used to calculate the power you have to detect a given difference between the null and alternative means when the group sizes are unequal. Calculating power post hoc in this way is acceptable only if you want to determine the power you actually had to detect a pre-specified effect size. This is not the same as "estimating" power, i.e., calculating what has come to be called observed power.
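For instance, here is a minimal sketch, assuming a two-sample t-test with unequal group sizes and a pre-specified d of 0.5 (the numbers and the use of scipy are illustrative, not prescribed by these notes); the harmonic mean of the two ns is what enters the non-centrality parameter:

```python
# Post hoc power for a pre-specified d with unequal group sizes,
# using the harmonic mean of the ns (hypothetical numbers).
from math import sqrt
from scipy.stats import t, nct

n1, n2 = 24, 40            # unequal group sizes
d = 0.5                    # pre-specified (not observed) effect size
alpha = 0.05

n_h = 2 * n1 * n2 / (n1 + n2)        # harmonic mean of the two ns
delta = d * sqrt(n_h / 2)            # non-centrality parameter
df = n1 + n2 - 2
t_crit = t.ppf(1 - alpha / 2, df)    # two-tailed rejection boundary

# Power = P(|T| > t_crit) under the noncentral t with parameter delta
power = nct.sf(t_crit, df, delta) + nct.cdf(-t_crit, df, delta)
print(round(power, 3))
```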
Power Analysis
There are two general strategies for analyzing the power of a statistical test: 1) computing a power profile for a given effect size (or a few options) for fixed n, and 2) calculating n for a given effect size (or a few options) and a desired level of power.
type II error
Occurs when the researcher observes a value of the test statistic that is "close" to the value implied by the null hypothesis, and so decides to retain that hypothesis, when, in reality, the alternative (or some other) hypothesis is true. We never know the true value of the parameter, so even though the null hypothesis is false by some amount, we do not (and never will) know that amount. So, how do we protect against type II errors?
statistical power
The probability that the null hypothesis will be rejected if it is false. A higher statistical power means that we can be more confident that a decision to retain the null hypothesis was not made incorrectly, i.e., it corresponds to a lower probability of a type II error.
Familywise Power
It is the problem that comes with running multiple statistical tests on the same data set: the probability of making at least one type II error in the set rises very quickly as the number of tests increases. Various strategies have been proposed for dealing with familywise error control.
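A quick sketch of how fast this grows, assuming (purely for illustration) independent tests that each have power .80:

```python
# P(at least one type II error) across k independent tests,
# each with power .80 (so beta = .20 per test) -- illustrative only.
power_per_test = 0.80
beta = 1 - power_per_test
for k in (1, 3, 5, 10):
    p_at_least_one_miss = 1 - (1 - beta) ** k
    print(k, round(p_at_least_one_miss, 3))
# 1 -> 0.2, 3 -> 0.488, 5 -> 0.672, 10 -> 0.893
```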
Why is it true that if you reject the null hypothesis, a discussion of power is not important, but if you accept the null hypothesis, a discussion of power is important?
If you reject H0, you either 1) made a correct decision, or 2) made an error. If 1), you either had enough power to detect a true alternative hypothesis, or you got lucky and managed to detect a true alternative hypothesis. If 2), you made a type I error, despite having set alpha to a low value. Either way, you didn't make a type II error, and power protects against making type II errors. Conversely, if you accept H0, you have either 1) made a correct decision, either because (1-α) was large enough or because you got lucky, or 2) made an error, specifically, a type II error. And it's these types of errors that power is meant to protect against!
Effect Size (Cohen's d)
What Cohen has done over the course of his work on power is to establish, for popularly employed statistical tests, guidelines for small, medium, and large effect sizes. These are only guidelines and are, admittedly, subjectively determined. Researchers may adopt Cohen's guidelines when they do not have anything else on which to base their effect size specification, but should not do so out of mere convention. Essentially, Cohen has specified "small", "medium", and "large" effect sizes for the different effect size estimates associated with the different types of commonly employed statistical tests.
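For reference, the benchmarks most often attributed to Cohen can be collected in a small lookup table (a sketch; the values are the conventional ones, but whether they suit a given research area is exactly the judgment call these notes warn about):

```python
# Conventional Cohen benchmarks (small / medium / large) for a few
# common effect size measures -- guidelines only, not defaults.
cohen_benchmarks = {
    "d (difference between two means)": {"small": 0.20, "medium": 0.50, "large": 0.80},
    "r (correlation)":                  {"small": 0.10, "medium": 0.30, "large": 0.50},
    "f (one-way ANOVA)":                {"small": 0.10, "medium": 0.25, "large": 0.40},
    "w (chi-square goodness of fit)":   {"small": 0.10, "medium": 0.30, "large": 0.50},
}
print(cohen_benchmarks["d (difference between two means)"]["medium"])  # 0.5
```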
ideal way to play the power analysis game
After controlling for type I error (i.e., setting α to a low value), think carefully about the effect sizes you might reasonably expect if the null is, in fact, false, and also about the degree of "insurance" you want to take out against making a type II error (i.e., the desired level of power), and then determine how many participants you will need in your study in order to achieve that desired level of power. If you do these things, you will have designed a good study and can interpret your results without ambiguity. This does not mean that you can interpret your results with certainty: it's still, and always will be, a probabilistic game you're playing, and you always have the potential of making an incorrect decision (in either direction). You can only minimize the possibility of doing so if you adequately control for both type I and type II errors.
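A minimal sketch of that workflow, assuming an independent-samples t-test, an expected effect of d = 0.5, α = .05, and desired power .80, and using statsmodels purely for illustration:

```python
# Prospective power analysis: fix alpha, pick a plausible effect size,
# choose the desired power, then solve for the sample size per group.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,   # expected d if H0 is false
                                   alpha=0.05,        # type I error control
                                   power=0.80,        # "insurance" against type II error
                                   alternative="two-sided")
print(round(n_per_group))   # about 64 per group
```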
non-centrality parameters
estimate of "departure from the null-value" that takes in to consideration both effect size and n Once we have computed the non-centrality parameter for the particular statistical test, effect size and n, then we simply look up (in power tables) or calculate (with a computer program) the power corresponding to a given α for the non-centrality value we have calculated
Observed Power
Post hoc power analysis is legitimate if the goal is to see how an experimenter could improve his or her study, but lately it has been used for a different purpose, specifically as a means of interpreting a non-significant test result. This approach advocates using the observed parameter estimates as an estimate of the true population effect. Researchers will claim that, if the null hypothesis is not rejected and the power of the test (computed post hoc and on the basis of the "observed effect") was high, then the null is probably true; and that, if the null was not rejected and the power of the test was small, the non-significant results were probably due to the low power, and not to a true null.
Power as a Function of α, Effect Size, Sample Size, and Test
Power is a function of 1) alpha, 2) effect size, 3) sample size, and 4) the particular test to be employed. The lower the number of parameters that have to be estimated in order to conduct the test, the more powerful it is. So, for instance, if you are running a between-groups test of means with 5 groups and you can estimate the 5 population variances with a single value (i.e., some "pooled" variance estimate), that test will be more powerful than one requiring 5 separate variance estimates. As for the effects of α, effect size, and n on power, these arise because power is, in a technical sense, a function of all three: α because it establishes when one rejects the null (which is the condition under which power is defined); effect size because, all else being equal, the further the alternative value of the parameter is from the null-value, the higher the probability of detecting it (i.e., the higher the power); and n because larger n reduces the variability of the sampling distribution of the test statistic under a true alternative.
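A small sketch showing all three influences at once (one-sample t-test via statsmodels; the baseline values are hypothetical): power rises as α is loosened, as the effect size grows, and as n grows.

```python
# Power as a function of alpha, effect size, and sample size
# (one-sample t-test; the baseline is d=0.5, n=30, alpha=0.05).
from statsmodels.stats.power import TTestPower

tt = TTestPower()
baseline = dict(effect_size=0.5, nobs=30, alpha=0.05)
print("baseline      ", round(tt.power(**baseline), 3))
print("looser alpha  ", round(tt.power(effect_size=0.5, nobs=30, alpha=0.10), 3))
print("larger effect ", round(tt.power(effect_size=0.8, nobs=30, alpha=0.05), 3))
print("larger n      ", round(tt.power(effect_size=0.5, nobs=60, alpha=0.05), 3))
```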
Protecting Against Type II Errors
Researchers attempt to control for type II errors by keeping β as small as possible. If we take away β of the distribution, we're left with (1-β); this latter quantity is known as statistical power. As with β, there is not simply a single (1-β), but an infinity of possible (1-β)'s, one for every possible departure from the value of the parameter that is specified in the null hypothesis. So, there is no generic power. Rather, power = (1-β) is the probability of detecting some particular departure from the null hypothesis when one exists. High power implies a low β, i.e., a low probability of committing a type II error over the long run. We desire high power, typically .8 or higher. A power of this magnitude would be interpreted as follows: the probability of detecting a true departure from the value of the parameter specified in the null is .80; or, over repeated sampling, a true departure of size d will be detected 80% of the time. And so the strategy is to minimize β by maximizing power, and thereby control, to the best of our ability, type II error. Because there does not exist "generic" power, and given that there are an infinity of possible ways that the null hypothesis could be false, there exist an infinity of possible alternative distributions that could be true (only one of which will actually be true if the null is false). Given that we can't know which alternative distribution describes the true distribution of the test statistic in the event that the null hypothesis is false, how do we know how much power our test has and, thus, how well we've protected against making a type II error?
how well we've protected against making a type II error?
The first step is to define and quantify what we mean by "departure from the value specified in the null hypothesis." For each particular hypothesis test, "departure" is defined in terms of what is known as effect size. For example, in a test of one sample mean, if we denote the null-value as µ0 and some alternative value as µ1, then one measure of effect size is Cohen's d. d, in this case, is the difference between the null-value of the mean and some other possible value of the mean, in standard deviation units: d = (µ1 − µ0)/σ. For example, d = .7 indicates that the alternative parameter value differs from the null-value by .7 standard deviations. We often do not have a firm sense of what the effect size should be, especially in exploratory domains, and so many researchers employ guidelines that have been provided by a statistician, Cohen, who has written extensively on the topic of power and power analysis.
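A tiny sketch of the definition with made-up numbers (in practice σ is usually unknown and must itself be estimated or assumed):

```python
# Cohen's d for a one-sample test: distance between the alternative mean
# and the null-value, expressed in standard deviation units.
mu0 = 100        # null-value of the mean
mu1 = 107        # a possible alternative value
sigma = 10       # population standard deviation

d = (mu1 - mu0) / sigma
print(d)   # 0.7 -> the alternative sits 0.7 SDs away from the null-value
```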
1) computing a power profile for a given effect size (or a few options) for fixed n
We conduct a power analysis such that we compute power for some reasonably chosen effect size (or Cohen's guidelines re: small, medium, and large effect sizes), a fixed α, and a fixed sample size, n. α is always fixed prior to determining power. This is because we cannot specify a probability associated with rejecting the null hypothesis when it is false (i.e., power) without first establishing when exactly the null hypothesis is rejected, and the latter is established by fixing α. First, we compute, for a particular effect size (or for several), a non-centrality parameter, which is always some function of the effect size and the sample size. Non-centrality parameters (and there will be different types of these parameters for different statistical tests) give us a measure of "distance" between the value of the parameter specified in the null and some possible alternative value, but in units that take into account the particular sample size we have. The larger the sample size, the lower the variability of the sampling distribution of the test statistic. Less variability typically means that the distribution will "shrink" in toward some central value, so areas under the curve corresponding to particular regions on the abscissa will change. And, since areas under the curve correspond to probabilities, the particular probabilities associated with getting specific ranges of values of the statistic will also change (i.e., α, (1-α), β, and (1-β) will change). With a small n, the sampling distribution will have, relatively speaking, quite a bit of variability; with a larger n, the same type of distribution (e.g., a t) will have less variability. Therefore, a given α (e.g., .05) will correspond to different ranges of the statistic (e.g., different critical values of t). Specifically, for the same α, the critical t-values will be much further away from the central value in the distribution associated with the smaller n than in the distribution associated with the larger n.
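A sketch of this first strategy: a power profile across Cohen's small, medium, and large d for a fixed n and α (values assumed for illustration; statsmodels is used in place of power tables):

```python
# Strategy 1: power profile over candidate effect sizes, with n and alpha fixed.
from statsmodels.stats.power import TTestPower

tt = TTestPower()
n, alpha = 25, 0.05
for d in (0.2, 0.5, 0.8):          # Cohen's small / medium / large guidelines
    power = tt.power(effect_size=d, nobs=n, alpha=alpha)
    print(f"d = {d:.1f}  power = {power:.3f}")
```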
n for desired level of power
We again use non-centrality parameters. But now, instead of calculating power from them, we "turn them around" and determine the sample size required in order to adequately protect against making a type II error. That is, we compute the sample size, n, needed for a desired level of power, a given effect size, and a fixed α. We reverse the equations for the non-centrality parameter: for each d and a desired level of power (at least .8), we compute n by determining the value of δ corresponding to the desired degree of power and the given α, filling in d for some reasonable effect size, and solving for n. In this way we can readily determine the sample size required to give the test the desired level of power.
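A minimal sketch of "turning the non-centrality parameter around" by increasing n until the power target is met, assuming a one-sample t-test with d = 0.5, α = .05, and target power .80:

```python
# Strategy 2: find the smallest n whose non-centrality parameter
# (delta = d * sqrt(n)) yields at least the desired power.
from math import sqrt
from scipy.stats import t, nct

d, alpha, target_power = 0.5, 0.05, 0.80

n = 2
while True:
    df = n - 1
    t_crit = t.ppf(1 - alpha / 2, df)
    delta = d * sqrt(n)
    power = nct.sf(t_crit, df, delta) + nct.cdf(-t_crit, df, delta)
    if power >= target_power:
        break
    n += 1
print(n, round(power, 3))   # roughly 34 participants for power of about .80
```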
β
β does not occur under the null distribution, but under some other distribution. Specifically, it occurs under the distribution of the test statistic (whatever it may be) given some alternative (i.e., non-null) value of the parameter in question. There does not exist one generic β for a given test, but an infinity of βs, due to the fact that, although the null hypothesis specifies but one value for the parameter in question (e.g., µ1 − µ2 = 0), the alternative hypothesis specifies an infinity of possible values that the parameter may take on.
determining β
β is defined as P(type II error) = P(accepting H0 | H0 false by some specific amount). If we know the "specific amount" (e.g., the parameter is off by 2.5 raw-score units), then we simply determine, under the particular alternative distribution of the test statistic implied by that "specific amount" of departure, the area under the curve associated with making a decision to accept the null hypothesis. The boundary between acceptance and rejection remains the same for all alternative distributions (each of which is simply slid up or down the abscissa more or less). The point is that there is no generic β, only particular values of β corresponding to particular departures from the value of the parameter specified in the null hypothesis, only one of which will actually be true if the null hypothesis is false. But our decision will always be based on the acceptance/rejection boundaries, which are established once alpha is set.
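A worked sketch, assuming a one-sample z-test in which the null says µ = 50, the true mean is off by 2.5 raw-score units, σ = 10, and n = 25 (all numbers invented for illustration):

```python
# Beta for one specific departure from the null (z-test illustration).
from math import sqrt
from scipy.stats import norm

mu0, departure, sigma, n, alpha = 50.0, 2.5, 10.0, 25, 0.05
se = sigma / sqrt(n)                       # SD of the sampling distribution

# Acceptance-region boundaries are fixed once alpha is set (under the null):
lo = mu0 - norm.ppf(1 - alpha / 2) * se
hi = mu0 + norm.ppf(1 - alpha / 2) * se

# Beta = area of that same acceptance region under the alternative distribution:
mu1 = mu0 + departure
beta = norm.cdf(hi, loc=mu1, scale=se) - norm.cdf(lo, loc=mu1, scale=se)
print(round(beta, 3), round(1 - beta, 3))   # beta and the corresponding power
```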