Soc 106 - CH 6 - Significance Tests

5 parts of a significance test for a proportion

1. Assumptions: Like other tests, this test assumes that the data are obtained using randomization. The sample size must be sufficiently large that the sampling distribution of π̂ is approximately normal. For the most common case, in which the H0 value of π is 0.50, a sample size of at least 20 is sufficient.
2. Hypotheses: The null hypothesis of a test about a population proportion has the form H0: π = π0, such as H0: π = 0.50. Here, π0 denotes a particular proportion value between 0 and 1, such as 0.50. The most common alternative hypothesis is Ha: π ≠ π0, such as Ha: π ≠ 0.50 (p. 152). This two-sided alternative states that the population proportion differs from the value in H0. The one-sided alternatives Ha: π > π0 and Ha: π < π0 apply when the researcher predicts a deviation in a certain direction from the H0 value.
3. Test statistic: From Section 5.2, the sampling distribution of the sample proportion π̂ has mean π and standard error √(π(1 − π)/n). When H0 is true, π = π0, so the standard error is se0 = √(π0(1 − π0)/n). We use the notation se0 because this standard error presumes that H0 is true. The test statistic is z = (π̂ − π0)/se0 (p. 152). This measures the number of standard errors that the sample proportion π̂ falls from π0. When H0 is true, the sampling distribution of the z test statistic is approximately the standard normal distribution. The test statistic has a similar form as in tests for a mean (p. 153).
4. P-value: The P-value is a one- or two-tail probability, as in tests for a mean, except using the standard normal distribution rather than the t distribution. For Ha: π ≠ π0, P is the two-tail probability. See Figure 6.6. This probability is double the single-tail probability beyond the observed z-value. For one-sided alternatives, the P-value is a one-tail probability. Since Ha: π > π0 predicts that the population proportion is larger than the H0 value, its P-value is the probability above (i.e., to the right of) the observed z-value. For Ha: π < π0, the P-value is the probability below (i.e., to the left of) the observed z-value.
5. Conclusion: As usual, the smaller the P-value, the more strongly the data contradict H0 and support Ha. When we need to make a decision, we reject H0 if P ≤ α for a prespecified α-level such as 0.05.
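
The five parts above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own code: the function name `proportion_z_test` and its `alternative` argument are hypothetical, and the numbers plugged in (624 of 1200, H0: π = 0.50) are the ones from Example 6.6 in these notes.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, pi0, alternative="two-sided"):
    """Return (z, p_value) for H0: pi = pi0 (hypothetical helper)."""
    pi_hat = successes / n
    se0 = sqrt(pi0 * (1 - pi0) / n)   # standard error presuming H0 is true
    z = (pi_hat - pi0) / se0          # standard errors between pi_hat and pi0
    std_normal = NormalDist()         # standard normal, used instead of t
    if alternative == "two-sided":    # Ha: pi != pi0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":    # Ha: pi > pi0, right-tail probability
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # Ha: pi < pi0, left-tail probability
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p_value

# Example 6.6: 624 of 1200 respondents said "raise taxes".
z, p = proportion_z_test(624, 1200, 0.50)
print(f"z = {z:.2f}, two-sided P = {p:.2f}")
```

With these inputs the two-sided P-value comes out at about 0.17, so H0 would not be rejected at α = 0.05.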

THE α-LEVEL: USING THE P-VALUE TO MAKE A DECISION

A significance test analyzes the strength of the evidence against the null hypothesis, H0. We start by presuming that H0 is true. We analyze whether the data would be unusual if H0 were true by finding the P-value. If the P-value is small, the data contradict H0 and support Ha. Generally, researchers do not regard the evidence against H0 as strong unless P is very small, say, P ≤ 0.05 or P ≤ 0.01. Why do smaller P-values indicate stronger evidence against H0? Because the data would then be more unusual if H0 were true. When H0 is true, the P-value is roughly equally likely to fall anywhere between 0 and 1. By contrast, when H0 is false, the P-value is more likely to be near 0 than near 1. Sometimes we need to decide whether the evidence against H0 is strong enough to reject it. We base the decision on whether the P-value falls below a prespecified cutoff point. For example, we could reject H0 if P ≤ 0.05 and conclude that the evidence is not strong enough to reject H0 if P > 0.05. The boundary value 0.05 is called the α-level of the test.

limitations of significance tests

A significance test makes an inference about whether a parameter differs from the H0 value and about its direction from that value. In practice, we also want to know whether the parameter is sufficiently different from the H0 value to be practically important. In this section, we'll learn that a test does not tell us as much as a confidence interval about practical importance.

significance test

A statistical significance test uses data to summarize the evidence about a hypothesis. It does this by comparing point estimates of parameters to the values predicted by the hypothesis. The following example illustrates concepts behind significance tests. Ex. 6.1 (p. 139): How could the women employees statistically back up their assertion? Suppose the employee pool for potential selection for management training is half male and half female. Then, the company's claim of a lack of gender bias is a hypothesis. It states that, other things being equal, at each choice the probability of selecting a female equals 1/2 and the probability of selecting a male equals 1/2. If the employees truly are selected for management training randomly in terms of gender, about half the employees picked should be female and about half should be male. The women's claim is an alternative hypothesis that the probability of selecting a male exceeds 1/2. Suppose that 9 of the 10 employees chosen for management training were male. We might be inclined to believe the women's claim. However, we should analyze whether these results would be unlikely if there were no gender bias. Would it be highly unusual that 9/10 of the employees chosen would have the same gender if they were truly selected at random from the employee pool? Due to sampling variation, not exactly 1/2 of the sample need be male. How far above 1/2 must the sample proportion of males chosen be before we believe the women's claim? This chapter introduces statistical methods for summarizing evidence and making decisions about hypotheses. We first present the parts that all significance tests have in common. The rest of the chapter presents significance tests about population means and population proportions. We'll also learn how to find and how to control the probability of an incorrect decision about a hypothesis.

TYPE I AND TYPE II ERRORS FOR DECISIONS

Because of sampling variability, decisions in tests always have some uncertainty. The decision could be erroneous. The two types of potential errors are conventionally called Type I and Type II errors. When H0 is true, a Type I error occurs if H0 is rejected. When H0 is false, a Type II error occurs if H0 is not rejected. The two possible decisions cross-classified with the two possibilities for whether H0 is true generate four possible results. See Table 6.8.

CORRESPONDENCE BETWEEN TWO-SIDED TESTS AND CONFIDENCE INTERVALS

Conclusions using two-sided significance tests are consistent with conclusions using confidence intervals. If a test says that a particular value is believable for the parameter, then so does a confidence interval. Confidence Interval for Mean Political Ideology (ex., p. 147): For the data in Example 6.2, let's construct a 95% confidence interval for the Hispanic population mean political ideology. With df = 368, the multiple of the standard error (se = 0.0697) is t.025 = 1.966. Since ȳ = 4.089, the confidence interval is ȳ ± t.025(se) = 4.089 ± 1.966(0.0697) (refer to pic).
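
The interval arithmetic can be checked directly. The short sketch below only plugs in the values quoted in this card (ȳ = 4.089, se = 0.0697, t.025 = 1.966); the variable names are illustrative.

```python
# Values quoted in the card for the Hispanic sample (Example 6.2).
y_bar = 4.089    # sample mean political ideology
se = 0.0697      # estimated standard error
t_crit = 1.966   # t.025 for df = 368, as given in the text

margin = t_crit * se
lower, upper = y_bar - margin, y_bar + margin
print(f"95% CI: ({lower:.2f}, {upper:.2f})")

# The H0 value 4.0 lies inside the interval, which agrees with a
# two-sided test that does not reject H0 at the 0.05 level.
contains_null = lower <= 4.0 <= upper
print("Interval contains 4.0:", contains_null)
```

The interval runs from roughly 3.95 to 4.23, so the moderate score of 4.0 is among the believable values.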

Hypotheses

Each significance test has two hypotheses about the value of a population parameter. Null Hypothesis & Alternative Hypothesis: The null hypothesis, denoted by the symbol H0, is a statement that the parameter takes a particular value. The alternative hypothesis, denoted by Ha, states that the parameter falls in some alternative range of values. Usually the value in H0 corresponds, in a certain sense, to no effect. The values in Ha then represent an effect of some type. In Example 6.1 about possible gender discrimination in selecting management trainees, let π denote the probability that any particular selection is a male. The company claims that π = 1/2. This is an example of a null hypothesis, no effect referring to a lack of gender bias. The alternative hypothesis reflects the skeptical women employees' belief that this probability actually exceeds 1/2. So, the hypotheses are H0: π = 1/2 and Ha: π > 1/2. Note that H0 has a single value whereas Ha has a range of values. A significance test analyzes the sample evidence about H0 by investigating whether the data contradict H0, hence suggesting that Ha is true. The approach taken is the indirect one of proof by contradiction. The null hypothesis is presumed to be true. Under this presumption, if the data observed would be very unusual, the evidence supports the alternative hypothesis. In the study of potential gender discrimination, we presume that H0: π = 1/2 is true. Then we determine whether the sample result of 9 men selected for management training in 10 choices would be unusual, under this presumption. If so, then we may be inclined to believe the women's claim. But, if the difference between the sample proportion of men chosen (9/10) and the H0 value of 1/2 could easily be due to ordinary sampling variability, there's not enough evidence to accept the women's claim.
A researcher usually conducts a test to gauge the amount of support for the alternative hypothesis, as that typically reflects an effect that he or she predicts. Thus, Ha is sometimes called the research hypothesis. The hypotheses are formulated before collecting or analyzing the data.

significance test for a mean

For quantitative variables, significance tests usually refer to population means. The five parts of the significance test for a single mean follow: - p. 143

IMPLICIT ONE-SIDED H0 FOR ONE-SIDED Ha

From Example 6.4, the one-sided P-value = 0.018. So, if μ = 0, the probability equals 0.018 of observing a sample mean weight gain of 3.01 or greater. Now, suppose μ < 0; that is, the population mean weight change is negative. Then, the probability of observing ȳ ≥ 3.01 would be even smaller than 0.018. For example, a sample value of ȳ = 3.01 is even less likely when μ = −5 than when μ = 0, since 3.01 is farther out in the tail of the sampling distribution of ȳ when μ = −5 than when μ = 0. Thus, rejection of H0: μ = 0 in favor of Ha: μ > 0 also inherently rejects the broader null hypothesis of H0: μ ≤ 0. In other words, one concludes that μ = 0 is false and that μ < 0 is false (p. 149).

sample problem (significance test for the mean) p. 145

General Social Survey. For instance, that survey asks where you would place yourself on a seven-point scale of political views ranging from extremely liberal, point 1, to extremely conservative, point 7. Table 6.2 shows the scale and the distribution of responses among the levels for the 2014 survey. Results are shown separately according to subjects classified as white, black, or Hispanic. Political ideology is an ordinal scale. Often, we treat such scales in a quantitative manner by assigning scores to the categories. Then we can use quantitative summaries such as means, allowing us to detect the extent to which observations gravitate toward the conservative or the liberal end of the scale. If we assign the category scores shown in Table 6.2, then a mean below 4 shows a propensity toward liberalism and a mean above 4 shows a propensity toward conservatism. We can test whether these data show much evidence of either of these by conducting a significance test about how the population mean compares to the moderate value of 4. We'll do this here for the Hispanic sample and in Section 6.5 for the entire sample.
1. Assumptions: The sample is randomly selected. We are treating political ideology as quantitative with equally spaced scores. The t test assumes a normal population distribution for political ideology, which seems inappropriate because the measurement of political ideology is discrete. We'll discuss this assumption further at the end of this section.
2. Hypotheses: Let μ denote the population mean ideology for Hispanic Americans, for this seven-point scale. The null hypothesis contains one specified value for μ. Since we conduct the analysis to check how, if at all, the population mean departs from the moderate response of 4, the null hypothesis is H0: μ = 4.0. The alternative hypothesis is then Ha: μ ≠ 4.0 (p. 146). The null hypothesis states that, on the average, the population response is politically "moderate, middle of road." The alternative states that the mean falls in the liberal direction (μ < 4.0) or in the conservative direction (μ > 4.0).
3. Test statistic: The 369 observations in Table 6.2 for Hispanics are summarized by ȳ = 4.089 and s = 1.339. The estimated standard error of the sampling distribution of ȳ is se = s/√n = 1.339/√369 = 0.0697. The value of the test statistic is t = (ȳ − μ0)/se = (4.089 − 4.0)/0.0697 = 1.283. The sample mean falls 1.283 estimated standard errors above the null hypothesis value of the mean. The df = 369 − 1 = 368.
4. P-value: The P-value is the two-tail probability, presuming H0 is true, that t would exceed 1.283 in absolute value. From the t distribution with df = 368, this two-tail probability is P = 0.20. If the population mean ideology were 4.0, then the probability equals 0.20 that a sample mean for n = 369 subjects would fall at least as far from 4.0 as the observed ȳ of 4.089.
5. Conclusion: The P-value of P = 0.20 is not very small, so it does not contradict H0. If H0 were true, the data we observed would not be unusual. It is plausible that the population mean response for Hispanic Americans in 2014 was 4.0, not leaning in the conservative or liberal direction.
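
The steps of this sample problem can be reproduced from the summary statistics alone. The sketch below is an illustration under stated assumptions: it uses only the standard library, so the two-tail t probability is obtained by numerically integrating the t density with Simpson's rule (the helper names `t_pdf` and `t_two_tail` are hypothetical). From the rounded summaries the statistic comes out near 1.28 rather than the 1.283 quoted, presumably because the text worked from unrounded values.

```python
from math import sqrt, lgamma, exp, log, pi

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_two_tail(t, df, steps=2000):
    """P(|T| >= |t|), integrating the density on [0, |t|] by Simpson's rule."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # P(0 <= T <= |t|)
    return 1 - 2 * central

# Summary statistics quoted for the Hispanic sample.
y_bar, s, n, mu0 = 4.089, 1.339, 369, 4.0

se_mean = s / sqrt(n)                  # estimated standard error
t_stat = (y_bar - mu0) / se_mean       # test statistic
df = n - 1
p_two_sided = t_two_tail(t_stat, df)

print(f"se = {se_mean:.4f}, t = {t_stat:.2f}, df = {df}, P = {p_two_sided:.2f}")
```

The two-sided P-value rounds to 0.20, matching the conclusion that the data do not contradict H0: μ = 4.0.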

Form of Test Statistic in Test for a Proportion

Here, the estimate π̂ of the proportion replaces the estimate ȳ of the mean, and the null hypothesis proportion π0 replaces the null hypothesis mean μ0. Note that in the standard error formula, √(π(1 − π)/n), we substitute the null hypothesis value π0 for the population proportion π. The parameter values in sampling distributions for tests presume that H0 is true, since the P-value is based on that presumption. This is why, for tests, we use se0 = √(π0(1 − π0)/n) rather than the estimated standard error, se = √(π̂(1 − π̂)/n). If we instead used the estimated se, the normal approximation for the sampling distribution of z would be poorer. This is especially true for proportions close to 0 or 1. By contrast, the confidence interval method does not have a hypothesized value for π, so that method uses the estimated se rather than an H0 value (p. 153).

NEVER "ACCEPT H0"

In Example 6.6 about raising taxes or reducing services, the P-value of 0.17 was not small. So, H0: π = 0.50 is plausible. In this case, the conclusion is sometimes reported as "Do not reject H0," since the data do not contradict H0. It is better to say "Do not reject H0" than "Accept H0." The population proportion has many plausible values besides the number in H0. For instance, the software output for that example reports a 95% confidence interval for the population proportion π as (0.49, 0.55). This interval shows a range of plausible values for π. Even though insufficient evidence exists to conclude that π ≠ 0.50, it is improper to conclude that π = 0.50. In summary, H0 contains a single value for the parameter. When the P-value is larger than the α-level, saying "Do not reject H0" instead of "Accept H0" emphasizes that that value is merely one of many believable values. Because of sampling variability, there is a range of believable values, so we can never accept H0. The reason "accept Ha" terminology is permissible for Ha is that when the P-value is sufficiently small, the entire range of believable values for the parameter falls within the range of values that Ha specifies.

EFFECT OF SAMPLE SIZE ON P-VALUES

In Example 6.6 on raising taxes or cutting services, suppose π̂ = 0.52 had been based on n = 4800 instead of n = 1200. The standard error then decreases to 0.0072 (half as large), and you can verify that the test statistic z = 2.77. This has two-sided P-value = 0.006. That P-value provides strong evidence against H0: π = 0.50 and suggests that a majority support raising taxes rather than cutting services. In that case, though, the 95% confidence interval for π equals (0.506, 0.534). This indicates that π is quite close to 0.50 in practical terms. A given difference between an estimate and the H0 value has a smaller P-value as the sample size increases. The larger the sample size, the more certain we can be that sample deviations from H0 are indicative of true population deviations. In particular, notice that even a small departure between π̂ and π0 (or between ȳ and μ0) can yield a small P-value if the sample size is very large.
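
The "you can verify" step can be done with a short script. This is a sketch using the normal approximation described in these notes; the helper name `two_sided_p` is illustrative, and the inputs (π̂ = 0.52, π0 = 0.50, n = 1200 and 4800) come from the card.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(pi_hat, pi0, n):
    """Return (z, two-sided P-value) for H0: pi = pi0 (hypothetical helper)."""
    se0 = sqrt(pi0 * (1 - pi0) / n)   # null standard error shrinks as n grows
    z = (pi_hat - pi0) / se0
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Same sample proportion, two different sample sizes.
for n in (1200, 4800):
    z_n, p_n = two_sided_p(0.52, 0.50, n)
    print(f"n = {n}: z = {z_n:.2f}, P = {p_n:.3f}")
```

Quadrupling n halves the standard error and doubles z, which moves the P-value from about 0.17 to about 0.006 even though the estimate 0.52 has not changed.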

AS P(TYPE I ERROR) GOES DOWN, P(TYPE II ERROR) GOES UP

In an ideal world, Type I or Type II errors would not occur. However, errors do happen. We've all read about defendants who were convicted but later determined to be innocent. When we make a decision, why don't we use an extremely small P(Type I error), such as α = 0.000001? For instance, why don't we make it almost impossible to convict someone who is really innocent? When we make α smaller in a significance test, we need a smaller P-value to reject H0. It then becomes harder to reject H0. But this means that it will also be harder even if H0 is false. The stronger the evidence required to convict someone, the more likely we will fail to convict defendants who are actually guilty. In other words, the smaller we make P(Type I error), the larger P(Type II error) becomes, that is, failing to reject H0 even though it is false. If we tolerate only an extremely small P(Type I error), such as α = 0.000001, the test may be unlikely to reject H0 even if it is false (for instance, unlikely to convict someone even if they are guilty). This reasoning reflects the fundamental relation: The smaller P(Type I error) is, the larger P(Type II error) is. Section 6.6 shows that P(Type II error) depends on just how far the true parameter value falls from H0. If the parameter is nearly equal to the value in H0, P(Type II error) is relatively high. If it falls far from H0, P(Type II error) is relatively low. The farther the parameter falls from the H0 value, the less likely the sample is to result in a Type II error. For a fixed P(Type I error), P(Type II error) depends also on the sample size n. The larger the sample size, the more likely we are to reject a false H0. To keep both P(Type I error) and P(Type II error) at low levels, it may be necessary to use a very large sample size. The P(Type II error) may be quite large when the sample size is small, unless the parameter falls quite far from the H0 value.
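
The trade-off can be made concrete with a rough calculation. The sketch below is an illustration, not the method of Section 6.6: it assumes a one-sided test of H0: π = 0.50 against Ha: π > 0.50, a hypothetical true value π = 0.55, hypothetical sample sizes, and the normal approximation for π̂. The function name `type_ii_error` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, n, pi0=0.50, pi_true=0.55):
    """Approximate P(Type II error) for Ha: pi > pi0 (hypothetical setup)."""
    std_normal = NormalDist()
    se0 = sqrt(pi0 * (1 - pi0) / n)
    # Smallest sample proportion that leads to rejecting H0 at level alpha.
    cutoff = pi0 + std_normal.inv_cdf(1 - alpha) * se0
    # Probability that pi_hat falls below the cutoff when pi = pi_true.
    se_true = sqrt(pi_true * (1 - pi_true) / n)
    return std_normal.cdf((cutoff - pi_true) / se_true)

for alpha in (0.05, 0.01, 0.000001):
    print(f"n = 100, alpha = {alpha}: P(Type II) = {type_ii_error(alpha, 100):.3f}")

# Holding alpha fixed, a larger sample lowers P(Type II error).
for n in (100, 1000):
    print(f"alpha = 0.05, n = {n}: P(Type II) = {type_ii_error(0.05, n):.3f}")
```

Under these assumptions, shrinking α pushes P(Type II error) toward 1, while raising n at a fixed α pulls it back down.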

THE CHOICE OF ONE-SIDED VERSUS TWO-SIDED TESTS

In practice, two-sided tests are more common than one-sided tests. Even if a researcher predicts the direction of an effect, two-sided tests can also detect an effect that falls in the opposite direction. In most research articles, significance tests use two-sided P-values. Partly this reflects an objective approach to research that recognizes that an effect could go in either direction. In using two-sided P-values, researchers avoid the suspicion that they chose Ha when they saw the direction in which the data occurred. Choosing Ha after seeing the data is not ethical. Two-sided tests coincide with the usual approach in estimation. Confidence intervals are two-sided, obtained by adding and subtracting some quantity from the point estimate. One can form one-sided confidence intervals, for instance, having 95% confidence that a population mean weight change is at least equal to 0.8 pounds (i.e., between 0.8 and ∞), but in practice one-sided intervals are rarely used. In deciding whether to use a one-sided or a two-sided Ha in a particular exercise or in practice, consider the context. An exercise that says "Test whether the mean has changed" suggests a two-sided alternative, to allow for increase or decrease. "Test whether the mean has increased" suggests the one-sided Ha: μ > μ0. In either the one-sided or two-sided case, hypotheses always refer to population parameters, not sample statistics. So, never express a hypothesis using sample statistic notation, such as H0: ȳ = 0. There is no uncertainty or need to conduct statistical inference about sample statistics such as ȳ, because we can calculate their values exactly from the data.

Hypothesis

In statistics, a hypothesis is a statement about a population. It takes the form of a prediction that a parameter takes a particular numerical value or falls in a certain range of values. Examples of hypotheses are the following: "For restaurant managerial employees, the mean salary is the same for women and for men"; "There is no difference between Democrats and Republicans in the probabilities that they vote with their party leadership"; and "A majority of adult Canadians are satisfied with their national health service."

Significance Test Decisions and Confidence Intervals

In testing H0: μ = μ0 against Ha: μ ≠ μ0, when we reject H0 at the 0.05 α-level, the 95% confidence interval for μ does not contain μ0. The 95% confidence interval consists of those μ0 values for which we do not reject H0 at the 0.05 α-level (p. 158).

STATISTICAL SIGNIFICANCE VERSUS PRACTICAL SIGNIFICANCE

It is important to distinguish between statistical significance and practical significance. A small P-value, such as P = 0.001, is highly statistically significant. It provides strong evidence against H0. It does not, however, imply an important finding in any practical sense. The small P-value merely means that if H0 were true, the observed data would be very unusual. It does not mean that the true parameter value is far from H0 in practical terms. Ex. (p. 159): The two-sided P-value is P = 0.0001. There is very strong evidence that the true mean exceeds 4.0, that is, that the true mean falls on the conservative side of moderate. But, on a scale of 1.0 to 7.0, 4.108 is close to the moderate score of 4.0. Although the difference of 0.108 between the sample mean of 4.108 and the H0 mean of 4.0 is highly significant statistically, the magnitude of this difference is very small in practical terms. The mean response on political ideology for all Americans is essentially a moderate one. A way of summarizing practical significance is to measure the effect size by the number of standard deviations (not standard errors) that ȳ falls from μ0. In this example, the estimated effect size is (4.108 − 4.0)/1.425 = 0.08. This is a tiny effect. Whether a particular effect size is small, medium, or large depends on the substantive context, but an effect size of about 0.2 or less in absolute value is usually not practically important (p. 160).
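
The effect-size arithmetic is small enough to check by hand, but a sketch makes the distinction from the test statistic explicit: the divisor is the standard deviation s, not the standard error. The function name `effect_size` is illustrative; the inputs are the values quoted in this card.

```python
def effect_size(y_bar, mu0, s):
    """Number of standard deviations (not standard errors) between y_bar and mu0."""
    return (y_bar - mu0) / s

# Values quoted for the entire sample: mean 4.108, H0 value 4.0, s = 1.425.
d = effect_size(4.108, 4.0, 1.425)
print(f"estimated effect size = {d:.2f}")

# Rule of thumb from the text: about 0.2 or less is usually not practically important.
print("practically small:", abs(d) <= 0.2)
```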

the 5 parts of a significance test

Now let's take a closer look at the significance test method, also called a hypothesis test, or test for short. All tests have five parts: Assumptions, Hypotheses, Test statistic, P-value, Conclusion.
ASSUMPTIONS: Each test makes certain assumptions or has certain conditions for the test to be valid. These pertain to
- Type of data: Like other statistical methods, each test applies for either quantitative data or categorical data.
- Randomization: Like other methods of statistical inference, a test assumes that the data gathering employed randomization, such as a random sample.
- Population distribution: Some tests assume that the variable has a particular probability distribution, such as the normal distribution.
- Sample size: Many tests employ an approximate normal or t sampling distribution. The approximation is adequate for any n when the population distribution is approximately normal, but it also holds for highly nonnormal populations when the sample size is relatively large, by the Central Limit Theorem.

conclusion

The P-value summarizes the evidence against H0. Our conclusion should also interpret what the P-value tells us about the question motivating the test. Sometimes it is necessary to make a decision about the validity of H0. If the P-value is sufficiently small, we reject H0 and accept Ha. Most studies require very small P-values, such as P ≤ 0.05, in order to reject H0. In such cases, results are said to be significant at the 0.05 level. This means that if H0 were true, the chance of getting such extreme results as in the sample data would be no greater than 0.05. Making a decision by rejecting or not rejecting a null hypothesis is an optional part of the significance test. We defer discussion of it until Section 6.4.

MAKING DECISIONS VERSUS REPORTING THE P-VALUE

The approach to hypothesis testing that incorporates a formal decision with a fixed P(Type I error) was developed by the statisticians Jerzy Neyman and Egon Pearson in the late 1920s and early 1930s. In summary, this approach formulates null and alternative hypotheses, selects an α-level for the P(Type I error), determines the rejection region of test statistic values that provide enough evidence to reject H0, and then makes a decision about whether to reject H0 according to what is actually observed for the test statistic value. With this approach, it's not even necessary to find a P-value. The choice of α-level determines the rejection region, which together with the test statistic determines the decision. The alternative approach of finding a P-value and using it to summarize evidence against a hypothesis is due to the great British statistician R. A. Fisher. He advocated merely reporting the P-value rather than using it to make a formal decision about H0. Over time, this approach has gained favor, especially since software can now report precise P-values for a wide variety of significance tests. This chapter has presented an amalgamation of the two approaches (the decision-based approach using an α-level and the P-value approach), so you can interpret a P-value yet also know how to use it to make a decision when that is needed.

REJECTION REGIONS: STATISTICALLY SIGNIFICANT TEST STATISTIC VALUES

The collection of test statistic values for which the test rejects H0 is called the rejection region. For example, the rejection region for a test of level α = 0.05 is the set of test statistic values for which P ≤ 0.05. For two-sided tests about a proportion, the two-tail P-value is ≤ 0.05 whenever the test statistic |z| ≥ 1.96. In other words, the rejection region consists of values of z resulting from the estimate falling at least 1.96 standard errors from the H0 value.
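
The 1.96 boundary follows from the standard normal distribution, and the same calculation gives the boundary for any other α. A minimal sketch, with hypothetical helper names, for the two-sided z test:

```python
from statistics import NormalDist

def two_sided_critical_z(alpha):
    """Boundary of the two-sided rejection region for a z test."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def in_rejection_region(z, alpha=0.05):
    """True when |z| is at least the critical value, i.e. when P <= alpha."""
    return abs(z) >= two_sided_critical_z(alpha)

print(f"alpha = 0.05: reject when |z| >= {two_sided_critical_z(0.05):.2f}")
print(f"alpha = 0.01: reject when |z| >= {two_sided_critical_z(0.01):.2f}")
print("z = 2.77 rejects at 0.05:", in_rejection_region(2.77))
print("z = 1.39 rejects at 0.05:", in_rejection_region(1.39))
```

Checking whether z lands in the rejection region gives the same decision as comparing the P-value with α, which is why the decision-based approach does not need the P-value itself.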

Test Statistic

The parameter to which the hypotheses refer has a point estimate. The test statistic summarizes how far that estimate falls from the parameter value in H0. Often this is expressed by the number of standard errors between the estimate and the H0 value.

-------

The predictions, which often result from the theory that drives the research, are hypotheses about the study population.

ROBUSTNESS FOR VIOLATIONS OF NORMALITY ASSUMPTION

The t test for a mean assumes that the population distribution is normal. This ensures that the sampling distribution of the sample mean ȳ is normal (even for small n) and, after using s to estimate σ in finding the se, the t test statistic has the t distribution. As n increases, this assumption of a normal population becomes less important. We've seen that when n is roughly about 30 or higher, an approximate normal sampling distribution occurs for ȳ regardless of the population distribution, by the Central Limit Theorem. From Section 5.3 (page 113), a statistical method is robust if it performs adequately even when an assumption is violated. Two-sided inferences for a mean using the t distribution are robust against violations of the normal population assumption. Even if the population is not normal, two-sided t tests and confidence intervals still work quite well. The test does not work so well for a one-sided test with small n when the population distribution is highly skewed. Ex. 6.5 (p. 151, refer to pic): As just mentioned, a two-sided t test works quite well even if the population distribution is skewed. However, this plot makes us wary about using a one-sided test, since the sample size is not large (n = 29). Given this and the discussion in the previous subsection about one-sided versus two-sided tests, we're safest with that study to report a two-sided P-value of 0.035. Also, the median may be a more relevant summary for these data.

alpha-level

The α-level is a number such that we reject H0 if the P-value is less than or equal to it. The α-level is also called the significance level. In practice, the most common α-levels are 0.05 and 0.01. Like the choice of a confidence level for a confidence interval, the choice of α reflects how cautious you want to be. The smaller the α-level, the stronger the evidence must be to reject H0. To avoid bias in the decision-making process, you select α before analyzing the data. Example 6.5 (p. 150): A conclusion to reject H0 at α = 0.05 is sometimes phrased as "The increase in mean weight is statistically significant at the 0.05 level." Since P = 0.018 is not less than 0.010, the result is not statistically significant at the 0.010 level. In fact, the P-value is the smallest level for α at which the results are statistically significant.

example 6.6 - Significance test for a proportion - p. 153

These days, whether at the local, state, or national level, government often faces the problem of not having enough money to pay for the various services that it provides. One way to deal with this problem is to raise taxes. Another way is to reduce services. Which would you prefer? When the Florida Poll recently asked a random sample of 1200 Floridians, 52% (624 of the 1200) said raise taxes and 48% said reduce services.

P-VALUE

To interpret a test statistic value, we create a probability summary of the evidence against H0. This uses the sampling distribution of the test statistic, under the presumption that H0 is true. The purpose is to summarize how unusual the observed test statistic value is compared to what H0 predicts. Specifically, if the test statistic falls well out in a tail of the sampling distribution in a direction predicted by Ha, then it is far from what H0 predicts. We can summarize how far out in the tail the test statistic falls by the tail probability of that value and of more extreme values. These are the possible test statistic values that provide at least as much evidence against H0 as the observed test statistic, in the direction predicted by Ha. This probability is called the P-value. P-value (def.): The P-value is the probability that the test statistic equals the observed value or a value even more extreme in the direction predicted by Ha. It is calculated by presuming that H0 is true. The P-value is denoted by P. A small P-value (such as P = 0.01) means that the data we observed would have been unusual if H0 were true. The smaller the P-value, the stronger the evidence is against H0. For Example 6.1 on potential gender discrimination in choosing managerial trainees, π is the probability of selecting a male. We test H0: π = 1/2 against Ha: π > 1/2. One possible test statistic is the sample proportion of males selected, which is 9/10 = 0.90. The values for the sample proportion that provide this much or even more extreme evidence against H0: π = 1/2 and in favor of Ha: π > 1/2 are the right-tail sample proportion values of 0.90 and higher. See Figure 6.1. A formula from Section 6.7 calculates this probability as 0.01, so the P-value equals P = 0.01. If the selections truly were random with respect to gender, the probability is only 0.01 of such an extreme sample result, namely, that 9 or all 10 selections would be males.
Other things being equal, this small P-value provides considerable evidence against H0: π = 1/2 and supporting the alternative Ha: π > 1/2 of discrimination against females. By contrast, a moderate to large P-value means the data are consistent with H0. A P-value such as 0.26 or 0.83 indicates that, if H0 were true, the observed data would not be unusual.
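
The 0.01 figure for Example 6.1 can be reproduced exactly. The card does not show the Section 6.7 formula, so the sketch below assumes it is the binomial distribution with n = 10 and π = 1/2; the function name `binomial_upper_tail` is illustrative.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the right-tail P-value for Ha: pi > p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Example 6.1: 9 of 10 trainees were male; H0: pi = 1/2, Ha: pi > 1/2.
p_value = binomial_upper_tail(9, 10, 0.5)
print(f"P(9 or 10 males | pi = 1/2) = {p_value:.3f}")
```

The result is 11/1024, about 0.011, which rounds to the P = 0.01 quoted above.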

SIGNIFICANCE TESTS AND P-VALUES CAN BE MISLEADING

We've seen it is improper to "accept H0." We've also seen that statistical significance does not imply practical significance. Here are other ways that results of significance tests can be misleading:
It is misleading to report results only if they are statistically significant. Some research journals have the policy of publishing results of a study only if the P-value ≤ 0.05. Here's a danger of this policy: Suppose there truly is no effect, but 20 researchers independently conduct studies. We would expect about 20(0.05) = 1 of them to obtain significance at the 0.05 level merely by chance. (When H0 is true, about 5% of the time we get a P-value below 0.05 anyway.) If that researcher then submits results to a journal but the other 19 researchers do not, the article published will be a Type I error. It will report an effect when there really is not one.
Some tests may be statistically significant just by chance. You should never scan software output for results that are statistically significant and report only those. If you run 100 tests, even if all the null hypotheses are correct, you would expect to get P-values ≤ 0.05 about 100(0.05) = 5 times. Be skeptical of reports of significance that might merely reflect ordinary random variability.
It is incorrect to interpret the P-value as the probability that H0 is true. The P-value is P(test statistic takes value like observed or even more extreme), presuming that H0 is true. It is not P(H0 true). Classical statistical methods calculate probabilities about variables and statistics (such as test statistics) that vary randomly from sample to sample, not about parameters. Statistics have sampling distributions; parameters do not. In reality, H0 is not a matter of probability. It is either true or not true. We just don't know which is the case.
True effects are often smaller than reported estimates. Even if a statistically significant result is a real effect, the true effect may be smaller than reported. For example, often several researchers perform similar studies, but the results that receive attention are the most extreme ones. The researcher who decides to publicize the result may be the one who got the most impressive sample result, perhaps way out in the tail of the sampling distribution of all the possible results. See Figure 6.8 (p. 162).
The moral is to be skeptical when you hear reports of new medical advances. The true effect may be weaker than reported, or there may actually be no effect at all. Related to this is the publication bias that occurs when results of some studies never appear in print because they did not obtain a small enough P-value to seem important. One investigation of this reported that 94% of medical studies that had positive results found their way into print whereas only 14% of those with disappointing or uncertain results did.
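
The arithmetic behind 20(0.05) = 1 and 100(0.05) = 5 can be sketched in a few lines of Python. This is only an illustration, assuming the tests are independent and every null hypothesis is true; the function names are made up for this note, and the "at least one" probability is an extra calculation that the text above does not state.

```python
# Expected number of chance "significant" results when every H0 is true.
# Assumption: the tests are independent and all use the same alpha-level.

def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected count of tests with P <= alpha when all null hypotheses hold."""
    return num_tests * alpha


def prob_at_least_one(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of the independent tests reaches significance."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (20, 100):
    print(
        num_tests,
        expected_false_positives(num_tests),
        round(prob_at_least_one(num_tests), 3),
    )
```

With 20 studies, the chance that at least one reaches significance by luck alone is already well above one half, which is why publishing only the "significant" study is so misleading.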

SIGNIFICANCE TESTS ARE LESS USEFUL THAN CONFIDENCE INTERVALS

We've seen that, with large sample sizes, P-values can be small even when the point estimate falls near the H0 value. The size of P merely summarizes the extent of evidence about H0, not how far the parameter falls from H0. Always inspect the difference between the estimate and the H0 value to gauge the practical implications of a test result.
Null hypotheses containing single values are rarely true. That is, rarely is the parameter exactly equal to the value listed in H0. With sufficiently large samples, so that a Type II error is unlikely, these hypotheses will normally be rejected. What is more relevant is whether the parameter is sufficiently different from the H0 value to be of practical importance.
Although significance tests can be useful, most statisticians believe they are overemphasized in social science research. It is preferable to construct confidence intervals for parameters instead of performing only significance tests. A test merely indicates whether the particular value in H0 is plausible. It does not tell us which other potential values are plausible. The confidence interval, by contrast, displays the entire set of believable values. It shows the extent to which reality may differ from the parameter value in H0 by showing whether the values in the interval are far from the H0 value. Thus, it helps us to determine whether rejection of H0 has practical importance.
When a P-value is not small but the confidence interval is quite wide, this forces us to realize that the parameter might well fall far from H0 even though we cannot reject it. This also supports why it does not make sense to "accept H0," as we discussed on page 155.
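
A small sketch may help show how a tiny P-value can coexist with an estimate that sits close to the H0 value. The numbers below (a sample proportion of 0.51 from n = 40,000) are hypothetical and not from the text, and the interval uses the usual large-sample formula for a proportion with the estimated standard error, which is assumed here rather than quoted from these notes.

```python
from math import sqrt
from statistics import NormalDist


def proportion_test_and_ci(p_hat, pi0, n, conf_z=1.96):
    """Two-sided z test of H0: pi = pi0 plus a 95% confidence interval."""
    se0 = sqrt(pi0 * (1 - pi0) / n)        # standard error assuming H0 is true
    z = (p_hat - pi0) / se0                # standard errors between estimate and pi0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se = sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error for the interval
    interval = (p_hat - conf_z * se, p_hat + conf_z * se)
    return z, p_value, interval


# Hypothetical data: very large sample, estimate only 0.01 above the H0 value.
z, p_value, (lower, upper) = proportion_test_and_ci(p_hat=0.51, pi0=0.50, n=40_000)
print(f"z = {z:.2f}, P = {p_value:.5f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The test rejects H0 decisively, yet every value in the interval lies within about 0.015 of 0.50. The interval, not the P-value, is what reveals how small the departure from H0 might be.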

THE α-LEVEL IS THE PROBABILITY OF TYPE I ERROR

When H0 is true, let's find the probability of Type I error. Suppose α = 0.05. We've just seen that for the two-sided test about a proportion, the rejection region is |z| ≥ 1.96. So, the probability of rejecting H0 is exactly 0.05, because the probability of the values in this rejection region under the standard normal curve is 0.05. But this is precisely the α-level. The probability of a Type I error is the α-level for the test.
With α = 0.05, if H0 is true, the probability equals 0.05 of making a Type I error and rejecting H0. We control P(Type I error) by the choice of α. The more serious the consequences of a Type I error, the smaller α should be. In practice, α = 0.05 is most common, just as an error probability of 0.05 is most common with confidence intervals (i.e., 95% confidence). However, this may be too high when a decision has serious implications.
For example, consider a criminal legal trial of a defendant. Let H0 represent innocence and Ha represent guilt. The jury rejects H0 and judges the defendant to be guilty if it decides the evidence is sufficient to convict. A Type I error, rejecting a true H0, occurs in convicting a defendant who is actually innocent. In a murder trial, suppose a convicted defendant may receive the death penalty. Then, if a defendant is actually innocent, we would hope that the probability of conviction is much smaller than 0.05.
When we make a decision, we do not know whether we have made a Type I or Type II error, just as we do not know whether a particular confidence interval truly contains the parameter value. However, we can control the probability of an incorrect decision for either type of inference.
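
The link between the rejection region |z| ≥ 1.96 and α = 0.05 can be checked numerically. The sketch below uses the standard normal distribution from Python's standard library; the helper names are invented for this note.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_sided_rejection_prob(critical_z: float) -> float:
    """P(|z| >= critical_z) when H0 is true, i.e. the Type I error probability."""
    return 2 * (1 - std_normal.cdf(critical_z))


def critical_value(alpha: float) -> float:
    """Cutoff c such that the two-sided region |z| >= c has probability alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)


type_i_error = two_sided_rejection_prob(1.96)
print(round(type_i_error, 4))          # approximately 0.05
print(round(critical_value(0.05), 2))  # approximately 1.96
print(round(critical_value(0.01), 2))  # a stricter alpha needs a larger cutoff
```

Choosing a smaller α pushes the cutoff farther into the tails, which is how the choice of α controls P(Type I error).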

Decisions and Types of Errors in Tests

When we need to decide whether the evidence against H0 is strong enough to reject it, we reject H0 if P ≤ α, for a prespecified α-level. Table 6.7 summarizes the two possible conclusions for α-level = 0.05. The null hypothesis is either "rejected" or "not rejected." If H0 is rejected, then Ha is accepted. If H0 is not rejected, then H0 is plausible, but other parameter values are also plausible. Thus, H0 is never "accepted." In this case, results are inconclusive, and the test does not identify either hypothesis as more valid.
It is better to report the P-value than to indicate merely whether the result is "statistically significant." Reporting the P-value has the advantage that the reader can tell whether the result is significant at any level. The P-values of 0.049 and 0.001 are both "significant at the 0.05 level," but the second case provides much stronger evidence than the first case. Likewise, P-values of 0.049 and 0.051 provide, in practical terms, the same amount of evidence about H0. It is a bit artificial to call one result "significant" and the other "nonsignificant."
Some software places the symbol * next to a test statistic that is significant at the 0.05 level, ** next to a test statistic that is significant at the 0.01 level, and *** next to a test statistic that is significant at the 0.001 level.
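
The decision rule and the star convention can be written as two short functions. This is a sketch only: the function names and return strings are invented here, and actual software may differ in how it treats a P-value that falls exactly on a cutoff.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule 'reject H0 if P <= alpha' for a prespecified alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


def significance_stars(p_value: float) -> str:
    """Star labels for the 0.05, 0.01, and 0.001 levels described above."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""


for p in (0.001, 0.049, 0.051):
    print(p, decide(p), significance_stars(p))
```

P-values of 0.049 and 0.051 land on opposite sides of the cutoff even though they carry nearly the same evidence, which is the argument for reporting the P-value itself.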

Test about Mean Weight Change in Anorexic Girls - Ex. - p. 148

(p. 148) Now, for n = 29, df = n − 1 = 28. The P-value equals 0.02. Software can find the P-value for you. For instance, for the one-sided and two-sided alternatives with a data file with variable change for weight change, R reports the results. The one-sided P-value is 0.035/2 = 0.018, half the two-sided P-value of 0.035. The evidence against H0 is relatively strong. It seems that the treatment has an effect.
The significance test concludes that the mean weight gain was not equal to 0. But the 95% confidence interval of (0.2, 5.8) is more informative. It shows just how different from 0 the population mean change is likely to be. The effect could be very small. Also, keep in mind that this experimental study (like many medically oriented studies) had to use a volunteer sample. So, these results are highly tentative, another reason that it is silly for studies like this to report P-values to several decimal places.

THE FIVE PARTS OF A SIGNIFICANCE TEST FOR A MEAN

1. Assumptions: The test assumes the data are obtained using randomization, such as a random sample. The quantitative variable is assumed to have a normal population distribution. We'll see that this is mainly relevant for small sample sizes and certain types of Ha.
2. Hypotheses: The null hypothesis about a population mean μ has the form H0: μ = μ0, where μ0 is a particular value for the population mean. In other words, the hypothesized value of μ in H0 is a single value. This hypothesis usually refers to no effect or no change compared to some standard. For example, Example 5.5 in the previous chapter (page 117) estimated the population mean weight change μ for teenage girls after receiving a treatment for anorexia. The hypothesis that the treatment has no effect is a null hypothesis, H0: μ = 0. Here, the H0 value μ0 for the parameter μ is 0. The alternative hypothesis contains alternative parameter values from the value in H0. The most common alternative hypothesis is Ha: μ ≠ μ0, such as Ha: μ ≠ 0. This alternative hypothesis is called two-sided, because it contains values both below and above the value listed in H0. For the anorexia study, Ha: μ ≠ 0 states that the treatment has some effect, the population mean equaling some value other than 0.
3. Test Statistic: The sample mean ȳ estimates the population mean μ. When the population distribution is normal, the sampling distribution of ȳ is normal about μ. This is also approximately true when the population distribution is not normal but the random sample size is relatively large, by the Central Limit Theorem. Under the presumption that H0: μ = μ0 is true, the center of the sampling distribution of ȳ is the value μ0, as Figure 6.2 shows (refer to pic). A value of ȳ that falls far out in the tail provides strong evidence against H0, because it would be unusual if truly μ = μ0. The evidence about H0 is summarized by the number of standard errors that ȳ falls from the null hypothesis value μ0. Recall that the true standard error is σȳ = σ/√n. As in Chapter 5, we substitute the sample standard deviation s for the unknown population standard deviation σ to get the estimated standard error, se = s/√n. The test statistic is the t-score t = (ȳ − μ0)/se (p. 144). The farther ȳ falls from μ0, the larger the absolute value of the t test statistic. Hence, the larger the value of |t|, the stronger the evidence against H0. We use the symbol t rather than z because, as in forming a confidence interval, using s to estimate σ in the standard error introduces additional error. The null sampling distribution of the t test statistic is the t distribution (see Section 5.3). It looks like the standard normal distribution, having mean equal to 0 but being more spread out, more so for smaller n. It is specified by its degrees of freedom, df = n − 1.
4. P-Value (p. 144): The test statistic summarizes how far the data fall from H0. Different tests use different test statistics, though, and simpler interpretations result from transforming it to the probability scale of 0 to 1. The P-value does this. We calculate the P-value under the presumption that H0 is true. That is, we give the benefit of the doubt to H0, analyzing how unusual the observed data would be if H0 were true. The P-value is the probability that the test statistic equals the observed value or a value in the set of more extreme values that provide even stronger evidence against H0. For Ha: μ ≠ μ0, the more extreme t-values are the ones even farther out in the tails of the t distribution. So, the P-value is the two-tail probability that the t test statistic is at least as large in absolute value as the observed test statistic. This is also the probability that ȳ falls at least as far from μ0 in either direction as the observed value of ȳ. Figure 6.3 shows the sampling distribution of the t test statistic when H0 is true. A test statistic value of t = (ȳ − μ0)/se = 0 results when ȳ = μ0. This is the t-value most consistent with H0. The P-value is the probability of a t test statistic value at least as far from this consistent value as the one observed. To illustrate its calculation, suppose t = 1.283 for a sample size of 369. (This is the result in the example below.) This t-score means that the sample mean ȳ falls 1.283 estimated standard errors above μ0. The P-value is the probability that t ≥ 1.283 or t ≤ −1.283 (i.e., |t| ≥ 1.283). Since n = 369, df = n − 1 = 368 is large, and the t distribution is nearly identical to the standard normal. The probability in one tail above 1.28 is 0.10, so the two-tail probability is P = 2(0.10) = 0.20.
5. Conclusion: Finally, the study should interpret the P-value in context. The smaller P is, the stronger the evidence against H0 and in favor of Ha. Sample Problem - 6.2 - p. 145
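
The test statistic and the two-tail P-value calculation can be sketched as follows. Because Python's standard library has no t distribution, this sketch approximates it with the standard normal, which the text notes is nearly identical when df = 368. The function names are invented for this note, and the approximation would not be appropriate for small n.

```python
from math import sqrt
from statistics import NormalDist


def t_statistic(y_bar: float, mu0: float, s: float, n: int) -> float:
    """t = (y_bar - mu0) / se, with estimated standard error se = s / sqrt(n)."""
    se = s / sqrt(n)
    return (y_bar - mu0) / se


def two_tail_p_large_df(t: float) -> float:
    """Two-tail probability P(|T| >= |t|), using the normal curve for large df."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# The value from the text: t = 1.283 with n = 369, so df = 368.
p_value = two_tail_p_large_df(1.283)
print(round(p_value, 2))  # close to the 0.20 quoted above
```

For small samples, the same steps apply but the tail probability should come from the t distribution with df = n − 1, for example from statistical software.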

significance test for a proportion

For a categorical variable, the parameter is the population proportion for a category. For example, a significance test could analyze whether a majority of the population supports legalizing same-sex marriage by testing H0: π = 0.50 against Ha: π > 0.50, where π is the population proportion supporting it. The test for a proportion, like the test for a mean, finds a P-value for a test statistic that measures the number of standard errors a point estimate falls from the H0 value.
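
As a rough sketch of this test, the code below computes the z statistic with the null standard error se0 = √(π0(1 − π0)/n) and the right-tail P-value for Ha: π > π0. The survey counts (220 supporters out of 400) are hypothetical, chosen only to make the arithmetic easy to follow, and the function name is invented for this note.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_test(successes: int, n: int, pi0: float = 0.50):
    """z test of H0: pi = pi0 against Ha: pi > pi0 for a sample proportion."""
    p_hat = successes / n                  # sample proportion
    se0 = sqrt(pi0 * (1 - pi0) / n)        # standard error when H0 is true
    z = (p_hat - pi0) / se0                # standard errors above the H0 value
    p_value = 1 - NormalDist().cdf(z)      # right-tail probability
    return p_hat, z, p_value


# Hypothetical sample: 220 of 400 respondents support the policy.
p_hat, z, p_value = one_sided_proportion_test(successes=220, n=400)
print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, P = {p_value:.4f}")
```

Here the sample proportion of 0.55 falls two standard errors above 0.50, so the one-sided P-value is a little over 0.02.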

ONE-SIDED SIGNIFICANCE TESTS

We can use a different alternative hypothesis when a researcher predicts a deviation from H0 in a particular direction. It has one of the forms Ha: μ > μ0 or Ha: μ < μ0 (refer to pic). For Ha: μ < μ0, the P-value is the left-tail probability, below the observed t-score. A t-score of t = −1.283 with df = 368 results in P = 0.10 for this alternative. A t-score of 1.283 results in P = 1 − 0.10 = 0.90.
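
The two one-sided results above can be reproduced approximately with the standard normal distribution, which stands in for the t distribution here on the assumption that df = 368 is large enough for the two to be nearly identical. The function names are invented for this note.

```python
from statistics import NormalDist


def left_tail_p(t: float) -> float:
    """P-value for Ha: mu < mu0, the probability below the observed t-score."""
    return NormalDist().cdf(t)


def right_tail_p(t: float) -> float:
    """P-value for Ha: mu > mu0, the probability above the observed t-score."""
    return 1 - NormalDist().cdf(t)


print(round(left_tail_p(-1.283), 2))  # about 0.10
print(round(left_tail_p(1.283), 2))   # about 0.90
```

A sample mean on the opposite side of μ0 from the direction that Ha predicts gives a P-value above 0.50, so it provides no evidence for that alternative.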

EQUIVALENCE BETWEEN CONFIDENCE INTERVALS AND TEST DECISIONS

We now elaborate on the equivalence for means between decisions from two-sided tests and conclusions from confidence intervals, first alluded to in Example 6.3 (page 147). Consider the significance test of - refer to pic

