CH15
Apply Your Knowledge 15.3: A retrospective study reported on complications from prostate needle biopsies by examining the medical records of 1000 consecutive cases of patients who underwent the procedure at one academic medical center between September 2001 and August 2010. Why is it risky to regard these 1000 patients as an SRS from the population of all patients who have a prostate needle biopsy? - Name some factors that might make these 1000 consecutive patient records unrepresentative of all prostate needle biopsy outcomes
There might be a pattern to these consecutive records, and they come from only one academic medical center - They may not be representative of biopsy outcomes at other medical centers
Confidence Intervals & Hypothesis Tests are 2 Procedures for Statistical Inference
They both concern inference about the mean μ of a population when the "simple conditions" (page 348) are true: 1. The data are from an SRS 2. The population has a Normal distribution 3. We know the standard deviation σ of the population
Influences on the Sample Size (n) Needed for Sufficient Power:
1. If you insist on a SMALLER SIGNIFICANCE LEVEL (e.g., 1% rather than 5%), you will need a LARGER SAMPLE - A smaller significance level requires STRONGER EVIDENCE TO REJECT the NULL hypothesis 2. If you insist on HIGHER POWER (e.g., 90% rather than 80%), you will need a LARGER SAMPLE - Higher power gives a BETTER CHANCE OF DETECTING AN EFFECT WHEN IT IS REALLY THERE 3. At any significance level and desired power, a TWO-sided alternative requires a LARGER SAMPLE than a one-sided alternative 4. At any significance level and desired power, AIMING TO DETECT A SMALL EFFECT requires a LARGER SAMPLE than aiming to detect a large effect
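All four influences can be checked numerically with the standard Normal-theory approximation for a one-sample z test, n ≈ ((z_α + z_power)·σ/δ)², where δ is the effect size to detect. This formula is not stated in the text, so treat the sketch below as an illustration under the "simple conditions" rather than the book's method; the critical values are the usual Normal table values.

```python
import math

def needed_n(z_alpha, z_power, sigma, delta):
    """Approximate sample size for a one-sample z test detecting effect delta.

    z_alpha is the critical value for the significance level (one- or
    two-sided), z_power the Normal critical value for the desired power.
    Standard Normal-theory approximation; always round up.
    """
    return math.ceil(((z_alpha + z_power) * sigma / delta) ** 2)

base = needed_n(1.645, 0.842, sigma=1, delta=0.8)  # alpha=.05 one-sided, power .80
print(base)
print(needed_n(2.326, 0.842, 1, 0.8))  # smaller alpha (1%)   -> larger n
print(needed_n(1.645, 1.282, 1, 0.8))  # higher power (90%)   -> larger n
print(needed_n(1.960, 0.842, 1, 0.8))  # two-sided alpha=.05  -> larger n
print(needed_n(1.645, 0.842, 1, 0.4))  # smaller effect size  -> larger n
```

Each of the four variations produces a larger n than the baseline, matching points 1-4 above.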
How Confidence Intervals Behave
1. Larger N = Smaller m 2. m only accounts for Sampling Error - It ignores practical difficulties like undercoverage, nonresponse, or response bias
Check Your Skills 15.18: The most important condition for drawing sound conclusions from statistical inference is usually... a. That the data can be thought of as a random sample from the population of interest b. That the population distribution is exactly Normal c. That no calculation errors are made in the confidence interval or test statistic
A
Check Your Skills 15.23: Here's a quotation from a medical journal: "An uncontrolled experiment in 17 women found a significantly improved mean clinical symptom score after treatment" - "Methodologic flaws make it difficult to interpret the results of this study" The authors of this paper are skeptical about the significant improvement because... a. There is no control group, so the improvement might be due to the placebo effect or to the fact that many medical conditions improve over time b. The P-value given was P = 0.03, which is too large to be convincing c. The response variable might not have an exactly Normal distribution in the population
A
Check Your Skills 15.24: Vigorous exercise helps people live several years longer, on average - Whether mild activities like slow walking extend life, however, is not clear - Suppose that the added life expectancy from regular slow walking is just 2 months A statistical test is more likely to find a significant increase in mean life expectancy if... a. It is based on a very large random sample b. It is based on a very small random sample c. The size of the sample doesn't have any effect on the statistical significance of the test
A
Check Your Skills 15.25: A medical experiment compared the herb echinacea with a placebo for preventing colds - One response variable was "volume of nasal secretions" (if you have a cold, you blow your nose a lot) - We consider the average volume of nasal secretions in people without colds to be μ = 1 - An increase to μ = 3 indicates a cold In testing H₀: μ = 1 versus Hₐ: μ > 1, a Type I error would be... a. Rejecting H₀ when μ = 1 is true b. Rejecting H₀ when μ = 3 is true c. Failing to reject H₀ when μ > 1 is true
A
Check Your Skills 15.21: Many sample surveys use well-designed random samples, but half or more of the original sample can't be contacted or refuse to take part Any errors due to this nonresponse... a. Have no effect on the accuracy of confidence intervals b. Are included in the announced margin of error c. Are in addition to the random variation accounted for by the announced margin of error
C
Check Your Skills 15.27: A laboratory scale is known to have a standard deviation of σ = 0.001 gram in repeated weighings - Scale readings in repeated weighings are Normally distributed, with mean equal to the true weight of the specimen How many times must you weigh a specimen on this scale to get a margin of error no larger than ± 0.0005 with 95% confidence? a. 4 times b. 15 times c. 16 times
C n = [(z*)(σ)/m]² = [(1.96)(0.001)/(0.0005)]² = (3.92)² = 15.37, so round up to n = 16
Type II Error (β)
FAILING TO REJECT H₀ when it is actually FALSE - The chance that a decision to fail to reject H₀ is this error type depends on a number of factors
Ideally, we would like both High Confidence and a Small Margin of Error
High confidence says that our method almost always gives correct answers - A small margin of error says that we have pinned down the parameter quite precisely
Z Procedures
Inference procedures that start with the ONE-SAMPLE Z STATISTIC and use the STANDARD NORMAL DISTRIBUTION Ex: - Under the simple conditions, a confidence interval for the mean μ is M ± m, in which the margin of error m uses a critical value z* as a multiplier - To test a hypothesis H₀: μ = μ₀, we obtain the one-sample z statistic and corresponding P-value
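Under the simple conditions, both z procedures can be computed from summary statistics alone. The sketch below is a minimal stdlib-only implementation; the sample figures M = 105.84, σ = 15, n = 31 are those of Example 15.3 later in these notes, while the hypothesized mean μ₀ = 100 is a made-up value for illustration only.

```python
import math

def phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_interval(mean, sigma, n, z_star=1.960):
    """Confidence interval M +/- z*(sigma/sqrt(n)) under the simple conditions."""
    m = z_star * sigma / math.sqrt(n)
    return mean - m, mean + m

def z_test(mean, mu0, sigma, n):
    """One-sample z statistic and two-sided P-value for H0: mu = mu0."""
    z = (mean - mu0) / (sigma / math.sqrt(n))
    p = 2 * (1 - phi(abs(z)))
    return z, p

lo, hi = z_interval(105.84, 15, 31)   # 95% CI, about (100.56, 111.12)
z, p = z_test(105.84, 100, 15, 31)    # mu0 = 100 is a hypothetical null value
```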
There is No Sharp Border Between "Statistically Significant" and Not, Only Increasingly Strong Evidence as the P-value Decreases
It makes no sense to treat P ≤ 0.05 as a universal rule for what is statistically significant - On the one hand, there is no practical distinction between the P-values 0.048 and 0.052 - On the other hand, the P-value 0.004 offers clearly stronger evidence against H₀ than the P-value 0.04 Providing the exact P-value of a test, or at least its order of magnitude, allows each of us to decide individually if the evidence against H₀ is sufficiently strong
How Small of a P-Value is Convincing Evidence?
This depends to a large extent on 2 circumstances: 1. HOW PLAUSIBLE IS H₀? - If H₀ represents an assumption that the people you must convince have long believed to be true, strong evidence (small P) will be needed to persuade them to reject H₀ 2. What are the CONSEQUENCES OF REJECTING H₀? - If, for example, rejecting H₀ in favor of Hₐ means making an expensive changeover from one established therapy to a newer one, you need strong evidence that the new treatment will save or improve lives - In contrast, preliminary studies do not require taking any action beyond deciding whether to pursue a scientific line of inquiry: It makes sense to continue work in light of even weakly significant evidence
There is No Simple Rule for Deciding when you can act as if a Sample is an SRS - Pay attention to these Cautions:
1. PRACTICAL PROBLEMS (e.g., nonresponse in samples; dropouts from an experiment) CAN HINDER INFERENCE FROM EVEN a WELL-DESIGNED STUDY - NHANES has about an 80% response rate - This is much higher than the response rates for opinion polls and most other national surveys, so by realistic standards NHANES data are quite trustworthy - (NHANES uses advanced methods to try to correct for nonresponse) 2. DIFFERENT METHODS are NEEDED FOR DIFFERENT DESIGNS - The z procedures aren't correct for probability samples more complex than an SRS - Later chapters give methods for some other designs, but we won't discuss inference for really complex designs like that used by NHANES - Always be sure that you (or your statistical consultant) know how to carry out the inference your design calls for 3. There is NO CURE FOR FUNDAMENTAL STUDY DESIGN FLAWS like VOLUNTARY RESPONSE SURVEYS, UNCONTROLLED EXPERIMENTS, or BIASED SAMPLES - Look back at the bad examples in Chapters 6 and 7 and steel yourself to just ignore data from such studies
Margin of Error (m) gets Smaller when:
1. z* is smaller - A SMALLER CRITICAL VALUE z* is the same as a LOWER CONFIDENCE LEVEL C (look at Figure 14.3, page 353) - There is a trade-off between the confidence level and the margin of error - To obtain a smaller margin of error from the same data, you MUST BE WILLING TO ACCEPT LOWER CONFIDENCE - This trade-off is usually NOT USEFUL 2. σ is smaller - The STANDARD DEVIATION σ MEASURES the VARIATION IN the POPULATION - You can think of the VARIATION AMONG INDIVIDUALS IN the POPULATION as noise that OBSCURES THE AVERAGE VALUE μ - It is EASIER TO PIN DOWN μ WHEN σ IS SMALL - Some of that noise may come from measurement errors, so it could be reduced to some extent by using more precise measuring instruments (e.g., measuring heights with a precision of 1 millimeter rather than 1 centimeter) - A large part of the variation among individuals in a population, however, is simply NATURAL VARIATION THAT CANNOT BE REDUCED (e.g., individual heights) 3. n is larger - INCREASING the SAMPLE SIZE n REDUCES the MARGIN OF ERROR FOR ANY CONFIDENCE LEVEL - As a consequence, larger samples allow MORE PRECISE ESTIMATES - However, because n appears under a square root sign, we must TAKE FOUR TIMES AS MANY OBSERVATIONS to cut the margin of error in half - Practical issues (e.g., time, resources, cost, or even ethics) may also limit just how much data can be collected
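These three behaviors all follow from m = z*·σ/√n. A quick numeric check (stdlib only; the critical values 1.645/1.960/2.576 and the σ = 15, n = 31 figures mirror the IQ example used later in the chapter):

```python
import math

def margin(z_star, sigma, n):
    """Margin of error m = z* sigma / sqrt(n)."""
    return z_star * sigma / math.sqrt(n)

# 1. Smaller z* (lower confidence) gives a smaller m
print(margin(1.645, 15, 31), margin(1.960, 15, 31), margin(2.576, 15, 31))
# 2. Smaller sigma gives a smaller m
print(margin(1.960, 10, 31))
# 3. Larger n gives a smaller m; quadrupling n cuts m exactly in half
print(margin(1.960, 15, 31) / margin(1.960, 15, 124))
```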
Margin of Error (m) Accounts only for Sampling Error
A sampling distribution describes how a statistic such as M varies in repeated random samples - This variation causes "random sampling error" because the statistic misses the true parameter by a random amount - Sampling distributions do not take into account any other source of variation or bias in the sample data Thus, the MARGIN OF ERROR in a confidence interval IGNORES EVERYTHING EXCEPT THE SAMPLE-TO-SAMPLE VARIATION DUE exclusively TO CHOOSING THE SAMPLE RANDOMLY - The margin of error in a confidence interval covers only random sampling errors - Practical difficulties such as undercoverage, nonresponse, or response bias are often more serious than random sampling error, yet they cannot be quantified by the margin of error - A confidence interval simply ignores such difficulties
Check Your Skills 15.20: You turn your browser to a website featuring a Quick Vote poll, which allows site visitors to choose an answer to the question of the day - You view yesterday's poll results, which are based on 26,494 responses You should refuse to calculate any 95% confidence interval based on this sample because... a. Yesterday's responses are meaningless today b. Inference from a voluntary response sample can't be trusted c. The sample is too large
B
Check Your Skills 15.22: Which of the following questions does a hypothesis test evaluate? a. The adequacy of the study design b. Chance variations in random sampling c. The importance of an observed effect
B
Check Your Skills 15.26: In Exercise 15.25, the power of the test against the specific alternative μ = 3 is... a. The probability that the test rejects H₀ when μ = 1 is true b. The probability that the test rejects H₀ when μ = 3 is true c. The probability that the test fails to reject H₀ when μ = 3 is true
B
Exercise 15.31: A medical panel prepared guidelines for when cardiac pacemakers should be implanted in patients with heart problems - The panel reviewed a large number of medical studies to judge the strength of the evidence supporting each recommendation For each recommendation, they ranked the evidence as level A (strongest), B, or C (weakest) - Here, in scrambled order, are the panel's descriptions of the three levels of evidence Which is A, which B, and which C? - Explain your ranking Evidence was ranked as level _______ when data were derived from a limited number of trials involving comparatively small numbers of patients or from well-designed data analysis of nonrandomized studies or observational data registries Evidence was ranked as level _______ if the data were derived from multiple randomized clinical trials involving a large number of individuals Evidence was ranked as level _______ when consensus of expert opinion was the primary source of recommendation
B, A, C, respectively
Sample Size (n) Affects Statistical Significance
Because LARGE random samples have SMALL chance variation, VERY SMALL deviations from the null hypothesis can be HIGHLY SIGNIFICANT if the sample is LARGE Because SMALL random samples have A LOT of chance variation, even LARGE deviations from the null hypothesis can FAIL TO BE STATISTICALLY SIGNIFICANT if the sample is SMALL
Check Your Skills 15.19: The coach of a college basketball team records the resting pulse rates of the team members A confidence interval for the mean resting pulse rate of all college-age adults based on these data is of little use because... a. The number of team members is small, so the margin of error will be large b. Many of the students in the team will probably refuse to respond c. The college students in the basketball team can't be considered a random sample from the population
C
Ex 15.3: In Example 14.3 (page 354), we examined the IQ scores of 31 seventh-grade girls in a Midwest school district. The data gave M = 105.84, and we know that σ = 15 A 95% confidence interval for the mean IQ score for all seventh-grade girls in this district is: - M ± z*(σ/√n) - 105.84 ± (1.960)(15/√31) - 105.84 ± 5.28 A 90% confidence interval based on the same data replaces the 95% critical value z* = 1.960 with the 90% critical value z* = 1.645 - This interval is: • M ± z*(σ/√n) • 105.84 ± (1.645)(15/√31) • 105.84 ± 4.43
Lower confidence results in a smaller margin of error, ± 4.43 instead of ± 5.28 In the same way, you can calculate that the margin of error for 99% confidence is larger (± 6.94) because the critical value multiplier is larger (z* = 2.576) Figure 15.1 compares these 3 confidence intervals If we had a sample of 62 girls, the margin of error for 95% confidence would decrease from ±5.28 (1.960 × 15/√31) to ± 3.73 (1.960 × 15/√62) - Doubling the sample size does not cut the margin of error in half, because the sample size n appears under a square root sign - Also keep in mind that a sample of a different size would most likely produce a different sample mean, so that the center of the confidence interval would be affected as well
Ex 15.1: A neurobiologist is interested in how our visual perception can be fooled by optical illusions - Her subjects are students in Neuroscience 101 at her university A sociologist at the same university uses students in Sociology 101 to examine attitudes toward the use of human subjects in science
Most neurobiologists would agree that it's safe to treat the students as an SRS of all people with normal vision - There is nothing unique about being a student that changes the neurobiology of optical illusions Students as a group are younger than the adult population as a whole - Even among young people, students as a group tend to come from more prosperous and better-educated homes - Even among students, this university isn't typical of all campuses - Even on this campus, students in a sociology course may have opinions that are quite different from those of engineering students or biology students - The sociologist can't reasonably act as if these students are a random sample from a population of general interest such as all adults or even all young adults in college
Power (1 - β)
Of a test against a specific alternative, it is the PROBABILITY THAT the TEST WILL REJECT H₀ AT A CHOSEN SIGNIFICANCE LEVEL α WHEN the SPECIFIED ALTERNATIVE VALUE OF the PARAMETER IS TRUE When this value is high, think of the test as being highly sensitive to deviations from the null hypothesis
Ex 15.2: The NHANES study that produced the height data for Ex 14.1 (page 348) uses a Complex Multistage Sample Design, so it's a bit of an oversimplification to treat the height data as coming from an SRS from the population of eight-year-old boys - This is a situation in which the assumption of an SRS might be acceptable for a quick analysis of data from a more complex probability sample The 31 seventh-grade girls in the IQ scores study in Ex 14.3 were randomly chosen from the population of seventh-grade girls in a Midwest school district - This situation involves an actual SRS of the true population of interest The study of inorganic phosphorus in Ex 14.4 uses existing medical records from a random sample of 12 elderly men and women between the ages of 75 and 79 years - This uses an SRS taken from a small portion of the actual population of interest and may not be appropriate to answer the study's question
Overall, the NHANES sample closely resembles an SRS, but professional statisticians would use more complex inference procedures to match the more complex design of the sample We can treat these girls as an SRS from the population of interest However, all these medical records came from the archives of a single laboratory in Ohio - This is a major problem for 2 reasons 1. Individuals who have their blood tested in Ohio may not be representative of the whole population of Americans aged 75 to 79 years and of that population's overall susceptibility to various medical conditions 2. Different laboratories are known to produce slightly different blood test results - These 2 issues should be clearly discussed when presenting the results of the statistical analysis because they severely limit our ability to generalize the findings beyond the population of Ohio residents aged 75 to 79 years who would get their blood tested at that particular laboratory
Ex 15.9: Because the probabilities of the two types of error are just a rewording of significance level and power, we can see from Figure 15.2 what the error probabilities are for the test in Ex 15.8
P(Type I error) = P(reject H₀ when in fact μ = 0) = significance level α = 0.05
P(Type II error) = P(fail to reject H₀ when in fact μ = 0.8) = 1 - power = 1 - 0.812 = 0.188
The two Normal curves in Figure 15.2 are used to find the probabilities of a Type I error (top curve, μ = 0) and of a Type II error (bottom curve, μ = 0.8)
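These probabilities can be checked directly from the Normal distribution. The sketch below assumes the setup of the cola-tasting test of Ex 15.7 (σ = 1, n = 10 tasters, one-sided α = 0.05 with critical value z* = 1.645), which reproduces the power of 0.812 quoted above:

```python
import math

def phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_greater(mu0, mu_alt, sigma, n, z_crit=1.645):
    """P(reject H0: mu = mu0 in favor of mu > mu0 when the true mean is mu_alt)."""
    shift = (mu_alt - mu0) / (sigma / math.sqrt(n))
    return phi(shift - z_crit)

alpha = 0.05                          # P(Type I error): set by the significance level
power = power_greater(0, 0.8, 1, 10)  # ~0.812
beta = 1 - power                      # P(Type II error): ~0.188
```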
What is the Shape of the Population Distribution?
PARAMETRIC INFERENCE PROCEDURES (e.g., one-sample z procedures for a population mean) start with a MODEL FOR the DISTRIBUTION OF the RANDOM VARIABLE IN the POPULATION - Every aspect of the distribution is specified in the model, except for one parameter - This parameter can then be estimated with a confidence interval or evaluated with a hypothesis test - The assumptions, or conditions, for inference are the model specifications MANY of the most BASIC METHODS OF INFERENCE are DESIGNED FOR NORMAL POPULATIONS - That's the case for the z procedures as well as for the more practical procedures for inference about means that we will examine in Chapters 17 and 18 - Fortunately, we have a lot of leeway around this condition This flexibility arises because the z procedures and many other procedures designed for Normal distributions RELY ON the NORMALITY OF the SAMPLING DISTRIBUTION OF M, not Normality of individual observations in the population - The CENTRAL LIMIT THEOREM (page 333) tells us that M is MORE NORMAL THAN INDIVIDUAL OBSERVATIONS and that M IS MORE NORMAL FOR LARGER SAMPLE SIZES - In practice, the z procedures are reasonably accurate for any roughly symmetric distribution for samples of even moderate size - If the sample is large, the sampling distribution of M will be close to Normal even if individual measurements are strongly skewed, as Figure 13.5 (page 332) illustrates We RARELY KNOW the SHAPE OF the POPULATION DISTRIBUTION - In practice we rely on previous studies and on the analysis of sample data - Sometimes experience suggests that our data are likely to come from a roughly Normal distribution; sometimes it suggests otherwise • Ex: heights of people of the same sex and similar ages are close to Normal, but weights are not Always EXPLORE YOUR DATA BEFORE UNDERTAKING INFERENCE - When the data are chosen at random from a population, the shape of the sample data distribution mirrors the shape of the population distribution - Make a dotplot or a histogram of your data and see 
whether the shape is roughly Normal, or make a Normal quantile plot if you have access to the necessary technology Remember that SMALL SAMPLES HAVE A LOT OF CHANCE VARIATION, so NORMALITY IS DIFFICULT TO JUDGE FROM JUST A FEW OBSERVATIONS - Also, OUTLIERS in a sample CAN DISTORT the RESULTS of inference - Any inference procedure based on sample statistics that are not resistant to outliers (i.e., the sample mean M) can be strongly influenced by a few extreme observations - Always check for outliers and deal with them appropriately (correcting typos or excluding outliers that have a legitimate reason for being excluded) before performing the z procedures or other inference methods based on statistics like M that are not resistant to outliers When outliers are present or when the data suggest that the population is strongly non-Normal and your sample size is small, CONSIDER ALTERNATIVE METHODS that don't require Normality and are not sensitive to outliers
Exercise 15.29: The National AIDS Behavioral Surveys asked a random sample of 2673 adult heterosexuals how many sexual partners they had in the past year Why is an estimate based on these findings likely to be biased? - Does the margin of error of a 95% confidence interval for the mean number of sexual partners allow for this bias?
People are likely to lie about their sex lives - The margin of error does not take this into account
Exercise 15.41: Exercise 15.7 concerned detecting acid rain (rainfall with pH less than 5) from pH measurements made on a sample of n days, for several different sample sizes n - That exercise shows how the P-value for an observed sample mean M changes with n - It would be wise to do power calculations before deciding on a sample size Suppose that pH measurements follow a Normal distribution with standard deviation σ = 0.5 You plan to test the hypotheses: - H₀: μ = 5 - Hₐ: μ < 5 - at the 5% level of significance You want to use a test that will almost always reject H₀ when the true mean pH is 4.7 - Use the Power of a Test applet to find the power against the alternative μ = 4.7 for samples of size n = 5, n = 15, and n = 40 What happens to the power as the size of the sample increases? Which of these sample sizes are adequate for use in this setting?
Power 0.381, 0.751, and 0.984 As the sample size increases, power increases For a test that almost always rejects H₀ when μ = 4.7, we should use n = 40
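Without the applet, the same powers can be computed from the Normal distribution. This is a sketch under the stated conditions: the test rejects H₀ when M falls below the cutoff set by the one-sided critical value z* = 1.645.

```python
import math

def phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_less(mu0, mu_alt, sigma, n, z_crit=1.645):
    """P(reject H0: mu = mu0 in favor of mu < mu0 when the true mean is mu_alt)."""
    shift = (mu0 - mu_alt) / (sigma / math.sqrt(n))
    return phi(shift - z_crit)

for n in (5, 15, 40):
    print(n, round(power_less(5, 4.7, 0.5, n), 3))  # 0.381, 0.751, 0.984
```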
Type I Error (α)
REJECTING H₀ when it is actually TRUE - Whatever SIGNIFICANCE LEVEL we choose determines the chance that a decision to reject is this type of error
Here are the Q's we Must Answer to Decide How Large a Sample we Must Take when Conducting a Significance Test:
SIGNIFICANCE LEVEL - How much protection do we want against getting a statistically significant result from our sample when there really is no effect in the population? EFFECT SIZE - How large an effect in the population is important in practice? POWER - How confident do we want to be that our study will detect an effect of the size we think is important?
Ex 15.7: A quality control engineer plans a study to assess the impact of storage on the sweetness of a new cola - Ten trained tasters are available to rate the cola's sweetness on a 10-point scale before and after storage, so that each taster's judgment of loss of sweetness can be assessed - Industry records indicate that sweetness loss scores vary from taster to taster according to a Normal distribution with standard deviation σ = 1 To see if the taste test suggests that the cola does lose sweetness, the engineer will test: - H₀: μ = 0 - Hₐ: μ > 0 Are 10 tasters enough, or should more be used?
Significance level - Requiring significance at the 5% level is enough protection against declaring there is a loss in sweetness when no change would actually be found if we could look at the entire population - It means that when there is no change in sweetness in the population, 1 out of 20 samples of tasters will wrongly find a statistically significant loss Effect size - A mean sweetness loss of 0.8 point on the 10-point scale will be noticed by consumers, so it is important in practice Power - The engineer needs to be 90% confident that the test will detect a mean loss of 0.8 point in the population of all tasters - A significance level of 5% is standard in the food industry for detecting an effect - Thus, the objective is to have probability at least 0.9 that a test at the α = 0.05 level will reject the null hypothesis H₀: μ = 0 when the true population mean is μ = 0.8 The probability that the test successfully detects a sweetness loss of the specified size is the Power of the test - You can think of tests with high power as being highly sensitive to deviations from the null hypothesis - Here, the engineer decided on power 90% when the truth about the population is a mean sweetness loss of μ = 0.8
Statistical Significance Does Not Imply Practical Importance
Statistical significance means "the sample showed an effect larger than would often occur just by chance if the null hypothesis was true" - It says nothing about how large the effect actually is - Likewise, it says nothing about the importance of the effect in context Always look at the actual size of an effect in context, as well as at its statistical significance A confidence interval is more helpful than a P-value to assess the approximate size of an effect - However, even a confidence interval means little without context Statistical significance does not tell us whether an effect is large enough to be important in context - That is, statistical significance is not the same thing as practical "significance," an everyday word commonly used as a synonym for "meaningful" or "important"
Statistical Significance Depends on Sample Size (n)
Statistical significance says that an observed effect (the difference between the sample statistic and the assumed value of the parameter under the null hypothesis) would be unlikely to happen if the null hypothesis was true This raises an important question: - Can a tiny effect be statistically significant? - Yes Think of the z test statistic for inference about a population mean as follows: - z = (M - μ₀)/(σ/√n) = (size of the observed effect)/(size of chance variation) The numerator M - μ₀ shows how far the data diverge from the parameter in the null hypothesis The denominator represents the size of chance variations from sample to sample, measured by the standard deviation of M, and is a function of the sample size Therefore, STATISTICAL SIGNIFICANCE DEPENDS ON the combination of three factors: 1. The SIZE OF the OBSERVED EFFECT (M - μ₀) 2. The VARIABILITY OF INDIVIDUALS IN the POPULATION (σ) 3. The SAMPLE SIZE (n) Because all three factors affect the P-value, we cannot conclude that a small P-value is necessarily due to a large effect size - Even a tiny effect can be highly statistically significant if the sample size is very large, individual variations are very small, or both Conversely, even a large effect may not reach statistical significance if the sample size is too small, the data are extremely variable, or both Therefore, we must always keep in mind that "failing to reject the null hypothesis" does not imply that the null hypothesis is true
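The z formula makes this concrete: holding the observed effect and σ fixed while increasing n inflates z and shrinks the P-value. A quick stdlib sketch (the effect size 0.1 and σ = 1 are made-up illustration values, not from the text):

```python
import math

def phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_and_p(effect, sigma, n):
    """z statistic and two-sided P-value for a fixed observed effect M - mu0."""
    z = effect / (sigma / math.sqrt(n))
    return z, 2 * (1 - phi(abs(z)))

print(z_and_p(0.1, 1, 25))    # z = 0.5: far from significant
print(z_and_p(0.1, 1, 2500))  # z = 5.0: highly significant, same tiny effect
```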
Sample Size (n) for a Desired Margin of Error (m)
The Confidence Interval for the mean of a Normal population will have a specified Margin of Error m when the Sample Size is: - n = [(z*)(σ)/m]² The size of the sample determines the margin of error - The size of the population does not influence the sample size we need (as long as the population is much larger than the sample)
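The formula translates directly to code; note that "round up" means a ceiling, not ordinary rounding. The two checks use the lab-scale numbers of Check Your Skills 15.27 and the body-temperature numbers of Example 15.6 from this chapter:

```python
import math

def sample_size(z_star, sigma, m):
    """Smallest n whose margin of error is no larger than m; always round up."""
    return math.ceil((z_star * sigma / m) ** 2)

print(sample_size(1.96, 0.001, 0.0005))  # lab scale: 16 weighings
print(sample_size(1.96, 0.6, 0.05))      # body temperature: 554 adults
```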
Statistical Significance Depends on the Alternative Hypothesis (Hₐ)
The P-value for a one-sided test is typically one-half the P-value for the two-sided test of the same Null Hypothesis using the same data - The two-sided P-value combines two equal areas, one in each tail of a symmetric density curve - The one-sided P-value is just one of these areas, in the direction specified by the Alternative Hypothesis It makes sense that the EVIDENCE AGAINST H₀ is STRONGER WHEN the ALTERNATIVE is ONE-sided, because the evidence is BASED ON the DATA PLUS INFORMATION ABOUT the DIRECTION OF POSSIBLE DEVIATIONS FROM H₀ - If you lack this added information, you must use a two-sided Alternative Hypothesis The added information should come from existing scientific knowledge or a legitimate inquiry established BEFORE data collection - Sample data is used as evidence, so it should never be the basis for determining the direction of the Alternative Hypothesis
What a P-Value Is Not:
The P-value is NOT THE PROBABILITY THAT THE NULL HYPOTHESIS IS TRUE - A large P-value says only that the sample data collected would not be surprising if the test assumptions and the Null Hypothesis were both correct - These data may be compatible with the population model we are investigating, but that doesn't mean that they would be incompatible with any other model - A large P-value, no matter how large, is simply inconclusive The P-value DOES NOT GIVE THE PROBABILITY THAT THE DATA ARE THE RESULT OF CHANCE VARIATIONS ALONE - It gives the probability that the data (or data even more extreme) could be the result of chance variations alone if the test assumptions and the Null Hypothesis are both correct - Because sample data cannot tell us when the Null Hypothesis is actually true, the P-value DOES NOT MEASURE THE CHANCE THAT THE OBSERVED EFFECT IS JUST CHANCE ALONE The P-value is NOT USEFUL IF YOU HAVEN'T CHECKED THAT ALL THE TEST ASSUMPTIONS ARE MET - A Hypothesis Test compares actual sample data in relation to a specified model for the data - This model is based on a set of assumed properties of the population and the data collection process - The Null Hypothesis is just one element of this model - Therefore, a small P-value can happen either when the test assumptions are not met, when the Null Hypothesis is not true, or both - This means that we can LEGITIMATELY REJECT THE NULL HYPOTHESIS WHEN the P-VALUE IS SMALL only IF WE KNOW THAT THE TEST ASSUMPTIONS ARE MET
Ex 15.6: Ex 14.10 (page 368) reports a study of the mean body temperature of healthy adults - We know that the population standard deviation is σ = 0.6 degrees Fahrenheit (°F) We want to estimate the mean body temperature μ for healthy adults within ± 0.05 °F with 95% confidence - How many healthy adults must we measure?
The desired margin of error is m = 0.05 °F For 95% confidence, the critical value is z* = 1.96 (from software or Table C) We know that σ = 0.6 °F Therefore, - n = [(z*)(σ)/m]² = (1.96)²(0.6)²/(0.05)² = (3.8416)(0.36)/(0.0025) = 1.383/0.0025 = 553.2, round up to 554 Because a sample of 553 healthy adults will give a slightly larger margin of error than desired, and a sample of 554 healthy adults a slightly smaller margin of error, we must recruit 554 healthy adults for the study - Always round up to the next higher whole number when finding a large enough n
Apply Your Knowledge 15.13: We discussed earlier how small a P-value should be so we can reject the null hypothesis - One factor mentioned was how plausible the null hypothesis is, such that rejecting a null hypothesis that is generally considered true may demand especially strong evidence (small significance level) Which of the studies described in the preceding Discussion would demand the smallest significance level for this particular reason? - Explain your reasoning
The earlier studies, when the scientific consensus was that bacteria could not thrive in extremely low pH environments
Any Confidence Interval or Hypothesis Test can be Trusted Only Under Specific Conditions (i.e., the 3 "Simple Conditions")
The final "simple condition"—that we know the population standard deviation σ of the quantitative variable studied—is rarely satisfied in practice - The z procedures are therefore of little practical use, especially in the life sciences - Fortunately, it's easy to do without the "known σ" condition The first two "simple conditions" (SRS, Normal population) are more difficult to escape if we are to trust the results of statistical inference As you plan inference, you should always ask, "WHERE DID THE DATA COME FROM?" For a QUANTITATIVE variable, you must often also ask, "What is the SHAPE OF the POPULATION DISTRIBUTION?"
Where Did the Data Come From?
The most important requirement for any inference procedure is that the data come from a process to which the laws of probability apply Inference is most reliable when the data come from a: 1. PROBABILITY SAMPLE - They use chance to choose respondents 2. RANDOMIZED COMPARATIVE EXPERIMENT - They use chance to assign subjects to treatments The deliberate use of CHANCE ENSURES THAT THE LAWS OF PROBABILITY APPLY TO the OUTCOMES, which in turn ENSURES THAT STATISTICAL INFERENCE MAKES SENSE If your data don't come from a probability sample or a randomized comparative experiment, your CONCLUSIONS MAY BE CHALLENGED - To answer the challenge, you must usually rely on SUBJECT-MATTER KNOWLEDGE, not on statistics It is common practice to apply statistics to data that are not produced by true random selection - When you see such a study, ask whether the data can be trusted as a basis for the conclusions of the study
The P-Value of a Hypothesis Test is...
The probability, computed assuming that the test assumptions and the Null Hypothesis are both correct, of obtaining a sample statistic at least as extreme (in the direction of the Alternative Hypothesis) as the one actually observed
Ex 15.4: Facebook conducted a randomized experiment to see whether emotional contagion occurs on social media, in the absence of in-person interaction - The company reduced for one week the amount of emotional (positive, negative, or neutral) content from friends in the News Feed of a random sample of 689,003 Facebook users without informing them, then monitored the users' posts the following week
The published findings state that "omitting emotional content reduced the amount of words the person subsequently produced, both when positivity was reduced (z = -4.78, P < 0.001) and when negativity was reduced (z = -7.219, P < 0.001)" The findings are highly statistically significant (P < 0.001), yet they correspond to a Facebook user producing "an average of one fewer emotional word, per thousand words, over the following week" While the observed effect size is extremely small, the huge sample size helps make it highly statistically significant - What the very small P-values tell us here is that even something as tiny as one word in 1000, on average, would not likely be observed in a random sample of 689,003 Facebook users if manipulating the users' News Feeds had no effect on their following posts The randomized experiment showed that reducing the emotional content of Facebook users' News Feed for one week resulted in an effect size on the order of one fewer emotional word per 1000 words posted over the following week - The observed effect size is clearly very small, but Facebook is used a lot by a lot of people - So what is the practical importance of this finding?
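As a quick sanity check on the reported significance levels, the P-values for the published z statistics can be computed from the standard Normal CDF (a sketch; Φ is built from `math.erf`, and two-sided tests are assumed here):

```python
import math

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z):
    """Two-sided P-value for an observed z statistic."""
    return 2 * phi(-abs(z))

# z statistics reported in the Facebook study
for z in (-4.78, -7.219):
    print(f"z = {z}: P = {two_sided_p(z):.2e}")  # both far below 0.001
```

Both P-values come out on the order of 10⁻⁶ or smaller, consistent with the published "P < 0.001".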
Beware of Multiple Analyses &, If Possible, Replicate
The reasoning behind statistical significance works well if you decide what effect you are seeking, design a study to search for it, and use a hypothesis test to weigh the evidence you get. However, for every null hypothesis you reject because you found a small P-value, there is a chance that the decision is wrong, representing a Type I error Running one hypothesis test and reaching the 5% level of significance is reasonably good evidence that you have found something real - Running 20 tests and reaching that significance level only once is not Hypothesis tests are not designed to be used blindly over and over again in hopes that something will come up statistically significant - That improper approach is known as "P-hacking" or "data dredging" The concern over multiple analyses applies to confidence intervals as well Another form of cherry-picking happens when only statistically significant findings are publicly reported - The resulting publication bias undermines the entire scientific process and lowers the usefulness of a P-value Statistical inference is a probabilistic approach that uses sample data to draw conclusions about a population - This approach has a number of limitations, as we have discussed here - One way to improve on it is to replicate an entire study, from data collection to data analysis - When a study is reproduced and the findings are independently confirmed, the validity of scientific conclusions is reinforced
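The danger of running many tests can be quantified: if all 20 null hypotheses are actually true and each test is run at α = 0.05 independently, the chance of at least one "significant" result is 1 − 0.95²⁰ ≈ 0.64. A minimal sketch:

```python
def familywise_error(alpha, k):
    """Probability of at least one Type I error across k independent tests
    when every null hypothesis is true and each test uses level alpha."""
    return 1 - (1 - alpha) ** k

print(round(familywise_error(0.05, 20), 3))  # ~0.642
```

So a single hit out of 20 tests at the 5% level is roughly what pure chance would deliver, which is why it is not evidence of a real effect.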
Ex 15.5: The objective of the Facebook experiment was to study emotional contagion on social media - The response variable chosen for that purpose was the relative number of emotional words in future posts However, posting fewer emotional words may not be the same thing as feeling emotions less intensely - Perhaps Facebook users tend to repeat words they read in other posts, simply because the words are fresh in their mind or because they seem trendy. - Therefore, we cannot tell whether the observed effect truly reflects "emotional contagion" or something more akin to herd behavior, which wouldn't have much practical importance at all
The study found a very small effect when some friends' posts containing at least one emotional word were removed - Most people (except for bullies) don't purposefully alter their posts to manipulate their friends' emotions on a large scale Therefore, even if the effect reflects true emotional contagion, the wording of friends' posts may not impact emotions much in a realistic setting However, even tiny effects can sometimes have a large impact in aggregate - Facebook has more than 1 billion active users worldwide and the company filters out all News Feeds using "a ranking algorithm that Facebook continually develops and tests in the interest of showing viewers the content they will find most relevant and engaging" In the long run, and over a massive worldwide scale, the overall effect could be meaningful
Ex 15.8: Finding the power of the z test is less challenging than most other power calculations because it requires only a Normal distribution probability calculation The Statistical Power applet does this and illustrates the calculation with Normal curves - Enter the information from Ex 15.7 into the applet: hypotheses, population standard deviation σ = 1, sample size n = 10, significance level α = 0.05, and alternative μ = 0.8 - Click "UPDATE." The applet output appears in Figure 15.2.
The two Normal curves in Figure 15.2 show the sampling distribution of M under the null hypothesis μ = 0 (top) and under the specific alternative μ = 0.8 (bottom) - The curves have the same shape because σ/√n does not change The shaded region at the right of the top curve has area 0.05 - It marks off values of M that are statistically significant at the α = 0.05 level The lower curve shows the probability of these same values when μ = 0.8 - This area is the power, 0.812 The power of the test against the specific alternative μ = 0.8 is 0.812 - That is, the test will reject H₀ about 81% of the time when this alternative is true - Therefore, 10 observations are too few to give power 90% Some software programs automate the process of selecting the desired sample size for a given power - Figure 15.3 gives Minitab's output for the one-sided z test aiming to have power 0.9 against several specific alternatives at the 5% significance level when the population standard deviation is σ = 1 - In this output, "Difference" is the difference between the null hypothesis value μ = 0 and the alternative we want to detect - In other words, it is the effect size - The "Sample Size" column shows the smallest number of observations needed for power 0.9 against each effect size
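The applet's power figure can be reproduced directly, since the power of a one-sided z test is just a Normal probability (a sketch; Φ is built from `math.erf`, and the textbook critical value z* = 1.645 for α = 0.05 is hardcoded rather than computed):

```python
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_one_sided_z(mu_alt, mu0, sigma, n, z_star=1.645):
    """Power of the one-sided z test of H0: mu = mu0 vs Ha: mu > mu0
    against the specific alternative mu_alt."""
    # Reject H0 when M > mu0 + z* * sigma/sqrt(n);
    # power is the probability of that event when the true mean is mu_alt
    cutoff = mu0 + z_star * sigma / math.sqrt(n)
    return 1 - phi((cutoff - mu_alt) / (sigma / math.sqrt(n)))

# Ex 15.7/15.8: sigma = 1, n = 10, alpha = 0.05, alternative mu = 0.8
print(round(power_one_sided_z(0.8, 0, 1, 10), 3))  # ~0.812
```

This matches the applet output in Figure 15.2: power 0.812, short of the desired 90%.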
Power, Type I Error, & Type II Error in Hypothesis Tests
We can assess the performance of a test by giving 2 probabilities: the 1. Significance Level α - The probability of making the wrong decision when the null hypothesis is true - The probability of Type I Error 2. Power for an alternative that we want to be able to detect - The probability of making the right decision when that alternative is true - The power of a test against any specific alternative is 1 minus the probability of a Type II Error for that alternative: power = 1 - β We can just as well describe the test by giving the probability of a wrong decision under both conditions The possibilities are summed up in Figure 15.4 - When H₀ is true, our decision is correct if we fail to reject H₀, but it is a Type I Error if we reject H₀ - There is only one possible value for the parameter of interest when H₀ is true, so the probability of a Type I Error is just the significance level α, regardless of any other factor - When Hₐ is true, our decision is correct if we reject H₀, but it is a Type II Error if we fail to reject H₀ - However, if H₀ is not true, then we do not have an exact value for the parameter of interest; it could be anything consistent with Hₐ - This is why the probability of a Type II Error is different for different "effect sizes"
Exercise 15.37: Suppose we know that the aspirin content of aspirin tablets in Ex 14.6 (page 357) follows a Normal distribution with standard deviation σ = 5 mg a. You must verify the aspirin content of tablets produced in a day to within ± 1 mg with 99% confidence - How large a sample of aspirin tablets from the daily production do you need? b. If you needed to obtain only a 95% confidence interval with a margin of error no larger than 1 mg, how large a sample of aspirin tablets from the daily production would you need instead?
a) 166 - n = [(z*)(σ)/m]² - (2.575)²(5)²/(1)² - (6.631)(25)/(1) - 165.77/1 - 165.8 ≈ 166 b) 97 - n = [(z*)(σ)/m]² - (1.96)²(5)²/(1)² - (3.842)(25)/(1) - 96.04/1 - 96.04 ≈ 97
Apply Your Knowledge 15.15: Exercise 14.5 (page 356) described the manufacturing process of a pharmaceutical product - Repeated measurements follow a Normal distribution with mean μ equal to the true product concentration, and with standard deviation σ = 0.0068 g/l a. How many measurements would be needed to estimate the true concentration within ± 0.001 g/l with 95% confidence? b. How many measurements would be needed to estimate the true concentration within ± 0.001 g/l with 99% confidence? - What is the implication of choosing a higher confidence level?
a) 178 - n = [(z*)(σ)/m]² - (1.96)²(0.0068)²/(0.001)² - (3.842)(0.00004624)/0.000001 - 0.0001776/0.000001 - 177.6 ≈ 178 b) 307 - n = [(z*)(σ)/m]² - (2.575)²(0.0068)²/(0.001)² - (6.631)(0.00004624)/0.000001 - 0.0003066/0.000001 - 306.6 ≈ 307 A higher confidence level requires a larger sample size
Apply Your Knowledge 15.5: Consider a variable known to be Normally distributed with unknown mean μ and known standard deviation σ = 10 a. What would be the margin of error of a 95% confidence interval for the population mean μ based on a random sample of size 25? - The multiplier for a z confidence interval with a 95% confidence level is the critical value z* = 1.960 b. What would be the margin of error of a 95% confidence interval for μ based on a random sample of size 100? - Based on a random sample of size 400? c. Compare the margins of error for samples of size 25, 100, and 400 - How does increasing the sample size change the margin of error of a confidence interval?
a) 3.92 - m = z*(σ/√n) - (1.96)(10/√25) - (1.96)(10/5) - (1.96)(2) - 3.92 b) 1.96 (for n = 100) and 0.98 (for n = 400) m = z*(σ/√n) - (1.96)(10/√100) - (1.96)(10/10) - (1.96)(1) - 1.96 m = z*(σ/√n) - (1.96)(10/√400) - (1.96)(10/20) - (1.96)(0.5) - 0.98 c) As the sample size increases, the margin of error decreases
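The pattern in part c (quadruple the sample size, halve the margin of error) follows directly from m = z*σ/√n; a quick check:

```python
import math

def margin_of_error(z_star, sigma, n):
    """Margin of error of a z confidence interval: m = z* * sigma / sqrt(n)."""
    return z_star * sigma / math.sqrt(n)

# Apply Your Knowledge 15.5: sigma = 10, 95% confidence (z* = 1.96)
for n in (25, 100, 400):
    print(n, round(margin_of_error(1.96, 10, n), 2))  # 3.92, 1.96, 0.98
```

Because n enters through √n, cutting m in half always requires four times as many observations.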
Apply Your Knowledge 15.9: Your company markets a computerized medical diagnostic program used to evaluate thousands of people - The program scans the results of routine medical tests (e.g., pulse rate and blood tests) and refers the case to a doctor if it finds evidence of a medical problem - The program makes a decision about each person a. What are the null and alternative hypotheses and the two types of error that the program can make? - Describe the two types of error in terms of "false-positive" and "false-negative" test results b. The program can be adjusted to decrease one error probability, at the cost of an increase in the other error probability - Which error probability would you choose to make smaller, and why? - (This is a matter of judgment; there is no single correct answer)
a) H₀: patient is healthy, Hₐ: patient is ill - Type I error: sending a healthy patient to the doctor (a false-positive result) - Type II error: clearing a patient who is ill (a false-negative result) b) Answers will vary - Many would make the Type II error probability smaller, because failing to refer a patient who is actually ill could have far more serious consequences than an unnecessary doctor visit
Exercise 15.43: The National Center for Health Statistics reports that the systolic blood pressure for males 35 to 44 years of age has mean 128 mm Hg and standard deviation 15 mm Hg The medical director of a large company looks at the medical records of 72 executives in this age group and finds that the mean systolic blood pressure in this sample is M = 126.07 mm Hg At a 5% significance level, the data fail to show significant evidence that the mean blood pressure of a population of executives differs from the national mean μ = 128 mm Hg - The medical director now wonders if the test used would detect an important difference if one were present a. State the null and alternative hypotheses in mathematical form b. Use the Statistical Power applet to find the power of the two-sided test against the alternative μ = 134 mm Hg c. Use the applet to calculate the power of the test against the alternative μ = 122 mm Hg - Can the test be relied on to detect a mean that differs from 128 mm Hg by 6 mm Hg? d. If the alternative were farther from H₀, say, μ = 136 mm Hg, would the power be higher or lower than the values calculated in parts b and c?
a) H₀: μ = 128 - Hₐ: μ ≠ 128 b) Power = 0.924 c) Power = 0.924 - Yes d) The power would be higher
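The applet's power values in parts b and c can be reproduced with a Normal probability calculation for the two-sided z test (a sketch; Φ is built from `math.erf`, and z* = 1.96 for α = 0.05 is hardcoded):

```python
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_two_sided_z(mu_alt, mu0, sigma, n, z_star=1.96):
    """Power of the two-sided z test of H0: mu = mu0
    against the specific alternative mu_alt."""
    se = sigma / math.sqrt(n)
    shift = (mu_alt - mu0) / se
    # Reject when |Z| > z*; under mu_alt, Z is Normal with mean `shift` and SD 1
    return phi(-z_star - shift) + 1 - phi(z_star - shift)

# Exercise 15.43: mu0 = 128, sigma = 15, n = 72
print(round(power_two_sided_z(134, 128, 15, 72), 3))  # ~0.924
print(round(power_two_sided_z(122, 128, 15, 72), 3))  # ~0.924 (symmetric)
```

The two alternatives sit 6 mm Hg on either side of 128, so their powers are equal by symmetry, matching the answers in parts b and c.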
Apply Your Knowledge 15.11: A researcher looking for evidence of extrasensory perception (ESP) tests 500 subjects - 4 of these subjects do significantly better (P < 0.01) than random guessing a. Is it proper to conclude that these 4 people have ESP? - Explain your answer b. What should the researcher now do to test whether any of these 4 subjects have ESP?
a) No; since we are running the test so many times, we would expect about 5 individuals to get P ≤ 0.01 b) Re-administer the ESP test to these 4 individuals to see if their performance is confirmed
Apply Your Knowledge 15.1: The New England Journal of Medicine posts its peer-reviewed articles and editorials on its website - An opt-in poll was featured next to an editorial about the regulation of sugar-sweetened beverages - The poll asked, "Do you support government regulation of sugar-sweetened beverages?" - Readers just needed to click on a response (yes or no) to become part of the sample - The poll stayed open for several weeks in October 2012 - Of the 1290 votes cast, 864 were "yes" responses a. Would it be reasonable to calculate from these data a confidence interval for the percent answering "yes" in the American population? - Explain your answer. b. Would it be reasonable to calculate from these data a confidence interval for the percent answering "yes" among online readers of the New England Journal of Medicine? - Explain your answer
a) No; the poll is a voluntary survey b) No; inference is still inappropriate, because the poll is a voluntary survey
Apply Your Knowledge 15.17: Consider two distinct situations a. A statistical test at significance level α = 0.05 has power 0.78 - What are the probabilities of Type I and Type II errors for this test? b. A statistical test at the α = 0.01 level has probability 0.14 of making a Type II error when a specific alternative is true - What is the power of the test against this alternative?
a) P(Type I error) = 0.05 - P(Type II error) = 1 - 0.78 = 0.22 b) power = 1 - β - 1 - 0.14 - 0.86
Exercise 15.33: The widespread addition of fluoride to tap water in the United States and other countries has been credited with a drastic decrease in tooth decay - However, there is always some concern over possible harmful effects of any kind of treatment Most studies so far have failed to show a negative impact of water fluoridation on health, but one study did report a significant decrease in IQ among individuals exposed to extra fluoride as children compared with individuals not exposed to this treatment - Commenting on this study in a news article on National Public Radio, Dr. Myron Allukian of the Harvard School of Dental Medicine said that the effect on IQ was small, accounting for only about a half-point difference - He further explained that a half-point difference in IQ is meaningless, just like a difference in adult heights of half a millimeter a. Why is it important to know the size of an effect and not just that the effect was statistically significant? b. How can an effect be both small and statistically significant?
a) Statistical significance is not the same as practical significance b) A significance test performed with a large sample size can be statistically significant even if the effect is small (high power)
Exercise 15.35: Statisticians prefer reasonably large samples Describe briefly the effect of increasing the size of a sample (or the number of subjects in an experiment), if all facts about the population remain unchanged, on each of the following: a. The margin of error of a 95% confidence interval b. The P-value of a test, when H₀ is false c. The impact of an outlier in the sample data on a confidence interval or a P-value
a) The margin of error decreases b) The P-value gets smaller c) The outlier will have a greater effect on a small sample
Exercise 15.39: Valium (diazepam) is a widely prescribed sedative and antianxiety medication A study investigated how Valium works by comparing its effect on sleep in 7 knock-in mice, in which the (a2)GABA(A) receptor is insensitive to Valium, and in 8 wild-type control mice - The study found that Valium reduced sleep latency in both groups, with no significant difference between the groups According to the study authors, this lack of significance "is related to the large inter-individual variability that is also reflected in the low power (20%) of the test" a. Explain in simple language why tests having low power often fail to give evidence against a null hypothesis even when the null hypothesis is really false b. Which aspects of this experiment most likely contributed to the low test power?
a) There are not enough data to determine whether the outcome was due to chance under the null hypothesis b) Small sample size or large variability
Apply Your Knowledge 15.7: Emissions of sulfur dioxide by industry set off chemical changes in the atmosphere that result in "acid rain" - The acidity of liquids is measured by pH on a scale of 0 to 14 - Distilled water has pH 7.0, and lower pH values indicate acidity - Typical rain is somewhat acidic, so acid rain is defined as rainfall with a pH below 5.0 Suppose that pH measurements of rainfall on different days in a Canadian forest follow a Normal distribution with standard deviation σ = 0.5 A sample of n days finds that the mean pH is M = 4.8 Is this good evidence that the mean pH μ for all rainy days is less than 5.0? The answer depends on the size of the sample - Use the P-Value of a Test of Significance applet or technology for your computations a. Enter H₀: μ = 5, Hₐ: μ < 5, σ = 0.5, and M = 4.8 - Then enter n = 5, n = 15, and n = 40 one after the other, clicking "UPDATE" each time to get the three P-values - What are they? b. Sketch the three Normal curves displayed by the applet, with M = 4.8 marked on each curve - Explain why the P-value of the same result M = 4.8 is smaller (more statistically significant) for larger sample sizes c. The corresponding 95% confidence intervals for the mean pH μ would be 4.36 to 5.24 (for n = 5), 4.55 to 5.05 (for n = 15), and 4.65 to 4.95 (for n = 40) - What information is conveyed by a confidence interval that we can't learn from the P-value alone?
a) n = 5; P = 0.1855 - n = 15; P = 0.0607 - n = 40; P = 0.0057 b) As the sample size increases, the sampling distribution gets narrower c) Effect size (an estimate of the true value of μ) rather than simply whether μ differs from a reference value
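The three P-values in part a come from a one-sided z test, so they can be checked without the applet (a sketch using the erf-based Normal CDF):

```python
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value_lower(m, mu0, sigma, n):
    """One-sided P-value for Ha: mu < mu0, given sample mean m."""
    z = (m - mu0) / (sigma / math.sqrt(n))
    return phi(z)

# Apply Your Knowledge 15.7: M = 4.8, mu0 = 5.0, sigma = 0.5
for n in (5, 15, 40):
    print(n, round(p_value_lower(4.8, 5.0, 0.5, n), 4))
```

The same sample mean of 4.8 produces P-values of about 0.19, 0.06, and 0.006 as n grows, illustrating part b: a larger sample narrows the sampling distribution, so a fixed deviation from μ₀ becomes more surprising.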
