Biostatistics all chapters
chapter 13
SPSS Data Table • Most computer programs require data in two columns • One column is for the explanatory variable (group) • One column is for the response variable (hrt_rate) • See next slide for Figure 13-1

John Tukey taught us the importance of exploratory data analysis (EDA) • EDA techniques that apply: - Stemplots - Boxplots - Dotplots

The Problem of Multiple Comparisons • Consider a comparison of three groups. There are three possible t tests when considering three groups: (1) H0: μ1 = μ2 versus Ha: μ1 ≠ μ2 (2) H0: μ1 = μ3 versus Ha: μ1 ≠ μ3 (3) H0: μ2 = μ3 versus Ha: μ2 ≠ μ3 • However, we do not perform separate t tests without modification → this would identify too many random differences

Problem of Multiple Comparisons • Family-wise error rate = probability of at least one false rejection of H0 • Assume all three null hypotheses are true: at α = 0.05, Pr(retain all three H0s) = (1 − 0.05)³ = 0.857. Therefore, Pr(reject at least one) = 1 − 0.857 = 0.143; this is the family-wise error rate. • The family-wise error rate is much greater than intended. This is "The Problem of Multiple Comparisons"

Problem of Multiple Comparisons • The more comparisons you make, the greater the family-wise error rate. • This table demonstrates the magnitude of the problem

Mitigating the Problem of Multiple Comparisons Two-step approach: 1. Test for overall significance using a technique called "Analysis of Variance" 2. Do post hoc comparisons on individual groups

13.3 Analysis of Variance • One-way ANalysis Of VAriance (ANOVA) - Categorical explanatory variable - Quantitative response variable - Test group means for a significant difference • Statistical hypotheses - H0: μ1 = μ2 = ... = μk - Ha: at least one of the μi differs • Method: compare variability between groups to variability within groups (F statistic)

Sum of Squares Between (illustrative data: three groups of 15, group means 73.483, 91.325, and 82.524, grand mean 82.444): SSB = Σ ni(x̄i − x̄)² = (15)(73.483 − 82.444)² + (15)(91.325 − 82.444)² + (15)(82.524 − 82.444)² = 2387.671 • dfB = 3 − 1 = 2 • MSB = 2387.671 / 2 = 1193.836

Fstat and P-value • The Fstat has numerator and denominator degrees of freedom: df1 and df2 respectively (corresponding to dfB and dfW) • Convert Fstat to a P-value with a computer program or Table D • The P-value corresponds to the area in the right tail beyond the Fstat

ANOVA Example (Summary) A. Hypotheses: H0: μ1 = μ2 = μ3 vs. Ha: at least one of the μi differs B. Statistics: Fstat = 14.08 with 2 and 42 degrees of freedom C. P-value = .000021 (via SPSS), providing highly significant evidence against H0; conclude the heart rates (an indicator of the effects of stress) differed in the groups D. Significance level (optional): Results are significant at α = .00005

Figure 13.11 ANOVA table produced with SPSS for Windows, pets and stress illustrative data • Because of the complexity of computations, ANOVA statistics are often calculated by computer • Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint courtesy of International Business Machines Corporation.

ANOVA and the t test (Optional) • ANOVA for two groups is equivalent to the equal variance (pooled) t test (§12.4) • Both address H0: μ1 = μ2 • dfW = df for the t test = N − 2 • MSW = s²pooled • Fstat = (tstat)² • F1,df2,1−α = (tdf,1−α/2)²

13.4 Post Hoc Comparisons • The ANOVA Ha says "at least one population mean differs" but does not delineate which differ.
• Post hoc comparisons are pursued after rejection of the ANOVA H0 to delineate which means differ

Bonferroni Procedure The Bonferroni procedure is instituted by multiplying the P-value from the LSD procedure by the number of post hoc comparisons "c". A. Hypotheses. H0: μ1 = μ2 against Ha: μ1 ≠ μ2 B. Test statistic. Same as for the LSD method. C. P-value. The LSD method produced P = .0000039 (two-tailed). Since there were three post hoc comparisons, PBonf = 3 × .0000039 = .000012.

Bonferroni Confidence Interval • Let c represent the number of post hoc comparisons. Comparing Group 1 to Group 2, with t42,.9917 = 2.51 (the cumulative probability .9917 = 1 − .05/(2 × 3) reflects the Bonferroni adjustment): 95% CI for (μ1 − μ2) = (x̄1 − x̄2) ± t42,.9917 × √[MSW × (1/n1 + 1/n2)] = (73.483 − 91.325) ± (2.51) × √[84.793 × (1/15 + 1/15)] = −17.842 ± (2.51)(3.362) = (−26.3, −9.4)
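As a computational companion to the two-step approach above, here is a minimal Python sketch (assuming SciPy is available; the three groups of values are made-up stand-ins, not the pets-and-stress data). It runs the overall one-way ANOVA, then Bonferroni-adjusts the pairwise P-values; note that the pairwise t tests here pool variance within each pair, a simplification of the LSD procedure, which pools the within-group variance across all k groups.

```python
from itertools import combinations
from scipy import stats

# Hypothetical data: three groups of 15 (NOT the actual pets-and-stress values)
groups = {
    "g1": [73, 75, 71, 74, 72, 76, 73, 74, 70, 75, 72, 74, 73, 75, 74],
    "g2": [91, 89, 93, 90, 92, 88, 94, 91, 90, 92, 89, 93, 91, 90, 92],
    "g3": [82, 84, 81, 83, 82, 85, 80, 83, 82, 84, 81, 83, 82, 84, 83],
}

# Step 1: overall one-way ANOVA of H0: mu1 = mu2 = mu3
F, p = stats.f_oneway(*groups.values())
print(f"ANOVA: Fstat = {F:.2f}, P = {p:.3g}")

# Step 2: post hoc pairwise comparisons with Bonferroni adjustment
pairs = list(combinations(groups, 2))
c = len(pairs)                        # number of post hoc comparisons
for a, b in pairs:
    t, p_pair = stats.ttest_ind(groups[a], groups[b])  # pooled-variance t test
    p_bonf = min(1.0, c * p_pair)     # multiply the P-value by c, cap at 1
    print(f"{a} vs {b}: t = {t:.2f}, P = {p_pair:.3g}, P_Bonf = {p_bonf:.3g}")
```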
chapter 2
Types of Studies • Surveys: describe population characteristics (e.g., a study of the prevalence of hypertension in a population) §2.1 • Comparative studies: determine relationships between variables (e.g., a study to address whether weight gain causes hypertension) §2.2

2.1 Surveys • Goal: to describe population characteristics • Studies a subset (sample) of the population • Uses the sample to make inferences about the population • Sampling: - Saves time - Saves money - Allows resources to be devoted to greater scope and accuracy

Illustrative Example: Youth Risk Behavior Surveillance (YRBS) YRBS monitors health behaviors in youth and young adults in the US. Six categories of health-risk behaviors are monitored: 1. behaviors that contribute to unintentional injuries and violence; 2. tobacco use; 3. alcohol and drug use; 4. sexual behaviors; 5. unhealthy dietary behaviors; and 6. physical activity levels and body weight.

Illustrative Example: Youth Risk Behavior Surveillance (YRBS) The 2003 report used information from 15,240 questionnaires completed at 158 schools to infer health-risk behaviors for the public and private school student populations of the United States and District of Columbia.* The 15,240 students who completed the questionnaires comprise the sample. This information is used to infer the characteristics of the several million public and private school students in the United States for the period in question. * Grunbaum, J. A., Kann, L., Kinchen, S., Ross, J., Hawkins, J., Lowry, R., et al. (2004). Youth risk behavior surveillance—United States, 2003. MMWR Surveillance Summary, 53(2), 1-96.

Simple Random Sampling • Probability samples entail chance in the selection of individuals • This allows for generalizations to the population • The most fundamental type of probability sample is the simple random sample (SRS) • SRS (defined): an SRS of size n is selected so that all possible combinations of n individuals from the population are equally likely to comprise the sample. SRSs demonstrate sampling independence.

Simple Random Sampling Method 1. Number the population members 1, 2, . . ., N 2. Pick an arbitrary spot in the random digit table (Table A) 3. Go down rows or columns of the table to select n appropriate tuples of digits (discard inappropriate tuples) • Alternatively, use a random number generator (e.g., www.random.org) to generate n random numbers between 1 and N.

Illustrative Example: Selecting a simple random sample Suppose a high school population has 600 students and you want to choose three students at random from this population. To select an SRS of n = 3: 1. Get a roster of the school. Assign each student a unique identifier 1 through 600. 2. Enter Table A at (say) line 15. Line 15 starts with these digits: 76931 95289 55809 19381 56686

Illustrative Example: Selecting a simple random sample 3. The first six triplets of numbers in this line are 769, 319, 528, 955, 809, and 193. 4. The first triplet (769) is excluded because there is no individual with that number in the population. The next two triplets (319 and 528) identify the first two students to enter the sample. The next two triplets (955 and 809) are not relevant. The last student to enter the sample is student 193. The final sample is composed of students with the IDs 319, 528, and 193.
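The table-based procedure above can also be carried out with a pseudo-random number generator in place of Table A; here is a minimal Python sketch (the seed is arbitrary and only fixes the output for reproducibility):

```python
import random

N = 600   # population size: school roster numbered 1..N
N_SAMPLE = 3

rng = random.Random(15)                      # arbitrary seed, for a repeatable example
srs = rng.sample(range(1, N + 1), N_SAMPLE)  # every combination of n IDs equally likely
print("Sampled student IDs:", sorted(srs))
```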
Cautions when Sampling • Undercoverage: groups in the source population are left out or underrepresented in the population list used to select the sample • Volunteer bias: occurs when self-selected participants are atypical of the source population • Nonresponse bias: occurs when a large percentage of selected individuals refuse to participate or cannot be contacted

Other Types of Probability Samples • Stratified random samples • Cluster samples • Multistage sampling • These are advanced techniques not generally covered in introductory courses.

§2.2 Comparative Studies • Comparative designs study the relationship between an explanatory variable and a response variable. • Comparative studies may be experimental or nonexperimental. • In experimental designs, the investigator assigns the subjects to groups according to the explanatory variable (e.g., exposed and unexposed groups) • In nonexperimental designs, the investigator does not assign subjects into groups; individuals are merely classified as "exposed" or "non-exposed."

Figure 2.1 Experimental and nonexperimental study designs

Example of an Experimental Design • The Women's Health Initiative study randomly assigned about half its subjects to a group that received hormone replacement therapy (HRT). • Subjects were followed for ~5 years to ascertain various health outcomes, including heart attacks, strokes, the occurrence of breast cancer, and so on.

Example of a Nonexperimental Design • The Nurses' Health Study classified individuals according to whether they received HRT. • Subjects were followed for ~5 years to ascertain the occurrence of various health outcomes.

Comparison of Experimental and Nonexperimental Designs • In both the experimental (WHI) study and the nonexperimental (Nurses' Health) study, the relationship between HRT (explanatory variable) and various health outcomes (response variables) was studied. • In the experimental design, the investigators controlled who was and who was not exposed. • In the nonexperimental design, the study subjects (or their physicians) decided whether or not subjects were exposed.

Let's focus on selected experimental design concepts and techniques • Experimental designs provide a paradigm for nonexperimental designs.

Jargon • A subject ≡ an individual participating in the experiment • A factor ≡ an explanatory variable being studied; experiments may address the effect of multiple factors • A treatment ≡ a specific set of factors

Illustrative Example: Hypertension Trial • A trial looked at two explanatory factors in the treatment of hypertension. • Factor A was a health-education program aimed at increasing physical activity, improving diet, and lowering body weight. This factor had two levels: active treatment or passive treatment. • Factor B was pharmaceutical treatment at three levels: medication A, medication B, and placebo.

Illustrative Example: Hypertension Trial • Because there were two levels of the health-education variable and three levels of the pharmacological variable, the experiment evaluated six treatments, as shown in Table 2.2. • The response variable was "change in systolic blood pressure" after six months. One hundred and twenty subjects were studied in total, with equal numbers assigned to each group. • Figure 2.3 is a schematic of the study design.
Table 2.2 Hypertension treatment trial with two factors and six treatments • Subjects = 120 individuals who participated in the study • Factor A = health education (active, passive) • Factor B = medication (Rx A, Rx B, or placebo) • Treatments = the six specific combinations of factor A and factor B

Figure 2.3 Study design outline, hypertension treatment trial illustrative example

Three Important Experimentation Principles: • Controlled comparison • Randomized • Blinded

"Controlled" Trial • The term "controlled" in this context means there is a non-exposed "control group" • Having a control group is essential because the effects of a treatment can be judged only in relation to what would happen in its absence • You cannot judge the effects of a treatment without a control group because: - Many factors contribute to a response - Conditions change on their own over time - The placebo effect and other passive intervention effects are operative

Randomization • Randomization is the second principle of experimentation • Randomization refers to the use of chance mechanisms to assign treatments • Randomization balances lurking variables among treatment groups, mitigating their potentially confounding effects

Randomization - Example Consider this study (JAMA 1994;271:595-600) • Explanatory variable: nicotine or placebo patch • 60 subjects (30 in each group) • Response: cessation of smoking (yes/no) • Random assignment: Group 1 (30 smokers) → Treatment 1 (nicotine patch); Group 2 (30 smokers) → Treatment 2 (placebo patch); compare cessation rates

Randomization - Example • Number the subjects 01, ..., 60 • Use Table A (or a random number generator) to select 30 two-tuples between 01 and 60 • If you use Table A, arbitrarily select a different starting point each time • For example, if we start in line 19, we see 04247 38798 73286

Randomization, cont. • We identify random two-tuples, e.g., 04, 24, 73, 87, etc. • Random two-tuples greater than 60 are ignored • The first three individuals in the treatment group are therefore 04, 24, and 28 • Keep selecting random two-tuples until you identify 30 unique individuals • The remaining subjects are assigned to the control group

Blinding • Blinding is the third principle of experimentation • Blinding refers to measurement of the response made without knowledge of treatment type • Blinding is necessary to prevent differential misclassification of the response • Blinding can occur at several levels of a study design - Single blinding: subjects are unaware of the specific treatment they are receiving - Double blinding: subjects and investigators are blinded

Ethics • Informed consent • Beneficence • Equipoise • Independent (IRB) oversight
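A chance mechanism like the two-tuple selection described for the nicotine-patch trial can be sketched in a few lines of Python (the seed is arbitrary; random.sample plays the role of Table A):

```python
import random

subjects = list(range(1, 61))   # subjects numbered 01..60
rng = random.Random(19)         # arbitrary seed, for a repeatable example

# Randomly assign 30 subjects to the nicotine patch; the rest get placebo
treatment = sorted(rng.sample(subjects, 30))
control = [s for s in subjects if s not in treatment]

print("Nicotine patch:", treatment)
print("Placebo patch: ", control)
```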
Measurement
How we get our data: the assigning of numbers or codes according to prior-set rules, either positioning observations along a numerical continuum or classifying observations into categories (p. 4).
Inaccuracies
Imprecision and bias are the two forms of measurement error (p. 10).
Types of measurement scales
(p. 7) The three measurement scales are categorical, ordinal, and quantitative; each scale takes on the assumptions of the prior type and adds further restrictions.

Categorical measurements place observations into classes or groups (examples on p. 7). They are also called nominal variables, named variables, attribute variables, and qualitative variables.

Ordinal measurements assign observations into categories that can be put into rank order (example on p. 7: stages of cancer).

Quantitative measurements position observations along a meaningful numeric scale. Examples are chronological age (years), body weight (pounds), systolic blood pressure (mmHg), and serum glucose (mmol/L). Some statistical sources use terms such as ratio/interval measurement, numeric variable, scale variable, and continuous variable to refer to quantitative measurements.
biostatistics
The discipline concerned with the treatment and analysis of numerical data derived from biological, biomedical, and health-related studies. A servant of the sciences, biostatistics is more than just a compilation of computational techniques; it is a way to detect patterns and judge responses. Its goals include improvement of the intellectual content of the data, organization of data into understandable forms, and reliance on tests of experience as a standard of validity.
chapter 5
A probability density function (pdf) is a mathematical relation that assigns probabilities to all possible outcomes for a continuous random variable. • The pdf for our random spinner is shown on the next slide. • The shaded area under the curve represents probability, in this instance: Pr(0 ≤ X ≤ 0.5) = 0.5

Figure 5.6 Probability curve (pdf) for Random Number Spinner showing Pr(0.0 ≤ X ≤ 0.5).

Examples of pdfs • pdfs obey all the rules of probabilities • pdfs come in many forms (shapes), for example: the uniform pdf, the Normal pdf, the chi-square pdf, and the pdf from Exercise 5.13 • The most common pdf is the Normal, with density f(x) = [1 / (σ√(2π))] × e^(−(x − μ)² / (2σ²)). (We study the Normal pdf in detail in the next chapter.)

Area Under the Curve • As was the case with pmfs, pdfs display probability with the area under the curve (AUC) • This histogram shades bars corresponding to ages ≤ 9 (~40% of the histogram) • The corresponding shaded AUC on the Normal pdf curve also comprises ~40% of the total.

§5.5: More Rules and Properties of Probability • Independent Events - Events A and B are independent if and only if Pr(A and B) = Pr(A) × Pr(B) - For A and B to be independent, their joint probability must overlap just the right amount. • General Rule of Addition - The general addition rule is: Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B)

Rules • Conditional Probability - Let Pr(B | A) represent the conditional probability of B given A. This denotes the probability of B given that A is evident. By definition, Pr(B | A) = Pr(A and B) / Pr(A), as long as Pr(A) > 0.

Rules • General Rule for Multiplication - Start with the definition of conditional probability, Pr(B | A) = Pr(A and B) / Pr(A), then rearrange the formula as follows: Pr(A and B) = Pr(A) × Pr(B | A). This is the general rule for multiplication.

Rules • Bayes' Theorem - Pr(A | B) = [Pr(B | A) × Pr(A)] / [Pr(B | A) × Pr(A) + Pr(B | Ā) × Pr(Ā)]
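These rules are easy to verify numerically. Here is a minimal Python sketch (SciPy assumed; the Normal parameters and joint probabilities below are made-up numbers for illustration):

```python
from scipy.stats import norm, uniform

# AUC for the random-spinner pdf (uniform on [0, 1]): Pr(0 <= X <= 0.5)
print(uniform.cdf(0.5) - uniform.cdf(0.0))        # 0.5

# AUC under a Normal pdf, e.g., Pr(X <= 9) with mu = 10, sigma = 2.5 (hypothetical)
print(norm.cdf(9, loc=10, scale=2.5))

# Conditional probability: Pr(B | A) = Pr(A and B) / Pr(A)
pr_A, pr_A_and_B = 0.40, 0.10                     # hypothetical values
pr_B_given_A = pr_A_and_B / pr_A
print(pr_B_given_A)                               # 0.25

# General rule for multiplication recovers the joint probability
print(pr_A * pr_B_given_A)                        # 0.10
```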
Chapter 11
In Chapter 11: 11.1 Estimated Standard Error of the Mean 11.2 Student's t Distribution 11.3 One-Sample t Test 11.4 Confidence Interval for μ 11.5 Paired Samples 11.6 Conditions for Inference 11.7 Sample Size and Power

§11.1 Estimated Standard Error of the Mean • We rarely know the population standard deviation σ; instead, we calculate the sample standard deviation s and use this as an estimate of σ • We then use s to calculate this estimated standard error of the mean: SEx̄ = s / √n • Using s instead of σ adds a source of uncertainty → z procedures no longer apply → use t procedures instead

§11.2 Student's t distributions • A family of distributions identified by "Student" (William Sealy Gosset) in 1908 • t family members are identified by their degrees of freedom, df • t distributions are similar to z distributions but with broader tails • As df increases → t tails get skinnier → t becomes more like z

Figure 11.1 t probability density functions with 1, 9, and ∞ degrees of freedom.

t table (Table C) • Use Table C to look up t values and probabilities - Entries → t values - Rows → df - Columns → probabilities

Understanding Table C Let tdf,p ≡ a t value with df degrees of freedom and cumulative probability p. For example, t9,.90 = 1.383
Table C. Traditional t table
Cumulative p: 0.75 0.80 0.85 0.90 0.95 0.975
Upper-tail p: 0.25 0.20 0.15 0.10 0.05 0.025
df = 9: 0.703 0.883 1.100 1.383 1.833 2.262

Figure 11.3 The 10th and 90th percentiles on t9. Left tail: Pr(T9 < −1.383) = 0.10. Right tail: Pr(T9 > 1.383) = 0.10

§11.3 One-Sample t Test A. Hypotheses. H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided) [Ha: µ < µ0 (left-sided) or Ha: µ > µ0 (right-sided)] B. Test statistic. tstat = (x̄ − μ0) / SEx̄ with df = n − 1, where SEx̄ = s/√n C. P-value. Convert tstat to a P-value [Table C or software]. Small P → strong evidence against H0 D. Significance level (optional). See Ch 9 for guidelines.

One-Sample t Test: Statement of the Problem • Do SIDS babies have lower than average birth weights? • We know from prior research that the mean birth weight of non-SIDS babies in this population is 3300 grams • We study n = 10 SIDS babies, determine their birth weights, and calculate x̄ = 2890.5 and s = 720. • Do these data provide significant evidence that SIDS babies have different birth weights than the rest of the population?

One-Sample t Test: Example A. H0: µ = 3300 versus Ha: µ ≠ 3300 (two-sided) B. Test statistic: tstat = (2890.5 − 3300) / (720/√10) = −1.80 with df = 10 − 1 = 9 C. P = 0.1054 [next slide] → weak evidence against H0 D. (optional) Data are not significant at α = .10

Converting the tstat to a P-value • tstat → P-value via Table C: wedge |tstat| = 1.80 between critical value landmarks in the df = 9 row (1.383 and 1.833): one-tailed 0.05 < P < 0.10 and two-tailed 0.10 < P < 0.20 • tstat → P-value via software: a t of −1.80 with 9 df has two tails of 0.1054

Figure 11.4 Two-tailed P-value, SIDS illustrative example

§11.4 Confidence Interval for µ • Typical point "estimate ± margin of error" formula: (1 − α)100% CI for μ = x̄ ± t(n−1,1−α/2) × SEx̄, where SEx̄ = s/√n • t(n−1,1−α/2) is from the t table (see the bottom row for the confidence level) • Similar to the z procedure except it uses s instead of σ and t instead of z

Confidence Interval: Example 1 Let us calculate a 95% confidence interval for μ, the mean birth weight of SIDS babies: x̄ = 2890.5, s = 720.0, n = 10. 95% CI for μ = 2890.5 ± t9,.975 × 720/√10 = 2890.5 ± (2.262)(227.7) = 2890.5 ± 515.1 = (2375.4 to 3405.6) grams

Confidence Interval: Example 2 Data are "% of ideal body weight" in 18 diabetics: {107, 119, 99, 114, 120, 104, 88, 114, 124, 116, 101, 121, 152, 100, 125, 114, 95, 117}. Based on these data we calculate a 95% CI for μ: n = 18, x̄ = 112.778, s = 14.424; SEx̄ = 14.424/√18 = 3.400; t17,.975 = 2.110 (from table). 95% CI for μ = 112.778 ± (2.110)(3.400) = 112.778 ± 7.17 = (105.6 to 120.0)

§11.5 Paired Samples • Paired samples: each point in one sample is matched to a unique point in the other sample • Pairs can be achieved via sequential samples within individuals (e.g., pre-test/post-test), cross-over trials, and matching procedures • Also called "matched-pairs" and "dependent samples"

Example: Paired Samples • A study addresses whether oat bran reduces LDL cholesterol with a cross-over design. • Subjects "cross over" from a cornflake diet to an oat bran diet. - Half the subjects start on CORNFLK, half on OATBRAN - Two weeks on diet 1 - Measure LDL cholesterol - Washout period - Switch diets - Two weeks on diet 2 - Measure LDL cholesterol

Example, Data
Subject CORNFLK OATBRAN
---- ------- -------
1 4.61 3.84
2 6.42 5.57
3 5.40 5.85
4 4.54 4.80
5 3.98 3.68
6 3.82 2.96
7 5.01 4.41
8 4.34 3.72
9 3.80 3.49
10 4.56 3.84
11 5.35 5.26
12 3.89 3.73

Calculate Difference Variable "DELTA" • Step 1 is to create the difference variable "DELTA" • Let DELTA = CORNFLK − OATBRAN • The order of subtraction does not materially affect results (but it does change the sign of the differences) • Here are the first three observations; positive values represent lower LDL on oat bran:
ID CORNFLK OATBRAN DELTA
---- ------- ------- -----
1 4.61 3.84 0.77
2 6.42 5.57 0.85
3 5.40 5.85 -0.45

Explore DELTA Values Here are all twelve paired differences (DELTAs): 0.77, 0.85, −0.45, −0.26, 0.30, 0.86, 0.60, 0.62, 0.31, 0.72, 0.09, 0.16. Stemplot:
|-0|24
|+0|0133
|+0|667788
(×1)
EDA shows a slight negative skew, a median of about 0.45, with results varying from −0.4 to 0.8.

Descriptive stats for DELTA • Data (DELTAs): 0.77, 0.85, −0.45, −0.26, 0.30, 0.86, 0.60, 0.62, 0.31, 0.72, 0.09, 0.16 • The subscript d will be used to denote statistics for the difference variable DELTA: n = 12, x̄d = 0.3808, sd = 0.4335

95% Confidence Interval for µd • A t procedure directed toward the DELTA variable calculates the confidence interval for the mean difference: (1 − α)100% CI for μd = x̄d ± t(n−1,1−α/2) × sd/√n • "Oat bran" data: for 95% confidence use t11,.975 = 2.201 (from Table C): 95% CI for μd = 0.3808 ± (2.201)(0.4335/√12) = 0.3808 ± 0.2754 = (0.105 to 0.656)

Paired t Test • Similar to the one-sample t test • μ0 is usually set to 0, representing "no mean difference", i.e., H0: μd = 0 • Test statistic: tstat = (x̄d − 0) / (sd/√n) with df = n − 1

Paired t Test: Example "Oat bran" data A. Hypotheses. H0: µd = 0 vs. Ha: µd ≠ 0 B. Test statistic: tstat = (0.3808 − 0) / (0.4335/√12) = 3.04 with df = 12 − 1 = 11 C. P-value. P = 0.011 (via computer). The evidence against H0 is statistically significant. D. Significance level (optional). The evidence against H0 is significant at α = .05 but is not significant at α = .01

SPSS Output: Oat Bran data

§11.6 Conditions for Inference t procedures require these conditions: • SRS (individual observations or DELTAs) • Valid information (no information bias) • Normal population or large sample (central limit theorem)

The Normality Condition • The Normality condition applies to the sampling distribution of the mean, not the population. • Therefore, it is OK to use t procedures when: - The population is Normal - The population is not Normal but is symmetrical and n is at least 5 to 10 - The population is skewed and n is at least 30 to 100 (depending on the extent of the skew)

Can t procedures be used? • Dataset A is skewed and small: avoid t procedures • Dataset B has a mild skew and is moderate in size: use t procedures • Dataset C is highly skewed and small: avoid t procedures

§11.7 Sample Size and Power • Questions: - How big a sample is needed to limit the margin of error to m? - How big a sample is needed to test H0 with 1 − β power at significance level α? - What is the power of a test given certain conditions?

Sample Size for a Confidence Interval • To limit the margin of error of a (1 − α)100% confidence interval for μ to m, the sample size should be no less than n = (z(1−α/2) × σ / m)² • where - α ≡ desired significance level - m ≡ desired margin of error - σ ≡ population standard deviation • This works directly when n ≥ 30 because t(30,1−α/2) ≈ z(1−α/2) • When n < 30, apply the adjustment factor f = (df + 3)/(df + 1) to compensate for the difference between z and t

Sample Size Requirements (for a test): n = σ² × (z(1−β) + z(1−α/2))² / Δ², where 1 − β ≡ desired power of the test, α ≡ desired significance level, σ ≡ population standard deviation, and Δ = μ0 − μa ≡ the difference worth detecting

Cont. • For one-sided tests, use z(1−α) in place of z(1−α/2) in the formula. • For paired t tests, use σd (the standard deviation of DELTA) in place of σ. • Apply the adjustment factor f = (df + 3)/(df + 1) when n ≤ 30 to compensate for the difference between z and t.

Power: 1 − β = Φ(−z(1−α/2) + |Δ|√n / σ), where: • α ≡ (two-sided) alpha level of the test • Δ ≡ "the mean difference worth detecting" (i.e., the mean under the alternative hypothesis minus the mean under the null hypothesis) • n ≡ sample size • σ ≡ standard deviation in the population • Φ(z) ≡ the cumulative probability of z on a Standard Normal distribution [Table B]

Power: Illustrative Example, SIDS Birth Weight Consider the SIDS illustration in which n = 10 and σ is assumed to be 720 gms. Let α = 0.05 (two-sided). What is the power of a test under these conditions to detect a mean difference of 300 gms?
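The SIDS t test, confidence interval, and power calculation above can be reproduced from the summary statistics with a short Python/SciPy sketch (a numerical check on the hand calculations, using the formulas as given in this chapter):

```python
from math import sqrt
from scipy import stats

# One-sample t test from summary statistics (SIDS example)
n, xbar, s, mu0 = 10, 2890.5, 720.0, 3300.0
se = s / sqrt(n)                                  # estimated SE of the mean
tstat = (xbar - mu0) / se
p = 2 * stats.t.sf(abs(tstat), df=n - 1)          # two-sided P-value
print(f"t = {tstat:.2f}, P = {p:.4f}")            # t = -1.80, P = 0.1054

# 95% confidence interval for mu
tcrit = stats.t.ppf(0.975, df=n - 1)              # t(9, .975) = 2.262
print(xbar - tcrit * se, xbar + tcrit * se)       # about (2375.4, 3405.6)

# Power to detect Delta = 300 g at two-sided alpha = 0.05:
# 1 - beta = Phi(-z_(1-alpha/2) + |Delta| * sqrt(n) / sigma)
sigma, delta, alpha = 720.0, 300.0, 0.05
z = stats.norm.ppf(1 - alpha / 2)
power = stats.norm.cdf(-z + abs(delta) * sqrt(n) / sigma)
print(f"power = {power:.2f}")                     # about 0.26
```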
chapter 12
In Chapter 12: 12.1 Paired and Independent Samples 12.2 Exploratory and Descriptive Statistics 12.3 Inference About the Mean Difference 12.4 Equal Variance t Procedure (Optional) 12.5 Conditions for Inference 12.6 Sample Size and Power

Sample Types (for Comparing Means) • Single sample: one group; no concurrent control group; comparisons made to an external population (Ch 11) • Paired samples: two samples with each data point in one sample uniquely matched to a point in the other; analyze within-pair differences (Ch 11) • Two independent samples: two separate groups; no matching or pairing; compare separate groups • Quantitative outcome → one sample (§11.1-§11.4); two samples → paired samples (§11.5) or independent samples (Chapter 12)

What Type of Sample? 1. Measure vitamin content in loaves of bread and see if the average meets national standards 2. Compare vitamin content of bread loaves immediately after baking versus values in the same loaves 3 days later 3. Compare vitamin content of bread immediately after baking versus loaves that have been on the shelf for 3 days. Answers: 1 = single sample; 2 = paired samples; 3 = independent samples

Illustrative Example: Cholesterol and Type A & B Personality Do fasting cholesterol levels differ in Type A and Type B personality men? Data (mg/dl) are a subset from the Western Collaborative Group Study* Group 1 (Type A personality): 233, 291, 312, 250, 246, 197, 268, 224, 239, 239, 254, 276, 234, 181, 248, 252, 202, 218, 212, 325 Group 2 (Type B personality): 344, 185, 263, 246, 224, 212, 188, 250, 148, 169, 226, 175, 242, 252, 153, 183, 137, 202, 194, 213 * Data set is documented on p. 56 in the text.

SPSS Data Table • One column for the response variable (chol) • One column for the explanatory variable (group)

§12.2: Exploratory & Descriptive Methods • Start with EDA • Compare group shapes, locations, and spreads • Examples of applicable techniques - Side-by-side stemplots (at right) - Side-by-side boxplots (next slide)
Group 1 | | Group 2
--------------------
|1t|3
|1f|45
|1s|67
98|1.|8889
110|2*|011
33332|2t|22
55544|2f|4455
76|2s|6
9|2.|
21|3*|
|3t|
|3f|4
(×100)

Figure 12.2 Side-by-side boxplots of the cholesterol illustrative data. Interpretation: • Location: group 1 > group 2 • Spreads: group 1 < group 2 • Shapes: both fairly symmetrical, outside values in each; no major departures from Normality

Summary Statistics
Group n mean std dev
1 20 245.05 36.64
2 20 210.30 48.34

§12.3 Inference About the Mean Difference (Notation) Parameters (population): Group 1: N1, µ1, σ1; Group 2: N2, µ2, σ2. Statistics (sample): Group 1: n1, x̄1, s1; Group 2: n2, x̄2, s2. x̄1 − x̄2 is the point estimator of µ1 − µ2.

Standard Error of the Mean Difference • How precise is x̄1 − x̄2 as an estimator of μ1 − μ2? • SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)

Standard Error of the Mean Difference There are two ways to estimate the degrees of freedom for this SE: • dfWelch = formula on p. 274 [calculate with a computer] • dfconservative = the smaller of (n1 − 1) or (n2 − 1). For the cholesterol comparison data: SE(x̄1 − x̄2) = √(36.64²/20 + 48.34²/20) = 13.563; dfWelch = 35.4 (via SPSS); dfconservative = the smaller of (n1 − 1) or (n2 − 1) = 20 − 1 = 19

Confidence Interval for µ1 − µ2 (1 − α)100% confidence interval for µ1 − µ2 = (x̄1 − x̄2) ± t(df,1−α/2) × SE(x̄1 − x̄2). For the cholesterol comparison data, with SE = 13.563 and dfconserv = 19 (prior slide), t19,.975 = 2.093: 95% CI = (245.05 − 210.30) ± (2.093)(13.563) = 34.75 ± 28.38 = (6.4 to 63.1) mg/dL

Comparison of CI Formulas All are of the form (point estimate) ± (t*)(SE):
Type of sample | point estimate | df for t* | SE
single | x̄ | n − 1 | s/√n
paired | x̄d | n − 1 | sd/√n
independent | x̄1 − x̄2 | smaller of n1 − 1 or n2 − 1 | √(s1²/n1 + s2²/n2)

Hypothesis Test A. Hypotheses. H0: μ1 = μ2 against Ha: μ1 ≠ μ2 (two-sided) [Ha: μ1 > μ2 (right-sided); Ha: μ1 < μ2 (left-sided)] B. Test statistic. tstat = (x̄1 − x̄2) / SE(x̄1 − x̄2), where SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2) and df = dfWelch or dfconserv (described on the previous slide) C. P-value. Convert the tstat to a P-value with a t table or software. Interpret. D. Significance level (optional). Compare P to a prior specified α level.

Hypothesis Test - Example A. Hypotheses. H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2 B. Test stat. In prior analyses we calculated the sample mean difference = 34.75 mg/dL, SE = 13.563, and dfconserv = 19: tstat = 34.75 / 13.563 = 2.56 with 19 df C. P-value. P = 0.019 → good evidence against H0 ("significant difference"). D. Significance level (optional). The evidence against H0 is significant at α = 0.02 but not at α = 0.01.

SPSS Output: the output reports both the equal variance t procedure (§12.4) and the unequal variance procedure, the preferred method of §12.3.

12.4 Equal Variance t Procedure (Optional) • Also called the pooled variance t procedure • Not as robust as the prior method, but... • Historically important • Calculated by software programs • Leads to advanced ANOVA techniques

Pooled Variance Procedure We start by calculating this pooled estimate of variance: s²pooled = (df1 × s1² + df2 × s2²) / (df1 + df2), where dfi = ni − 1 and si² is the variance in group i • The pooled variance is used to calculate this standard error estimate: SE(x̄1 − x̄2) = spooled × √(1/n1 + 1/n2) • Confidence interval: (x̄1 − x̄2) ± t(df,1−α/2) × SE(x̄1 − x̄2) • Test statistic: tstat = (x̄1 − x̄2) / SE(x̄1 − x̄2) • All with df = df1 + df2 = (n1 − 1) + (n2 − 1)

Pooled Variance t Confidence Interval Data: Group 1: n = 20, s = 36.64, x̄ = 245.05; Group 2: n = 20, s = 48.34, x̄ = 210.30. s²pooled = 1839.623; df = (20 − 1) + (20 − 1) = 38; SE(x̄1 − x̄2) = √[1839.623 × (1/20 + 1/20)] = 13.56. 95% CI for µ1 − µ2 = (245.05 − 210.30) ± (2.02)(13.56) = 34.75 ± 27.39 = (7.36, 62.14)

Pooled Variance t Test H0: μ1 = μ2; tstat = 34.75 / 13.56 = 2.56 with df = 38; P = 0.015

§12.5 Conditions for Inference • Conditions required for t procedures: • "Validity conditions" a) Good information (no information bias) b) Good sample ("no selection bias") c) "No confounding" • "Sampling conditions" a) Independence b) Normal sampling distribution (§9.5, §11.6)
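Both the Welch (§12.3) and pooled (§12.4) procedures can be run directly from the group summary statistics; here is a minimal SciPy sketch using the cholesterol numbers above:

```python
from scipy import stats

# Group summary statistics (Type A vs. Type B personality, cholesterol in mg/dL)
m1, s1, n1 = 245.05, 36.64, 20
m2, s2, n2 = 210.30, 48.34, 20

# Unequal variance (Welch) procedure -- the preferred method of section 12.3
t_w, p_w = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(f"Welch:  t = {t_w:.2f}, P = {p_w:.4f}")

# Equal variance (pooled) procedure of section 12.4, df = (n1-1) + (n2-1) = 38
t_p, p_p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
print(f"Pooled: t = {t_p:.2f}, P = {p_p:.4f}")    # t = 2.56, P = 0.015
```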
chapter 14 & 15
Quantitative response variable Y ("dependent variable") • Quantitative explanatory variable X ("independent variable") • Historically important public health data set used to illustrate techniques (Doll, 1955) - n = 11 countries - Explanatory variable = per capita cigarette consumption in 1930 (CIG1930) - Response variable = lung cancer mortality per 100,000 (LUNGCA)

Table 14.2 Data used for chapter illustrations. Per capita cigarette consumption in 1930 (cig1930) and lung cancer cases per 100,000 in 1950 (lungca) in 11 countries

Inspect scatterplots • Form: Can the relation be described with a straight or some other type of line? • Direction: Do points tend to trend upward or downward? • Strength of association: Do points adhere closely to an imaginary trend line? • Outliers (if any): Are there any striking deviations from the overall pattern?

Judging Correlational Strength • Correlational strength refers to the degree to which points adhere to a trend line • The eye is not a good judge of strength. • The top plot appears to show a weaker correlation than the bottom plot. However, these are plots of the same data set. (The perception of a difference is an artifact of axis scaling.)

The correlation coefficient r quantifies the linear relationship with a number between −1 and 1. • When all points fall on a line with an upward slope, r = 1. When all data points fall on a line with a downward slope, r = −1 • When data points trend upward, r is positive; when data points trend downward, r is negative. • The closer r is to 1 or −1, the stronger the correlation.

The correlation coefficient tracks the degree to which X and Y "go together." • Recall that z scores quantify the amount a value lies above or below its mean in standard deviation units. • When z scores for X and Y track in the same direction, their products are positive and r is positive (and vice versa).

Calculating r • In practice, we rely on computers and calculators to calculate r. - SPSS - Scientific and graphing calculators • I encourage my students to use these tools whenever possible.

Interpretation of r 1. Direction. The sign of r indicates the direction of the association: positive (r > 0), negative (r < 0), or no association (r ≈ 0). 2. Strength. The closer r is to 1 or −1, the stronger the association. 3. Coefficient of determination. The square of the correlation coefficient (r²) is called the coefficient of determination. This statistic quantifies the proportion of the variance in Y [mathematically] "explained" by X. For the illustrative data, r = 0.737 and r² = 0.54. Therefore, 54% of the variance in Y is explained by X.

Notes, cont. 4. Reversible relationship. With correlation, it does not matter whether variable X or Y is specified as the explanatory variable; calculations come out the same either way. [This will not be true for regression.] 5. Outliers. Outliers can have a profound effect on r. This figure has an r of 0.82 that is fully accounted for by the single outlier (see next slide).

Figure 14.4 The calculated correlation for this data set is r = 0.82. The single influential observation in the upper-right quadrant accounts for this large r.

Correlation does not necessarily mean causation. Beware lurking variables.
• A near perfect negative correlation (r = −0.987) was seen between cholera mortality and elevation above sea level during a 19th century epidemic • We now know that cholera is transmitted by water. • The observed relationship between cholera and elevation was confounded by the lurking variable proximity to polluted water. • See next slide

Figure 14.6 Cholera mortality and elevation above sea level were strongly correlated in the 1850s (r = −0.987), but this correlation was an artifact of confounding by the extraneous factor of "water source."

Hypothesis Test • Random selection from a random scatter can result in an apparent correlation • We conduct the hypothesis test to guard against identifying too many random correlations. A. Hypotheses. Let ρ represent the population correlation coefficient. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided) [or Ha: ρ > 0 (right-sided) or Ha: ρ < 0 (left-sided)] B. Test statistic: tstat = r√(n − 2) / √(1 − r²) with df = n − 2 C. P-value. Convert tstat to a P-value with software or Table C.

Conditions for Inference • Independent observations • Bivariate Normality (r can still be used descriptively when data are not bivariate Normal)

Figure 14.8 Bivariate Normality

§14.4. Regression describes the relationship in the data with a line that predicts the average change in Y per unit X. • The best fitting line is found by minimizing the sum of squared residuals, as shown in this figure.

Figure 14.9 Fitted regression line and residuals, smoking and lung cancer illustrative data

Analysis of Variance of the Regression Model • An ANOVA technique equivalent to the t test can also be used to test H0: β = 0. • This technique is covered on pp. 321-324 in the text but is not included in this presentation.

Conditions for Inference • Inference about the regression line requires these conditions - Linearity - Independent observations - Normality at each level of X - Equal variance at each level of X

Figure 14.12 Population regression model showing Normality and homoscedasticity conditions

Assessing Conditions • The scatterplot should be visually inspected for linearity, Normality, and equal variance • Plotting the residuals from the model can be helpful in this regard • The table on the next slide lists residuals for the illustrative data

Assessing Conditions, cont. • A stemplot of the residuals shows no major departures from Normality • A residual plot shows more variability at higher X values (but the data are very sparse) • See next slide

Figure 14.15 Residual plot for illustrative data set

Residual Plots • With a little experience, you can get good at reading residual plots. • On the next three slides, see: A. An example of linearity with equal variance B. An example of linearity with unequal variance C. An example of non-linearity with equal variance

Simple regression considers the relation between a single explanatory variable X and response variable Y. Multiple regression considers the relation between multiple explanatory variables (X1, X2, ..., Xk) and response variable Y. The multiple regression model helps "adjust out" the effects of confounding factors (i.e., extraneous lurking variables that bias results).
• Simple linear concepts (Chapter 14) must be mastered before tackling multiple regression

Estimates for the model are derived by minimizing ∑residuals² to come up with estimates for an intercept (denoted a) and slope (b). The standard error of the regression (sY|x) quantifies variability around the line.

The multiple regression population model with two explanatory variables is: μY|x = α + β1x1 + β2x2, where μY|x ≡ expected value of Y given values x1 and x2, α ≡ intercept parameter, β1 ≡ slope parameter for X1, and β2 ≡ slope parameter for X2

Multiple Regression Model, cont. Estimates for the coefficients are derived by minimizing ∑residuals² to derive the fitted multiple regression model. The standard error of the regression quantifies variability around the regression plane.

Regression Modeling • A simple regression model fits a line in two-dimensional space • A multiple regression model with two explanatory variables fits a regression plane in three-dimensional space • A multiple regression model with k explanatory variables fits a regression "surface" in k + 1 dimensional space (cannot be visualized)

Interpretation of Coefficients • Here's a model with two independent variables (see Figure 15.1, previous slide) • The intercept predicts where the regression plane crosses the Y axis • The slope for variable X1 predicts the change in Y per unit X1, holding X2 constant • The slope for variable X2 predicts the change in Y per unit X2, holding X1 constant

15.3 Categorical Explanatory Variables in Regression Models • Categorical explanatory (independent) variables can be fit into a regression model by converting them into 0/1 indicator variables ("dummy variables") • For binary variables, code the variable 0 for "no" and 1 for "yes" • For categorical variables with k categories, use k − 1 dummy variables.

Dummy Variables, cont. • In this example, the variable SMOKE has three levels initially coded 0 = non-smoker, 1 = former smoker, 2 = current smoker. • To use this variable in a regression model, convert it to two dummy variables: one indicating former smokers and one indicating current smokers, with non-smokers (both dummies = 0) as the reference category.

Illustrative Example: Childhood respiratory health survey • A binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker. • The response variable Forced Expiratory Volume (FEV) is measured in liters/second • The mean FEV in nonsmokers is 2.566 • The mean FEV in smokers is 3.277 • A regression model is applied (regress FEV on SMOKE)

Illustrative Example, cont. • The least squares regression line is ŷ = 2.566 + 0.711X • The intercept (2.566) = the mean FEV of group 0 • The slope = the mean difference = 3.277 − 2.566 = 0.711 • tstat = 6.464 with 652 df, P ≈ 0.000 (same as the equal variance t test) • The 95% CI for the slope is 0.495 to 0.927 (same as the 95% CI for μ1 − μ2)

Confounding, Illustrative Example • In our example, the children who smoked had higher mean FEV than the children who did not smoke • How can this be true given what we know about the deleterious respiratory effects of smoking?
• The answer lies in the fact that the smokers were older than the nonsmokers: AGE confounded the relationship between SMOKE and FEV • A multiple regression model can adjust for AGE in this situation

Multiple Regression Coefficients, cont. • The slope coefficient for SMOKE is −0.206, suggesting that smokers have 0.206 less FEV on average compared to non-smokers (after adjusting for age) • The slope coefficient for AGE is 0.231, suggesting that each year of age is associated with an increase of 0.231 FEV units on average (after adjusting for SMOKE)

15.6 Examining Regression Conditions • Conditions for multiple regression mirror those of simple regression - Linearity - Independence - Normality - Equal variance • These can be evaluated by analyzing the pattern of the residuals
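Here is a minimal Python sketch of the correlation and simple-regression machinery from these two chapters (SciPy/NumPy assumed; all x and y values are hypothetical stand-ins, not the Doll data or the FEV survey):

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) pairs: explanatory X and quantitative response Y
x = np.array([220, 250, 310, 380, 455, 460, 510, 530, 1115, 1145, 1280], float)
y = np.array([58, 90, 115, 165, 170, 245, 150, 250, 350, 465, 190], float)

res = stats.linregress(x, y)   # least-squares line, r, and P-value for H0: beta = 0
print(f"r = {res.rvalue:.3f}, r^2 = {res.rvalue**2:.2f}")
print(f"yhat = {res.intercept:.1f} + {res.slope:.3f}x  (P = {res.pvalue:.4f})")

# With a 0/1 dummy predictor, the fitted intercept equals the mean of group 0
# and the slope equals the difference in group means, as in the FEV-on-SMOKE example
smoke = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], float)    # hypothetical 0/1 codes
fev = np.array([2.1, 2.4, 2.6, 2.8, 2.9, 3.0, 3.2, 3.3, 3.4, 3.5])
dummy = stats.linregress(smoke, fev)
print(dummy.intercept, dummy.slope)   # mean of group 0, mean difference
```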
Chapter One
Statistics is not merely a compilation of computational techniques • Statistics - is a way of learning from data - is concerned with all elements of study design, data collection, and analysis of numerical data - does require judgment • Biostatistics is statistics applied to biological and health problems

Biostatisticians are: • Data detectives - who uncover patterns and clues - This involves exploratory data analysis (EDA) and descriptive statistics • Data judges - who judge and confirm clues - This involves statistical inference

Measurement • Measurement (defined): the assigning of numbers and codes according to prior-set rules (Stevens, 1946). • There are three broad types of measurements: - Categorical - Ordinal - Quantitative

Measurement Scales • Categorical - classify observations into named categories - e.g., HIV status classified as "positive" or "negative" • Ordinal - categories that can be put in rank order - e.g., stage of cancer classified as stage I, stage II, stage III, stage IV • Quantitative - true numerical values that can be put on a number line - e.g., age (years) - e.g., serum cholesterol (mg/dL)

Illustrative Example: Weight Change and Heart Disease • This study sought to determine the effect of weight change on coronary heart disease risk. • It studied 115,818 women 30 to 55 years of age, free of CHD, over 14 years. • Measurements included - Body mass index (BMI) at study entry - BMI at age 18 - CHD case onset (yes or no) Source: Willett et al., 1995

Illustrative Example (cont.) Examples of Variables • Smoker (current, former, no) • CHD onset (yes or no) • Family history of CHD (yes or no) • Non-smoker, light smoker, moderate smoker, heavy smoker • BMI (kg/m²) • Age (years) • Weight presently • Weight at age 18 • Each of these variables can be classified as categorical, ordinal, or quantitative

Variable, Value, Observation • Observation ≡ the unit upon which measurements are made; can be an individual or an aggregate • Variable ≡ the generic thing we measure - e.g., AGE of a person - e.g., HIV status of a person • Value ≡ a realized measurement - e.g., "27" - e.g., "positive"

Figure 1.1 Four observations with five variables each

Data Table
AGE SEX HIV ONSET INFECT
24 M Y 12-OCT-07 Y
14 M N 30-MAY-05 Y
32 F N 11-NOV-06 N

• Each row corresponds to an observation • Each column contains information on a variable • Each cell in the table contains a value • In some data sets the unit of observation is an aggregate, such as a region, not an individual person.

Data Quality • An analysis is only as good as its data • GIGO ≡ garbage in, garbage out • Does a variable measure what it purports to? - Validity = freedom from systematic error - Objectivity = seeing things as they are without making them conform to a worldview • Consider how the wording of a question can influence validity and objectivity

Choose Your Ethos • BS is manipulative and has a predetermined outcome. • Science "bends over backwards" to consider alternatives.

Scientific Ethos "I cannot give any scientist of any age any better advice than this: The intensity of the conviction that a hypothesis is true has no bearing on whether it is true or not." Peter Medawar
Chapter 3
Stem-and-leaf plots (stemplots) • Always start by looking at the data with graphs and plots • Our favorite technique for looking at a single variable is the stemplot • A stemplot is a graphical technique that organizes data into a histogram-like display

Stemplot Illustrative Example • Select an SRS of 10 ages • List data as an ordered array: 05 11 21 24 27 28 30 42 50 52 • Divide each data point into a stem-value and a leaf-value • In this example the "tens place" will be the stem-value and the "ones place" will be the leaf-value, e.g., 21 has a stem-value of 2 and a leaf-value of 1

Stemplot illustration (cont.) • Draw an axis for the stem-values:
0|
1|
2|
3|
4|
5|
×10 axis multiplier (important!)
• Place leaves next to their stem-value, e.g., 21 is plotted as a leaf of 1 on the stem of 2

Stemplot illustration continued ... • Plot all data points and rearrange in rank order:
0|5
1|1
2|1478
3|0
4|2
5|02
×10
• Here is the plot horizontally (for demonstration purposes):

      8
      7
      4     2
  5 1 1 0 2 0
  ------------
  0 1 2 3 4 5
  ------------

Rotated stemplot

Interpreting Stemplots: Shape • Symmetry • Modality (number of peaks) • Kurtosis (width of tails) • Departures (outliers)

Interpreting Stemplots: Location • Gravitational center → mean • Middle value → median

Interpreting Stemplots: Spread • Range and inter-quartile range • Standard deviation and variance (Chapter 4)

Shape • "Shape" refers to the pattern when plotted • Here's the "skyline silhouette" of our data:

      X
      X
      X     X
  X X X X X X
  -----------
  0 1 2 3 4 5
  -----------

• Consider: symmetry, modality, kurtosis

Figure 3.1 Histogram with overlying curve showing distribution's shape

Figure 3.2 Examples of distributional shapes: modality (no. of peaks) and kurtosis (steepness) - mesokurtic (medium), platykurtic (flat, skinny tails), leptokurtic (steep, fat tails). Kurtosis cannot easily be judged by eye.

Location: Mean • "Eye-ball method": visualize where the plot would balance; for these data, around 25 to 30 (takes practice) • Arithmetic method: sum the values and divide by n; mean = 290 / 10 = 29

Location: Median • Ordered array: 05 11 21 24 27 28 30 42 50 52 • The median has a depth of (n + 1) ÷ 2 on the ordered array • When n is even, average the points adjacent to this depth • For the illustrative data: n = 10, median's depth = (10 + 1) ÷ 2 = 5.5 → the median falls between 27 and 28 • See Chapter 4 for details regarding the median

Spread: Range • Range = minimum to maximum • The easiest but not the best way to describe spread (better methods of describing spread are presented in the next chapter) • For the illustrative data the range is "from 5 to 52"

Stemplot - Second Example • Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42 • Stem = ones-place • Leaves = tenths-place • Truncate extra digits (e.g., 1.47 → 1.4); do not plot the decimal
|1|4
|2|03
|3|4779
|4|4
(×1)
• Center: between 3.4 & 3.7 (underlined) • Spread: 1.4 to 4.4 • Shape: mound, no outliers

Third Illustrative Example (n = 25) • Data: {14, 17, 18, 19, 22, 22, 23, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38} • Regular stemplot:
|1|4789
|2|223466789
|3|000123445678
(×1)
• Too squished to see shape

Third Illustration (n = 25), cont. • Split stem: the first "1" on the stem holds leaves between 0 to 4, the second "1" holds leaves between 5 to 9, and so on.
• Split-stem stemplot:
|1|4
|1|789
|2|2234
|2|66789
|3|00012344
|3|5678
(×1)
• Negative skew now evident

How many stem-values? • Start with between 4 and 12 stem-values • Trial and error: - Try different stem multipliers - Try splitting stems - Look for the most informative plot

Table 3.3 Body weight (pounds) of students in a class, n = 53. Data range from 100 to 260 lbs: • ×100 axis multiplier → only two stem-values (1×100 and 2×100) → too broad • ×100 axis multiplier with split stem → only 4 stem-values → might be OK(?) • ×10 axis multiplier → see next slide

Fourth Stemplot Example (n = 53)
10|0166
11|009
12|0034578
13|00359
14|08
15|00257
16|555
17|000255
18|000055567
19|245
20|3
21|025
22|0
23|
24|
25|
26|0
(×10)
Looks good! Shape: positive skew, high outlier (260). Location: median about 165 (underlined in the original slide). Spread: from 100 to 260.

Quintuple-Split Stem Values
1*|0000111
1t|222222233333
1f|4455555
1s|666777777
1.|888888888999
2*|0111
2t|2
2f|
2s|6
(×100)
Codes for stem values: * for leaves 0 and 1, t for leaves two and three, f for leaves four and five, s for leaves six and seven, . for leaves eight and nine. For example, this is 120: 1t|2 (×100)

SPSS Stemplot
Frequency Stem & Leaf
2.00 3 . 0
9.00 4 . 0000
28.00 5 . 00000000000000
37.00 6 . 000000000000000000
54.00 7 . 000000000000000000000000000
85.00 8 . 000000000000000000000000000000000000000000
94.00 9 . 00000000000000000000000000000000000000000000000
81.00 10 . 0000000000000000000000000000000000000000
90.00 11 . 000000000000000000000000000000000000000000000
57.00 12 . 0000000000000000000000000000
43.00 13 . 000000000000000000000
25.00 14 . 000000000000
19.00 15 . 000000000
13.00 16 . 000000
8.00 17 . 0000
9.00 Extremes (>=18)
Stem width: 1
Each leaf: 2 case(s)
SPSS provides frequency counts with its stemplots. Because of the large n, each leaf represents 2 observations; "3 . 0" means 3.0 years.

Frequency Table • Frequency = count • Relative frequency = proportion or % • Cumulative frequency = % less than or equal to the level
AGE | Freq Rel.Freq Cum.Freq.
------+-----------------------
3 | 2 0.3% 0.3%
4 | 9 1.4% 1.7%
5 | 28 4.3% 6.0%
6 | 37 5.7% 11.6%
7 | 54 8.3% 19.9%
8 | 85 13.0% 32.9%
9 | 94 14.4% 47.2%
10 | 81 12.4% 59.6%
11 | 90 13.8% 73.4%
12 | 57 8.7% 82.1%
13 | 43 6.6% 88.7%
14 | 25 3.8% 92.5%
15 | 19 2.9% 95.4%
16 | 13 2.0% 97.4%
17 | 8 1.2% 98.6%
18 | 6 0.9% 99.5%
19 | 3 0.5% 100.0%
------+-----------------------
Total | 654 100.0%

Frequency Table with Class Intervals • When data are sparse, group data into class intervals • Create 4 to 12 class intervals • Classes can be uniform or non-uniform • End-point convention: e.g., the first class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99) • Tally frequencies • Calculate relative frequency • Calculate cumulative frequency

Class Intervals Uniform class intervals table (width 10) for the data: 05 11 21 24 27 28 30 42 50 52
Class | Freq | Relative Freq (%) | Cumulative Freq (%)
0 - 9.99 | 1 | 10 | 10
10 - 19 | 1 | 10 | 20
20 - 29 | 4 | 40 | 60
30 - 39 | 1 | 10 | 70
40 - 49 | 1 | 10 | 80
50 - 59 | 2 | 20 | 100
Total | 10 | 100 | --

Histogram A histogram is a frequency chart for a quantitative measurement (here, bars for age classes 0-9 through 50-59). Notice how the bars touch.

Bar Chart A bar chart with non-touching bars (here, bars for school level: Pre-, Elem., Middle, High) is reserved for categorical measurements and non-uniform class intervals
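The stem-and-leaf construction described above (tens-place stems, next-digit leaves, truncation rather than rounding) is mechanical enough to script. Here is a minimal Python sketch (the function name and its stem_div parameter are my own; it assumes stem_div is at least 10):

```python
def stemplot(data, stem_div=10):
    """Print a stemplot: stem = value // stem_div; leaf = next digit (truncated)."""
    stems = {}
    for value in sorted(data):
        stem = int(value) // stem_div
        leaf = (int(value) // (stem_div // 10)) % 10   # extra digits truncated
        stems.setdefault(stem, []).append(leaf)
    for stem in range(min(stems), max(stems) + 1):     # keep empty stems visible
        print(f"{stem}|{''.join(str(l) for l in stems.get(stem, []))}")
    print(f"(x{stem_div})")

stemplot([5, 11, 21, 24, 27, 28, 30, 42, 50, 52])
# 0|5
# 1|1
# 2|1478
# 3|0
# 4|2
# 5|02
# (x10)
```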
Chapter 3
Stemplots (p. 42)
Chapter 4-7
Summary Statistics • Central location - Mean - Median - Mode • Spread - Range and interquartile range (IQR) - Variance and standard deviation • Shape summaries - seldom used in practice

Notation • n ≡ sample size • X ≡ the variable (e.g., ages of subjects) • xi ≡ the value of individual i for variable X • Σ ≡ sum all values (capital sigma) • Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52; n = 10; X = AGE variable; x1 = 21, x2 = 42, ..., x10 = 52; Σxi = x1 + x2 + ... + x10 = 21 + 42 + ... + 52 = 290

§4.1: Central Location: Sample Mean • "Arithmetic average" • Traditional measure of central location • Sum the values and divide by n • "xbar" refers to the sample mean: x̄ = (1/n)Σxi = (x1 + x2 + ... + xn)/n

Example: Sample Mean Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52. Note that n = 10 and Σxi = 21 + 42 + ... + 52 = 290, so x̄ = (1/n)Σxi = (1/10)(290) = 29.0

Figure 4.1 The mean is the balancing point of a distribution

Uses of the Sample Mean • The sample mean can be used to predict - the value of an observation drawn at random from the sample - the population mean

Population Mean • Same operation as the sample mean except based on the entire population (N ≡ population size): μ = (1/N)Σxi • Conceptually important • Usually not available in practice • Sometimes referred to as the expected value

§4.2 Central Location: Median The median is the value with a depth of (n + 1)/2. When n is even, average the two values that straddle a depth of (n + 1)/2. For the 10 values listed below, the median has depth (10 + 1)/2 = 5.5, placing it between 27 and 28; average these two values to get median = 27.5
05 11 21 24 27 28 30 42 50 52 → median = average of 27 and 28 = 27.5

More Examples of Medians • Example A: 2 4 6 → median = 4 • Example B: 2 4 6 8 → median = 5 (average of 4 and 6) • Example C: 6 2 4 → median = 4, not 2 (values must be ordered first)

The Median is Robust • The median is more resistant to skews and outliers than the mean; it is more robust. • This data set has a mean of 1600: 1362 1439 1460 1614 1666 1792 1867 • Here's the same data set with a data entry error "outlier" (highlighted). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867 • The median is 1614 in both instances, demonstrating its robustness in the face of outliers.

§4.3: Mode • The mode is the most commonly encountered value in the dataset • This data set has a mode of 7: {4, 7, 7, 7, 8, 8, 9} • This data set has no mode: {4, 6, 7, 8} (each point appears only once) • The mode is useful only in large data sets with repeating values

Figure 4.4 Effect of a skew on the mean, median, and mode. Note how the mean gets pulled toward the longer tail more than the median. mean = median → symmetrical distribution; mean > median → positive skew; mean < median → negative skew

§4.5 Spread: Quartiles • Two distributions can be quite different yet have the same mean • These data compare particulate matter in air samples (μg/m³) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.
Site 1| |Site 2
----------------
42|2|
8|2|
2|3|234
86|3|6689
2|4|0
|4|
|5|
|5|
|6|
8|6|
×10

Spread: Range • Range = maximum − minimum • Illustrative example: Site 1 range = 86 − 22 = 64; Site 2 range = 40 − 32 = 8 • Beware: the sample range will tend to underestimate the population range.
• Always supplement the range with at least one additional measure of spread

Spread: Quartiles • Quartile 1 (Q1): cuts off the bottom quarter of the data = median of the lower half of the data set • Quartile 3 (Q3): cuts off the top quarter of the data = median of the upper half of the data set • Interquartile Range (IQR) = Q3 − Q1, covering the middle 50% of the distribution • For 05 11 21 24 27 28 30 42 50 52: Q1 = 21, median = 27.5, Q3 = 42, and IQR = 42 − 21 = 21

Quartiles (Tukey's Hinges) - Example 2 Data are metabolic rates (cal/day), n = 7: 1362 1439 1460 1614 1666 1792 1867 (median = 1614) • When n is odd, include the median in both halves of the data set. • Bottom half: 1362 1439 1460 1614, which has a median of 1449.5 (Q1) • Top half: 1614 1666 1792 1867, which has a median of 1729 (Q3)

Five-Point Summary • Q0 (the minimum) • Q1 (25th percentile) • Q2 (median) • Q3 (75th percentile) • Q4 (the maximum)

§4.6 Boxplots 1. Calculate the 5-point summary. Draw the box from Q1 to Q3 with a line at the median 2. Calculate the IQR and fences as follows: FenceLower = Q1 − 1.5(IQR); FenceUpper = Q3 + 1.5(IQR). Do not draw the fences 3. Determine if any values lie outside the fences (outside values). If so, plot these separately. 4. Determine the values inside the fences (inside values). Draw a whisker from Q3 to the upper inside value and a whisker from Q1 to the lower inside value

Illustrative Example: Boxplot Data: 05 11 21 24 27 28 30 42 50 52 1. 5-point summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line at 27.5 2. IQR = 42 − 21 = 21; FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5; FL = Q1 − 1.5(IQR) = 21 − (1.5)(21) = −10.5 3. No values above the upper fence; no values below the lower fence 4. Upper inside value = 52; lower inside value = 5. Draw whiskers

Illustrative Example: Boxplot 2 Data: 3 21 22 24 25 26 28 29 31 51 1. 5-point summary: 3, 22, 25.5, 29, 51; draw the box (lower hinge 22, median 25.5, upper hinge 29) 2. IQR = 29 − 22 = 7; FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5; FL = Q1 − 1.5(IQR) = 22 − (1.5)(7) = 11.5 3. One value above the top fence (51); one below the bottom fence (3); plot these separately as outside values 4. Upper inside value is 31; lower inside value is 21. Draw whiskers

Illustrative Example: Boxplot 3 Seven metabolic rates: 1362 1439 1460 1614 1666 1792 1867 (n = 7; data source: Moore, 2000) 1. 5-point summary: 1362, 1449.5, 1614, 1729, 1867 2. IQR = 1729 − 1449.5 = 279.5; FU = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25; FL = Q1 − 1.5(IQR) = 1449.5 − (1.5)(279.5) = 1030.25 3. No outside values 4. Whiskers end at 1867 and 1362

Boxplots: Interpretation • Location - position of the median - position of the box • Spread - hinge-spread (IQR) - whisker-to-whisker spread - range • Shape - symmetry or direction of skew - long whiskers (tails) indicate leptokurtosis

Side-by-side boxplots Boxplots are especially useful when comparing groups

§4.7 Spread: Standard Deviation • The most common descriptive measure of spread • Based on deviations around the mean • This figure demonstrates the deviations of two of its values

Figure 4.6 Deviations of two observations, site 2, air samples illustrative data, Table 4.2. This data set has a mean of 36. The data point 33 has a deviation of 33 − 36 = −3. The data point 40 has a deviation of 40 − 36 = 4.
Variance and Standard Deviation
• Deviation: xi − x̄
• Sum of squared deviations: SS = Σ(xi − x̄)²
• Sample variance: s² = SS / (n − 1)
• Sample standard deviation: s = √s²
• Standard deviation (formula): s = √[ Σ(xi − x̄)² / (n − 1) ]
The sample standard deviation s is the estimator of the population standard deviation σ. See "Facts About the Standard Deviation," p. 93.

Sum of Squares Illustrative Example: Standard Deviation (p. 92)
Observation (xi) | Deviation (xi − x̄) | Squared deviation (xi − x̄)²
       36        |    36 − 36 =  0     |    0² = 0
       38        |    38 − 36 =  2     |    2² = 4
       39        |    39 − 36 =  3     |    3² = 9
       40        |    40 − 36 =  4     |    4² = 16
       36        |    36 − 36 =  0     |    0² = 0
       34        |    34 − 36 = −2     | (−2)² = 4
       33        |    33 − 36 = −3     | (−3)² = 9
       32        |    32 − 36 = −4     | (−4)² = 16
     SUMS        |         0*          | SS = 58
* The sum of the deviations always equals zero.

Illustrative Example (cont.)
• Sample variance: s² = SS / (n − 1) = 58 / (8 − 1) = 8.286 (μg/m³)²
• Standard deviation: s = √8.286 = 2.88 μg/m³

Interpretation of Standard Deviation
• Measures spread (e.g., if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., more variability)
• 68-95-99.7 rule (next slide)
• Chebychev's rule (two slides hence)

68-95-99.7 Rule (Normal Distributions Only!)
• 68% of the data lie in the range μ ± σ
• 95% of the data lie in the range μ ± 2σ
• 99.7% of the data lie in the range μ ± 3σ
• Example. Suppose a variable has a Normal distribution with μ = 30 and σ = 10. Then:
68% of values are between 30 ± 10 = 20 to 40
95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50
99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60

Chebychev's Rule (All Distributions)
• Chebychev's rule says that at least 75% of the values will fall in the range μ ± 2σ (for any shape of distribution)
• Example: A distribution with μ = 30 and σ = 10 has at least 75% of its values in the range 30 ± (2)(10) = 10 to 50

Rules for Rounding
• Carry at least four significant digits during calculations
• Round as the last step of the operation
• Avoid pseudo-precision
• When in doubt, consult the APA Publication Manual
• Always report units
• Always use common sense and good judgment

Choosing Summary Statistics
• Always report a measure of central location, a measure of spread, and the sample size
• Symmetrical mound-shaped distributions: report the mean and standard deviation
• Odd-shaped distributions: report the 5-point summary (or median and IQR)

Software and Calculators
Use software and calculators to check work.
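As a quick software check of the computations in this chapter, here is a minimal Python sketch using only the standard library. The helper tukey_hinges is our own illustration of Tukey's hinge rule, not a built-in function; the data are the chapter's illustrative ages and Site 2 air samples.

import math
import statistics as st

# Illustrative ages (n = 10)
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

xbar = sum(ages) / len(ages)   # sample mean: 290/10 = 29.0
med = st.median(ages)          # median: 27.5

def tukey_hinges(data):
    """Q1 and Q3 as medians of the lower and upper halves.
    When n is odd, the overall median is included in both halves."""
    x = sorted(data)
    half = (len(x) + 1) // 2
    return st.median(x[:half]), st.median(x[-half:])

q1, q3 = tukey_hinges(ages)    # 21, 42
iqr = q3 - q1                  # 21
fence_lo = q1 - 1.5 * iqr      # -10.5
fence_hi = q3 + 1.5 * iqr      #  73.5

# Site 2 air-sample data (mean 36)
site2 = [36, 38, 39, 40, 36, 34, 33, 32]
m = sum(site2) / len(site2)               # 36.0
ss = sum((x - m) ** 2 for x in site2)     # SS = 58
s2 = ss / (len(site2) - 1)                # s^2 = 8.286 (ug/m^3)^2
s = math.sqrt(s2)                         # s = 2.88 ug/m^3
print(xbar, med, q1, q3, iqr, fence_lo, fence_hi, s2, round(s, 2))

Running the sketch reproduces the slide values (mean 29.0, median 27.5, hinges 21 and 42, fences −10.5 and 73.5, s² = 8.286, s ≈ 2.88), which is exactly the "use software to check work" advice above.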
Variability Within Groups
• Variability of data points within groups → quantifies random "noise"
• Based on a statistic called the Mean Square Within (MSW)

Notation
• SSW ≡ sum of squares within
• dfW ≡ degrees of freedom within
• N ≡ sample size, all groups combined
• ni ≡ sample size, group i
• si² ≡ variance of group i

Mean Square Within: Formula
• Sum of Squares Within: SSW = Σ (ni − 1)si²
• Degrees of Freedom Within: dfW = N − k (where k ≡ number of groups)
• Mean Square Within: MSW = SSW / dfW
(A worked sketch of this computation follows below.)

Figure 13.8
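To make the MSW formula concrete, here is a minimal Python sketch using the standard library; the three groups are hypothetical values for illustration only, not the pets and stress data.

import statistics as st

# Hypothetical data for k = 3 groups (illustration only)
groups = [
    [68, 72, 75, 70, 69],
    [80, 78, 85, 82, 79],
    [74, 71, 77, 73, 76],
]

k = len(groups)
N = sum(len(g) for g in groups)

# SSW pools the within-group variability: sum of (n_i - 1) * s_i^2
ssw = sum((len(g) - 1) * st.variance(g) for g in groups)
dfw = N - k          # degrees of freedom within
msw = ssw / dfw      # Mean Square Within
print(f"SSW = {ssw:.2f}, dfW = {dfw}, MSW = {msw:.2f}")

Note that MSW is just the pooled sample variance of the groups, which is why the equal variance assumption discussed later matters.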
Table D ("F Table")
• The F table has limited listings for df2.
• You often must round down to the next available df2 (rounding down is preferable because it gives a conservative estimate).
• Wedge the Fstat between listings to find the approximate P-value.

Bonferroni post hoc comparison, pets and stress illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint courtesy of International Business Machines Corporation.
P-values from the Bonferroni procedure are higher, and its confidence intervals broader, than those of the LSD method, reflecting its conservative approach.

The Equal Variance Assumption
• Conditions for ANOVA:
1. Sampling independence
2. Normal sampling distributions of the mean
3. Equal variance within population groups
• Let us focus on condition 3, since conditions 1 and 2 are covered elsewhere.
• Equal variance is called homoscedasticity (unequal variance = heteroscedasticity).
• Homoscedasticity allows us to pool the group variances to form the MSW.

Assessing "Equal Variance"
1. Graphical exploration. Compare spreads visually with side-by-side plots.
2. Descriptive statistics. If a group's standard deviation is more than twice that of another, be alerted to possible heteroscedasticity.
3. Test variances. A statistical test can be applied (next slide).

Levene's Test of Variances
A. Hypotheses. H0: σ1² = σ2² = ... = σk² versus Ha: at least one σi² differs
B. Test statistic. The test is performed by computer. The test statistic is a particular type of Fstat based on the rank-transformed deviations (see p. 317 for details).
C. P-value. The Fstat is converted to a P-value by the computational program. Interpretation of P is routine: a small P provides evidence against H0, suggesting heteroscedasticity.

Figure 13.16 Levene's test for the pets and stress illustrative example
A. H0: σ1² = σ2² = σ3² versus Ha: at least one σi² differs
B. SPSS output (below): Fstat = 0.059 with 2 and 42 df
C. P = 0.943. Very weak evidence against H0; retain the assumption of homoscedasticity.

Analyzing Groups with Unequal Variance
• Stay descriptive. Use summary statistics and EDA methods to compare groups.
• Remove outliers, if appropriate (p. 321).
• Mathematically transform the data to compensate for heteroscedasticity (e.g., a long right tail can be pulled in with a log transform).
• Use robust non-parametric methods.

13.6 Intro to Nonparametric Methods
• Many nonparametric procedures are based on rank-transformed data ("rank tests"). Here are examples:

The Kruskal-Wallis Test
• Let us explore the Kruskal-Wallis test as an example of a non-parametric test.
• The Kruskal-Wallis test is the non-parametric analogue of one-way ANOVA.
• It does not require the Normality or equal variance conditions for inference.
• It is based on rank-transformed data: we see whether the mean ranks in the groups differ significantly.

Kruskal-Wallis Test
• The K-W hypotheses can be stated in terms of means or medians (depending on assumptions made about the population shapes). Let us use the latter.
• Let Mi ≡ the median of population i; there are k groups.
• H0: M1 = M2 = ... = Mk
• Ha: at least one Mi differs
Table 13.11 Kruskal-Wallis example: descriptive statistics for the alcohol and income data

Kruskal-Wallis Test, Example. Results of Levene's test for equal variance, alcohol and income illustrative example.
• We wish to test whether the means differ significantly but find graphical and hypothesis-testing evidence that the population variances are unequal.
• See also the figure on the next slide.

Kruskal-Wallis Test, Example, cont.
A. Hypotheses. H0: M1 = M2 = M3 = M4 = M5 versus Ha: at least one Mi differs
B. Test statistic. Some computer programs use a chi-square statistic based on a Normal approximation. SPSS derives a chi-square statistic = 7.793 with 4 df (next slide).
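For readers working outside SPSS, here is a minimal Python sketch of the same two-step workflow (Levene's test, then Kruskal-Wallis) using SciPy; the three groups are made-up values for illustration, not the alcohol and income data.

from scipy import stats

# Made-up data for three groups (illustration only)
g1 = [12.1, 14.3, 11.8, 13.5, 12.9]
g2 = [15.2, 16.8, 14.9, 17.1, 15.5]
g3 = [12.8, 13.1, 14.0, 12.2, 13.7]

# Step 1: Levene's test of equal variance
# (center='median' is the robust Brown-Forsythe variant)
w, p_levene = stats.levene(g1, g2, g3, center='median')
print(f"Levene: W = {w:.3f}, P = {p_levene:.3f}")

# Step 2: Kruskal-Wallis rank test, the nonparametric
# analogue of one-way ANOVA
h, p_kw = stats.kruskal(g1, g2, g3)
print(f"Kruskal-Wallis: H = {h:.3f}, P = {p_kw:.4f}")

As with the SPSS output, a small Levene P-value argues against pooling variances, and a small Kruskal-Wallis P-value provides evidence that at least one group's median differs.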
Summary of Points (measurement)
pg 13 biostatistics: measurement, measurement scales (categorical nominal, categorical ordinal, quantitative), observation, variable, values, quality of measurements, precision, validity.
Chapter 2
pg 21 Surveys and comparative studies. Surveys are used to quantify population characteristics. The population consists of all entities worthy of study. A census is a survey that attempts to collect information on all individuals in the population. Samples - surveys collect data on only a portion, or sample, of the population. Data in a sample save time and money. Sampling is the rule in statistics; rarely are data collected for the entire population.
simple random samples
pg 22 An SRS must be collected in such a way as to allow generalizations to be made to the entire population; it must entail an element of chance. The idea of simple random sampling is to collect data from the population so that each population member has the same probability of being selected into the sample, and the selection of any individual into the sample does not influence the likelihood of selecting any other individual. An SRS can be achieved by placing identifiers for population members in a hat, thoroughly mixing the identifiers, and then blindly drawing entries. In practice, a table of random digits or a software program is used to aid the selection process, as in the sketch below.
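Here is a minimal Python sketch of drawing an SRS by software; the identifier format and seed are our own illustration.

import random

# Hypothetical sampling frame: identifiers for N = 600 population members
frame = [f"ID{i:03d}" for i in range(1, 601)]

random.seed(42)                   # seed fixed only to make the example reproducible
srs = random.sample(frame, k=6)   # each member equally likely; no member can repeat
print(srs)
print("sampling fraction:", 6 / 600)   # 0.01, i.e., 1%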
sampling with replacement and sampling without replacement
pg 24 Sampling a finite population can be done with replacement or without replacement. Sampling with replacement is accomplished by "tossing" selected members back into the mix after they have been selected; in this way, any given unit can appear more than once in a sample. Sampling without replacement is done so that once a population member has been selected, the selected unit is removed from possible future reselection; this too is a legitimate way to select a simple random sample (both approaches are contrasted in the sketch below). The distinction between sampling with replacement and sampling without replacement is of consequence only when more complex sampling designs are used.
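A short Python sketch contrasting the two schemes on a small hypothetical population:

import random

units = list(range(1, 11))   # a small population of N = 10 units

random.seed(7)
with_replacement = random.choices(units, k=6)     # a unit may appear more than once
without_replacement = random.sample(units, k=6)   # each unit appears at most once
print(with_replacement)      # may contain duplicates
print(without_replacement)   # never contains duplicates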
Sampling fraction
pg 24 The ratio of the size of the sample (n) to the population size (N) is the sampling fraction. For example, in selecting n = 6 individuals from a population of N = 600, the sampling fraction = 6/600 = 0.01, or 1%.
Placebo
pg 28
non-experimental studies
pg 28
observational studies
pg 28
random assignment of treatments
pg 35
Ethics
pg 38
data collection form
pg 4 corresponds to observations
Axis multiplier
pg 43
Kurtosis
pg 45
Shape
pg 45
bimodal
pg 45
skewed
pg 45
leptokurtic
pg 45
modality
pg 45
negative skew
pg 45
platykurtic
pg 45
symmetry
pg 45
unimodal
pg 45
outlier
pg 46
location
pg 47
median
pg 48
spread
pg 49
splitting stem values
pg 52
endpoint conventions
pg 61
Construct a frequency table with class intervals
pg 62
Bar charts
pg 63
frequency polygons
pg 63
pie charts
pg 63
Summary points (frequency distributions)
pg 65 frequency distributions, distributional shape, the location, the spread, the stem-and-leaf plot, additional graphical techniques, frequency tables.
Cargo Cult
pg 9 has come to mean a pseudoscience that follows scientific precepts and forms but is missing the honest, self-critical analysis that is essential to scientific investigation
Bias
pg. 10 expresses itself as a tendency to overestimate or underestimate the true value of an object. In practice it is easier to quantify imprecision than bias (pg 11).
Objectivity
pg. 10 the intent to measure things as they are, without shaping them to conform to a preconceived worldview.
Cautions
pg. 24 samples that tend to overrepresent or underrepresent certain segments of the population can bias the results of a survey in favor of certain outcomes. Examples of selection biases we wish to avoid: undercoverage and volunteer bias.
Sample population
pg. 26 identify the source population and sample as specifically as possible. If the information is insufficient, do your best to provide a reasonable description of the population and sample, and then suggest additional person, place, and time characteristics.
comparative studies
pg. 28 the objective is to learn about the relationship between an explanatory variable and a response variable. In experimental studies, the investigator assigns the exposure to one group while leaving the other nonexposed.