Quiz 2 stats analysis


three reasons why multiple t tests are not appropriate.

1. A greater probability of making a type I error (rejecting the null hypothesis when it is really true) exists when one conducts multiple t tests on samples taken from the same population. When a single t test is performed, the findings are compared with the probability of chance. If a confidence level of 95% is set, we are willing to refute chance if the odds are 19 to 1 against it. When multiple t tests are conducted on samples randomly drawn from the same population, the odds of finding the one deviant conclusion that is expected by chance alone increase. When 20 t tests are performed at α = .05 on completely random data, it is expected that one of the tests will be found significant by chance alone (1-to-19 odds). Therefore, if we perform 20 t tests on treated data and one of the 20 tests produces a significant difference, we cannot tell whether it represents a true difference due to treatment or whether it represents the one deviant score out of 20 that is expected by chance alone. If it is due to chance but falsely declared to be significant, a type I error has been made. This dilemma is sometimes referred to as the familywise error rate (the error rate when making a family of comparisons). Keppel (1991, p. 164) reports that the relationship between the single-comparison error rate (α) and the familywise error rate (FWα) is

FWα = 1 − (1 − α)^C, (11.01)

where C is the number of comparisons to be made. If we conduct six t tests, comparing all possible combinations of four groups (A, B, C, and D), at α = .05, then FWα = 1 − (1 − .05)^6 = .26. In this example, conducting multiple t tests raises the probability of a type I error from .05 to .26. Keppel suggests that the familywise error rate may be roughly estimated by the product of the number of comparisons to be made and alpha (C × α). In this example, FWα ≈ 6 × .05 = .30. This method will always overestimate the familywise error rate but is fairly close for small values of the number of comparisons and alpha. Analysis of variance eliminates the problem of familywise errors by making all possible comparisons among the means in a single test. Multiple t tests on samples from the same population may be required on occasion. When this is the case, a commonly used modification of the alpha level called a Bonferroni adjustment is recommended. To perform the adjustment, divide the single-test alpha level by the number of tests to be performed. If five tests are to be made at α = .05, the adjusted alpha level to reject H0 is .05/5 = .01. This is referred to as the per comparison alpha level.

2. The t test does not make use of all available information about the population from which the samples were drawn. The t test is built on the assumption that only two groups have been randomly selected from the population. In the t test, the estimate of the standard error of the difference between means is based on data from two samples only. When three or more samples have been selected, information about the population from three or more samples is available and should be used in the analysis, yet t considers only two samples at a time. Analysis of variance uses all of the available information from all samples simultaneously.

3. Multiple t tests require more time and effort than a simple ANOVA. It is easier, especially with a computer, to conduct one F test than to conduct multiple t tests.

Because of these reasons, analysis of variance is used in place of multiple t tests when three or more groups of data are involved.
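Equation 11.01 and the Bonferroni adjustment are easy to verify numerically. Below is a minimal sketch that reproduces the FWα = .26 result for six comparisons at α = .05 and the per comparison alpha of .01 for five tests; the function names are my own, not from the textbook.

```python
# Sketch: familywise error rate (Keppel, eq. 11.01) and Bonferroni adjustment.
# Function names are illustrative, not from the textbook.

def familywise_alpha(alpha: float, c: int) -> float:
    """FW_alpha = 1 - (1 - alpha)^C for C comparisons."""
    return 1 - (1 - alpha) ** c

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per comparison alpha level: divide the single-test alpha by the number of tests."""
    return alpha / n_tests

print(round(familywise_alpha(0.05, 6), 2))   # 0.26 for six t tests among four groups
print(round(6 * 0.05, 2))                    # 0.30, Keppel's rough (over)estimate
print(bonferroni_alpha(0.05, 5))             # 0.01, adjusted alpha for five tests
```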
Analysis of variance can determine whether a significant difference exists among any of the groups represented in the experiment, but it does not identify the group or groups that differ. A significant F value indicates only that at least one significant difference exists somewhere among the many possible combinations of groups. When a significant F is found, additional post hoc (after the fact) tests must be performed to identify the group or groups that differ. If F is not significant, no further analysis is needed; we know that no significant differences exist among any of the groups.

standard error of difference

A similar formula permits us to calculate the standard error of the difference (SED), the amount of difference between two randomly drawn sample means that may be attributed to chance alone. We can use equation 10.07 to estimate the amount of difference between the two means attributable to chance:
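The equation itself did not survive extraction in these notes. The standard form of the standard error of the difference between two independent sample means, which is presumably what equation 10.07 expresses, is sketched below; treat the exact notation as an assumption.

```python
# Sketch: standard error of the difference between two independent sample means,
# SED = sqrt(s1^2/n1 + s2^2/n2). Assumed to correspond to equation 10.07.
import math

def sed(s1: float, n1: int, s2: float, n2: int) -> float:
    """Amount of difference between two sample means attributable to chance alone."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

print(round(sed(10.0, 25, 12.0, 25), 2))  # hypothetical SDs and group sizes
```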

continued uses of standard error of measurement

A second use of the standard error of measurement that is particularly helpful to practitioners who need to make inferences about individual athletes or patients is the ability to estimate the change in performance, or minimal difference, needed to be considered real (sometimes called the minimal detectable change). This is typical in situations in which the practitioner measures the performance of an individual and then performs some intervention (e.g., exercise program or therapeutic treatment). The test is then given after the intervention, and the practitioner wishes to know whether the person really improved. Suppose that an athlete improved performance on the Wingate test by 100 watts after an 8-week training program. The savvy coach should ask whether an improvement of 100 watts is a real increase in anaerobic fitness or whether a change of 100 watts is within what one might expect simply due to the measurement error of the Wingate test. The minimal difference can be estimated as MD = SEM × 1.96 × √2 (equation 13.10); using the 32.72-watt standard error of measurement from the Wingate example, MD = 32.72 × 1.96 × √2 ≈ 90.7 watts. We would then infer that a change in individual performance would need to be at least 90.7 watts for the practitioner to be confident, at the 95% LOC, that the change in individual performance was a real improvement. In our example, we would be 95% confident that a 100-watt improvement is real because it is more than we would expect just due to the measurement error of the Wingate test. Hopkins (2000) has argued that the 95% LOC is too strict for these types of situations and that a less severe level of confidence should be used. This is easily done by choosing a critical Z score appropriate for the desired level of confidence.

It is not intuitively obvious why the √2 term in equation 13.10 is necessary. That is, one might think that simply using equation 13.09 to construct the true score confidence interval around the preintervention score and then seeing whether the postintervention score is outside that bound would provide the answer we seek. However, this argument ignores the fact that both the preintervention score and the postintervention score are measured with error, and this approach considers only the measurement error in the preintervention score. Because both observed scores were measured with error, simply observing whether the second score falls outside the confidence interval of the first score does not account for both sources of measurement error. We use the √2 term because we want an index of the variability of the difference scores when we calculate the minimal difference. The standard deviation of the difference scores (SDd) provides such an index, and when there are only two measurements, as we have here, SEM = SDd / √2. We can then solve for the standard deviation of the difference scores by multiplying the standard error of measurement by √2. Equation 13.10 can thus be reconceptualized as MD = SDd × Zcrit.

We recommend a three-layered approach to quantifying measurement error. First, a repeated measures ANOVA should be performed to assess the presence of systematic error. If sufficient systematic error is present, the measurement schedule should be re-evaluated and modified by perhaps including practice sessions or modifying rest intervals. If systematic error is present when assessing inter-rater reliability, perhaps one or more of the raters is scoring performance incorrectly. Second, the repeated measures ANOVA provides the necessary mean square terms to calculate the intraclass correlation coefficient of choice.
Third, the square root of the mean square error term can be used to quantify the standard error of measurement.
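A minimal numeric sketch of the minimal difference calculation described above, using the 32.72-watt SEM from the Wingate example (the helper name is mine):

```python
# Sketch: minimal difference (minimal detectable change), MD = SEM * Z_crit * sqrt(2).
import math

def minimal_difference(sem: float, z_crit: float = 1.96) -> float:
    """Smallest individual change exceeding test measurement error at the chosen LOC."""
    return sem * z_crit * math.sqrt(2)

md = minimal_difference(32.72)     # SEM from the Wingate example
print(round(md, 1))                # ~90.7 watts
print(100 > md)                    # True: a 100-watt improvement exceeds the MD
```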

Omega squared

Another common method of determining the size of the effect is omega squared (ω²):

ω² = [SSB − (k − 1)(MSE)] / (SST + MSE), (11.19)

where SSB is the sum of squares between groups, SST is the total sum of squares, MSE is the mean square error, and k is the number of groups. For the data in table 11.5, omega squared is calculated as follows:

ω² = [50.82 − (5 − 1)(1.58)] / (98.19 + 1.58) = .45

Omega squared varies between 0.00 and 1.00 depending on the relative size of the treatment effect.
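A small sketch of equation 11.19 using the table 11.5 values quoted above (the function name is illustrative):

```python
# Sketch: omega squared effect size from a one-way ANOVA summary,
# omega^2 = (SS_between - (k - 1) * MS_error) / (SS_total + MS_error).

def omega_squared(ss_between: float, ss_total: float, ms_error: float, k: int) -> float:
    return (ss_between - (k - 1) * ms_error) / (ss_total + ms_error)

# Values quoted from table 11.5: SS_B = 50.82, SS_T = 98.19, MS_E = 1.58, k = 5 groups.
print(round(omega_squared(50.82, 98.19, 1.58, 5), 2))  # ~0.45
```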

• ANOVA Pg. 178-180

Analysis of variance (ANOVA) is a parametric statistical technique used to determine whether significant differences exist among means from three or more sets of sample data. In a t test, the differences between the means of two groups are compared with the difference expected by chance alone. Analysis of variance examines differences between the means of three or more groups by comparing the variability between the group means (the between-group variability) with the variability of scores within the groups (the within-group variability). This produces a ratio value called F (F = average variance between groups divided by average variance within groups). The symbol for ANOVA (F) is named after the English mathematician Ronald Aylmer Fisher (1890-1962), who first described it (Kotz and Johnson, 1982, p. 103). If the between-group variability exceeds the within-group variability by more than would be expected by chance alone, it may be concluded that at least one of the group means differs significantly from another group mean.

The null hypothesis (H0) for an F test is designated as μ1 = μ2 = μ3 = ... = μk. The null hypothesis assumes that the means of all the samples are equal. Another way to say this is that a single population is the parent of the several random untreated samples drawn from it. Therefore, untreated means of samples randomly drawn from the same population(s) should not differ by more than chance. When at least one sample mean is significantly different from any other, F is significant and we reject the null hypothesis for one or more of the samples.

Like the t test, the theoretical concepts of ANOVA are based on random samples drawn from a parent population and the characteristics of the normal curve. If a large number of samples are randomly drawn from a population, the variance among the scores of all subjects in all groups is the best estimate of the variance of the population. When the variance among all the scores in a data set is known, it may be used to determine the probability that a deviant score is not randomly drawn from the same population. This argument may be expanded to infer that if randomly drawn scores are randomly divided into subgroups, the variance among the subgroup means may be expected to be of the same relative magnitude as the variance among all of the individual scores that comprise the several groups. With untreated data, when only random factors are functioning between the group means and within the scores of the groups, the between-group and within-group variances should be approximately equal. The F ratio, the ratio of the average between-group variance divided by the average within-group variance, is expected to be about 1.00.

When the value of the F ratio exceeds 1.00 by more than would be expected by chance alone, the variance between the means is judged to be significant (i.e., the difference was caused by a factor other than chance) and H0 is rejected. The F ratio is analogous to the t ratio in that both compare the actual, or observed, mean differences with differences expected by chance. When this ratio exceeds the limits of chance at a given level of confidence, chance as a cause of the differences is rejected. In the F test, the actual differences are the variances between the group means, and the expected differences are the variances within the individual scores that make up the groups. The t test is actually a special case of ANOVA with two groups.
Because t uses standard deviations and ANOVA uses variance (V) to evaluate mean differences, and SD² = V, when there are only two groups in ANOVA, t² = F. Analysis of variance is one of the most commonly used statistical techniques in research. But students often ask, Why is this new technique needed when a t test could be used between each of the groups? For example, in a four-group study, why not conduct six t tests—between groups A and B, A and C, A and D, B and C, B and D, and C and D?
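The t² = F relationship for two groups is easy to check with a quick simulation; this sketch uses SciPy, and the data are made up:

```python
# Sketch: with two groups, the independent t test and one-way ANOVA agree (t^2 = F).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, size=20)   # hypothetical scores
group_b = rng.normal(55, 10, size=20)

t, p_t = stats.ttest_ind(group_a, group_b)
f, p_f = stats.f_oneway(group_a, group_b)

print(round(t**2, 4), round(f, 4))      # identical values
print(round(p_t, 4), round(p_f, 4))     # identical p values
```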

• Background of nonparametric statistics Pg. 272

As discussed in chapter 1, data can be classified into four categories: nominal, ordinal, interval, and ratio. When ratio or interval data are collected, analysis by parametric statistical techniques is appropriate. Pearson's correlation coefficient, the t test, and ANOVA in all of its varieties are parametric statistical techniques. But when data are of the nominal or ordinal type, the assumptions of normality are not met and nonparametric statistical techniques must be used. This chapter presents several of the most commonly used nonparametric statistical techniques:

• Chi-square (χ²) compares two or more sets of nominal data that have been arranged into categories by frequency counts.
• Spearman's rank order correlation coefficient determines the relationship between two sets of ordinal data.
• The Mann-Whitney U test determines the significance of the difference between ordinal rankings of two groups of subjects ranked on the same variable. It is similar to an independent t test.
• Kruskal-Wallis ANOVA by ranks compares the ranking of three or more independent groups. It is similar to simple ANOVA.
• Friedman's two-way ANOVA by ranks is similar to repeated measures ANOVA. It determines the significance of the difference between ranks on the same subjects.

Correlation background

Correlation is widely used in kinesiology. For example, biomechanists might calculate the correlation between performance in the power clean (a weight-lifting exercise) and performance in the vertical jump (we might expect that athletes who performed well on the power clean would also perform well on the vertical jump). Exercise physiologists may calculate the correlation between skinfold thicknesses and percent body fat. Correlation is used to quantify the degree of relationship, or association, between two variables. When we calculate a correlation, we get a number (specifically, a numerical coefficient) that indicates the extent to which two variables are related or associated. Technically speaking, correlation is the extent to which the direction and size of deviations from the mean in one variable are related to the direction and size of deviations from the mean in another variable. This technique is referred to as Pearson's product moment correlation coefficient; it is named after Karl Pearson (1857-1936), who developed this concept in 1896 in England (Kotz and Johnson, 1982, p. 199). The coefficient, or number, that represents the correlation will always be between +1.00 and −1.00. A perfect positive correlation (+1.00) would exist if every subject varied an equal distance from the mean in the same direction (measured by a Z score) on two variables. For example, if every subject who was 1 Z score above the mean on variable X was also 1 Z score above the mean on variable Y, and every other subject showed a similar relationship between deviation score on X and deviation score on Y, the resulting correlation would be +1.00. Similarly, if all subjects who were above or below the mean on variable X were an equal distance in the opposite direction from the mean on variable Y, the resulting correlation would be −1.00. If these interrelations are similar but not perfect, then the correlation coefficient is less than +1.00, such as .90 or .80 in the positive case, and greater than −1.00, such as −.90 or −.80 in the negative case. A correlation coefficient of 0.00 means that no relationship exists between the two variables.
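The Z-score definition of Pearson's r given above can be demonstrated directly: r is the average product of paired Z scores. A small sketch with made-up power clean and vertical jump values:

```python
# Sketch: Pearson's r as the mean product of paired Z scores.
import numpy as np

power_clean = np.array([100, 120, 140, 160, 180], dtype=float)   # hypothetical lifts (kg)
vertical_jump = np.array([45, 50, 54, 60, 66], dtype=float)      # hypothetical jumps (cm)

zx = (power_clean - power_clean.mean()) / power_clean.std()
zy = (vertical_jump - vertical_jump.mean()) / vertical_jump.std()

print(round(np.mean(zx * zy), 3))                                # r from the Z-score definition
print(round(np.corrcoef(power_clean, vertical_jump)[0, 1], 3))   # same value from NumPy
```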

Forward selection

Forward selection is a computer-generated approach in which the algorithm starts with no variables in the model and then X variables are added one at a time into the regression equation. To illustrate, consider the data in table 9.1.
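As a rough illustration of the idea (not the book's worked example from table 9.1), forward selection can be run with scikit-learn's SequentialFeatureSelector; the data and variable names below are hypothetical.

```python
# Sketch: forward selection of predictors for a regression equation.
# Assumes scikit-learn is available; the data are made up.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))                                  # four candidate predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=40)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())   # True for the predictors added to the model
```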

o Mann-Whitney U Pg. 279-282

The Mann-Whitney U test is used to determine the significance of the difference between rankings of two groups of subjects who have been ranked on the same variable. The U value indicates whether one group ranks significantly higher than the other. It is the nonparametric equivalent of an independent, two-group t test. It may be used instead of the t test when the assumptions of the t test cannot be met, such as when the data are ordinal or highly skewed. When interval or ratio data are highly skewed, we may want to create one rank order list (on the dependent variable) for all subjects from both groups. For example, the highest scoring person from both groups is ranked 1, the second highest scorer is ranked 2, and so on until all subjects in both groups have been ranked. Then we compare the ranks in group 1 to the ranks in group 2 using the Mann-Whitney U test. All subjects from both groups are ranked on the same variable and placed in order from highest to lowest, and the subjects in group 1 and group 2 are then listed by their ranks. The sums of the ranks for each group are compared to determine whether the median rankings between the groups differ by more than would be expected by chance alone. The formulas in equations 16.05 and 16.06 are modified from Bruning and Kintz (1977, p. 224).

An Example From Motor Learning

A student in a motor learning class was assigned a term project to ascertain whether gymnasts had better balance skills than the general population. The student measured 10 gymnasts and 15 nongymnasts on the Bass stick test for upright balance. The test results of all 25 subjects were then ranked (1 for best, 25 for worst) in a single list. The results are shown in table 16.7. For these data, equations 16.05 and 16.06 give U1 = 102 and U2 = (10)(15) − 102 = 48. Because n1 + n2 ≥ 20, a Z score is used to evaluate the difference:

Z = [U1 − (10)(15)/2] / √[(10)(15)(10 + 15 + 1)/12]

Note: The number 12 is a constant and will always be in the denominator within the square root sign. The number 2 in the numerator represents the number of groups. It does not matter which U value is used to calculate Z. The Z values for U1 and U2 will have the same absolute value: One will be positive and one will be negative. We shall use U1. Because we are testing H0, we use a two-tailed test to interpret Z. For a two-tailed test, we need Z = 1.65 for α = .10, Z = 1.96 for α = .05, and Z = 2.58 for α = .01 (see table A.1 in appendix A). In this problem, Z does not reach the limits for α = .10, so we accept H0 and conclude that no significant difference exists between gymnasts and nongymnasts in balance ability as measured by the Bass stick test.

Comparing Groups With Small Values of N

If n1 + n2 < 20, the Z test may be biased, so a table of U (tables A.13 to A.15 in appendix A) must be used to determine the significance of U. The critical U value is found in the table and compared with the smaller calculated U. If the smaller U value is equal to or less than the table value, the rank difference is significant. In the balance problem, the smaller U (U2) is 48. Table A.13 shows that for α = .10 (a two-tailed test with n1 = 10 and n2 = 15), the smaller U must be 44 or less. Because U2 = 48, the difference between the ranks is not significant. This agrees with our conclusion based on Z.
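For modern work the same test is available directly in SciPy; a minimal sketch with made-up balance scores (not the table 16.7 data):

```python
# Sketch: Mann-Whitney U test comparing two independent groups of ordinal/skewed scores.
from scipy import stats

gymnasts = [58, 55, 52, 50, 49, 47, 45, 44, 40, 38]   # hypothetical balance scores
nongymnasts = [46, 43, 42, 41, 39, 37, 36, 35, 33, 32, 31, 30, 29, 28, 27]

u_stat, p_value = stats.mannwhitneyu(gymnasts, nongymnasts, alternative="two-sided")
print(u_stat, round(p_value, 4))   # reject H0 if p is below the chosen alpha
```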

• Standard error of measurement Pg. 222-226

The intraclass correlation coefficient provides an estimate of the relative error of the measurement; that is, it is unitless and is sensitive to the between-subjects variability. Because the general form of the intraclass correlation coefficient is a ratio of variabilities (see equation 13.04), it is reflective of the ability of a test to differentiate between subjects. It is useful for assessing sample size and statistical power and for estimating the degree of correlation attenuation. As such, the intraclass correlation coefficient is helpful to researchers when assessing the utility of a test for use in a study involving multiple subjects. However, it is not particularly informative for practitioners such as clinicians, coaches, and educators who wish to make inferences about individuals from a test result. For practitioners, a more useful tool is the standard error of measurement (SEM; not to be confused with the standard error of the mean). The standard error of measurement is an absolute estimate of the reliability of a test, meaning that it has the units of the test being evaluated, and is not sensitive to the between-subjects variability of the data. Further, the standard error of measurement is an index of the precision of the test, or the trial-to-trial noise of the test. Standard error of measurement can be estimated with two common formulas. The first is SEM = SD√(1 − ICC) (13.06), where ICC is the intraclass correlation coefficient as described previously and SD is the standard deviation of all the scores about the grand mean. The standard deviation can be calculated quickly from the repeated measures ANOVA as SD = √[SStotal / (N − 1)] (13.07), where N is the total number of scores. Because the intraclass correlation coefficient can be calculated in multiple ways and is sensitive to between-subjects variability, the standard error of measurement calculated using equation 13.06 will vary with these factors. To illustrate, we use the example data presented in table 13.5 and the ANOVA summary from table 13.6. First, the standard deviation is calculated from equation 13.07. Notice that the standard error of measurement value can vary markedly depending on the magnitude of the intraclass correlation coefficient used. Also, note that the higher the intraclass correlation coefficient, the smaller the standard error of measurement. This should be expected because a reliable test should have a high reliability coefficient, and we would further expect that a reliable test would have little trial-to-trial noise and therefore the standard error should be small. However, the large differences between standard error of measurement estimates depending on which intraclass correlation coefficient value is used are a bit unsatisfactory.
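A small sketch of equation 13.06 and the ICC dependence it implies; the grand-mean SD is a placeholder, not the table 13.5 value.

```python
# Sketch: SEM = SD * sqrt(1 - ICC)  (equation 13.06).
import math

def sem_from_icc(sd: float, icc: float) -> float:
    return sd * math.sqrt(1 - icc)

sd_grand = 60.0                  # hypothetical SD of all scores about the grand mean (watts)
for icc in (0.24, 0.37, 0.75):   # ICC values like those quoted later in these notes
    print(icc, round(sem_from_icc(sd_grand, icc), 1))
# Higher ICC -> smaller SEM, and the SEM estimate shifts with the ICC model chosen.
```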

o Kruskal-Wallis Pg. 282-284

When data are ranked and there are more than two groups, a nonparametric procedure analogous to simple ANOVA is available, called the Kruskal-Wallis ANOVA for ranked data. This procedure produces an H value that, when N > 5, approximates the chi-square distribution. Once we have calculated H, we can determine its significance by using the chi-square table A.11 in appendix A for df = k − 1, where k is the number of groups to be ranked.

An Example From Athletic Training

Athletic trainers and coaches want to return athletes to competitive condition as soon as possible after a debilitating injury. Anterior cruciate ligament (ACL) tears repaired with surgery require extensive rehabilitation. An athletic trainer wanted to know whether accelerated rehabilitation (closed kinetic chain activities using weight-bearing exercises) was superior to normal rehabilitation activities (knee extension and flexion exercises) as compared with no special rehabilitation exercises (control). Eighteen subjects, each of whom had undergone recent ACL reconstruction with the patellar tendon replacement technique, were selected and divided into three groups: control, normal, and accelerated. After 6 months, three orthopedic physicians evaluated each subject and jointly ranked all 18 according to their level of rehabilitation. The rankings, classified according to the type of rehabilitation technique, are shown in table 16.8. Ties are given the average of the two tied ranks.

Clearly, differences exist in the sums and the means. The question we must ask is, Are the differences large enough to be attributed to the treatment effects, or are they chance differences that we would expect to occur even if the treatments had no effect? To solve this problem we apply the formula for the Kruskal-Wallis H value, H = [12 / (N(N + 1))] Σ(T²/n) − 3(N + 1), where N is the total of all subjects in all groups, n is the number of subjects per group, T is the sum of ranks in each group, and k is the number of groups. From the chi-square table A.11, the critical value for α = .05 for df = 3 − 1 = 2 is 5.991. Because our obtained value of 6.77 exceeds 5.991, we reject the null hypothesis and conclude that significant differences exist somewhere among the three groups.

The formula for H assumes that no ties have occurred. If more than a few ties occur, a correction for H has been suggested by Spence and colleagues (1968, p. 217):

HC = H / [1 − Σ(t³ − t) / (N³ − N)]

In our example, with t1 = 2 and t2 = 2 (two sets of tied ranks),

HC = 6.77 / {1 − [(2³ − 2) + (2³ − 2)] / (18³ − 18)} = 6.77 / .9979 = 6.78.

There is usually little practical value in calculating HC unless the number of ties is large and the value of H is close to the critical value; the correction formula rarely changes the conclusion.

To differentiate among the groups, we can calculate the standard error (SE) of the difference for any two groups using a procedure suggested by Thomas and Nelson (2001, p. 205): SE = √[(N(N + 1)/12)(1/n1 + 1/n2)] = √[(18 × 19/12)(1/6 + 1/6)] = 3.08. Because we are making three comparisons, we must use a Bonferroni adjustment (i.e., divide by 3) of our rejection α value of .05 (.05/3 = .017). Now we look for pairwise differences at α = .017. This protects against type I errors. For a two-tailed test at α = .05, we expect 2.5% of the area under the normal curve to be in each tail. Using the Bonferroni correction to the p value for three comparisons results in 2.5/3 = .83% at each end of the curve. This leaves 50 − .83, or 49.17%, of the curve between the mean and the critical value. Using table A.1, we note that the Z score for 49.17% under the curve is ±2.39. Multiplying our standard error value by this Z (3.08 × ±2.39 = ±7.36) gives us the critical value for pairwise comparisons at α = .017. To apply this value, it is helpful to create a mean difference table (see table 16.9). Therefore, we conclude that the accelerated group is significantly different from the control group at p < .017, but the normal group is not different from either the control or the accelerated group.
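The same omnibus analysis can be run directly with SciPy; the sketch below uses made-up ranks of 18 subjects, not the table 16.8 data:

```python
# Sketch: Kruskal-Wallis H test across three independent groups of ranked data.
from scipy import stats

control = [18, 17, 16, 14, 12, 11]     # hypothetical rehabilitation ranks
normal = [15, 13, 10, 9, 8, 6]
accelerated = [7, 5, 4, 3, 2, 1]

h_stat, p_value = stats.kruskal(control, normal, accelerated)
print(round(h_stat, 2), round(p_value, 4))   # compare p with alpha, or H with the chi-square critical value
```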

Models of intraclass correlation

It is useful to know what the different formulas convey and how to interpret the values. First, note that each model is identified using two terms separated by a comma. The number before the comma is used to identify the general model (model 1, 2, or 3). The model 1 equations are sometimes referred to as one-way models because they lump the trials and error components together into one component called within, as described previously. Models 2 and 3 are sometimes referred to as two-way models because they separate the trials and error terms. The term after the comma denotes whether the intraclass correlation coefficient will be based on single measurements or mean values. In our reliability study example, each subject has three Wingate scores. If the intent is to assess the reliability of Wingate testing when in practice only one Wingate test will be administered in the future, then use the model of choice denoted with a 1 after the comma. If in practice the Wingate test will be administered three times and the average of the three will be used as the criterion score in future studies, then use the model of choice denoted with k, where k is the number of trials. For example, if using model 3 with the intent of assessing the reliability of the average of the three Wingate tests, then one could describe the intraclass correlation coefficient as 3,3.

The choice of which model to use is not always straightforward. With respect to model 1 (the one-way model), as noted previously this model does not allow for the partitioning of error into separate random and systematic components. However, if one is conducting a rater reliability study, in the one-way model the raters are not crossed with subjects, meaning that it can accommodate data where not all subjects are assessed by all raters (Weir, 2005). It is our experience that this situation is rare in kinesiological studies and that all subjects are typically tested by all raters (or tested at each time period in a test-retest study). With respect to model 2 or model 3, the primary distinction centers on whether the intraclass correlation coefficient will include both systematic and random error or will only include random error. Model 2 includes systematic error, whereas model 3 includes only random error.

In the example data of table 13.1, the mean differences are small; therefore, little systematic error is present. Consequently, the differences between intraclass correlation coefficient values are small. Assume that the intent of the analysis is to assess the reliability of a single Wingate test so we can consider the following intraclass correlation coefficients: ICC (1,1) = .77, ICC (2,1) = .72, and ICC (3,1) = .71. Notice that these values are similar to each other, which reflects the fact that because systematic error is small, the mean square within from the one-way model and the mean square error from the two-way models (reflecting random error) are similar (991.15 and 1,070.28, respectively).

Table 13.5 includes another example data set in which we have modified the data in table 13.1 to contain systematic error by adding 50 to 100 watts to each score on trials 2 and 3. Table 13.6 contains the resulting ANOVA summary table from the modified data set. Notice that the mean values from trial 2 and trial 3 are both about 100 watts higher than the mean values from trial 1, suggesting the presence of systematic error. Further, the F for trials is now significant [F(2,18) = 41.44; p < .0001].
With this added systematic error, the intraclass correlation coefficient values are now ICC (1,1) = .24, ICC (2,1) = .37, and ICC (3,1) = .75 (calculations not shown). Notice that models 1,1 and 2,1 are both markedly depressed with the addition of systematic error, whereas model 3,1 improved slightly. These results reflect the influence of systematic error on models 1 and 2, whereas model 3 is reflective of just random error. Given the data in table 13.5, the researcher should consider the source of the systematic error. In this example, it appears that subjects markedly improved from trial 1 to trial 2 but likely reached a plateau in performance from trial 2 to trial 3. It might make sense to include a practice session to wash out practice effects. Indeed, performing intraclass correlation coefficient calculations on just trials 2 and 3 (the plateau region) results in the following values: ICC (1,1) = .78, ICC (2,1) = .70, and ICC (3,1) = .69
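For reference, the single-measure ICC models discussed above can be computed from the repeated measures ANOVA mean squares. The sketch below follows the forms summarized in Weir (2005); the example mean squares are hypothetical, not the table values.

```python
# Sketch: single-measure intraclass correlation coefficients from ANOVA mean squares
# (forms as summarized in Weir, 2005). Example values are illustrative only.

def icc_1_1(ms_between, ms_within, k):
    """One-way model: trials and error lumped into a single within term."""
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def icc_2_1(ms_between, ms_error, ms_trials, k, n):
    """Two-way model including systematic (trials) error."""
    return (ms_between - ms_error) / (
        ms_between + (k - 1) * ms_error + k * (ms_trials - ms_error) / n
    )

def icc_3_1(ms_between, ms_error, k):
    """Two-way model reflecting random error only."""
    return (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)

# Hypothetical mean squares (not the table 13.2/13.6 values), k = 3 trials, n = 10 subjects:
print(round(icc_3_1(ms_between=12000.0, ms_error=1070.28, k=3), 2))
```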

Alternative approaches to estimating standard error of measurement

An alternative is to calculate the standard error of measurement directly from the repeated measures ANOVA as SEM = √MSE (13.08), where MSE is the mean square error term from the repeated measures ANOVA. From table 13.6, MSE = 1,070.28, so SEM = √1,070.28 = 32.72 watts. This standard error of measurement value does not vary depending on the intraclass correlation coefficient model used because the mean square error is constant for a given set of data. Further, the standard error of measurement from equation 13.08 is not sensitive to the between-subjects variability. To illustrate, recall that the data in table 13.7 were created by modifying the data in table 13.1 such that the between-subjects variability was increased (larger standard deviations) but the means were unchanged. The mean square error term for the data in table 13.1 (see table 13.2, MSE = 1,070.28) was unchanged with the addition of between-subjects variability (see table 13.8). Therefore, the standard error of measurement values for both data sets are identical when using equation 13.08: SEM = √1,070.28 = 32.72 watts.
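A one-line sketch of equation 13.08 with the mean square error quoted above:

```python
# Sketch: SEM = sqrt(MS_error) from the repeated measures ANOVA (equation 13.08).
import math

ms_error = 1070.28      # mean square error quoted in these notes (watts^2)
sem = math.sqrt(ms_error)
print(round(sem, 2))    # ~32.72 watts, unchanged by between-subjects variability
```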

singularity

Singularity means two or more independent variables are perfectly related to each other (r = 1.00). This may occur if one variable is created from another by a mathematical manipulation such as squaring, taking the square root, or adding, subtracting, multiplying, or dividing by a constant. Most advanced computer programs screen for multicollinearity and singularity and warn the user in the printout if these relationships are detected by producing the squared multiple correlation values for all variables. See Tabachnick and Fidell (1996, p. 84) for further discussion of this issue.

• Tolerance Pg. 143

The denominator of the VIF equation is referred to as tolerance, so that VIF is the reciprocal of tolerance. Some references and software packages evaluate multicollinearity using tolerance and others use VIF, but they convey the same information.

negative correlations - correlation background continued

Negative correlations result when scores on one variable tend to be high numbers and scores on a second variable tend to be low numbers. For example, the relationship between distance scores and time scores is almost always negative because greater distance (e.g., a far long jump) is associated with faster sprinters (e.g., low time scores). In adults, as people get older in years, their muscular strength tends to decrease; therefore, a negative relationship exists between age and strength.

• Multicollinearity Pg. 143

A potential problem in multiple regression analysis occurs if the independent variables in the regression equation are correlated with each other. This is referred to as multicollinearity. Further, in nonexperimental studies, it is quite common for the independent variables to be intercorrelated. We noted earlier the intercorrelations between the independent variables in table 9.2. For example, fat-free weight and weight are correlated with each other at r = .984. Multicollinearity leads to two related problems. First, high multicollinearity widens the confidence intervals around the slope coefficients. In other words, the ability to detect the statistical significance of an independent variable is compromised. Second, the wide confidence intervals mean that the slope coefficients are unstable. That is, under conditions of multicollinearity the magnitude of the slope coefficients can change a lot when, for example, another independent variable is added to the equation (sometimes even changing the sign of the slope coefficient). To illustrate, we have forced a third variable, fat-free weight, into the equation (even though it does not significantly increase the R² of the equation). The resulting equation is

isokinetic torque = −107.02 + 0.85(weight) + 8.34(age) + 0.76(fat-free weight). (9.04)

Compare the slope coefficient for weight in the two-variable model (b = 1.45) with the slope coefficient for weight in the three-variable model (b = 0.85). The magnitude of the slope coefficient has decreased by about 42%.

continued factors affecting intraclass correlation

Another factor that markedly affects intraclass correlation coefficient values is the amount of between-subjects variability. In general, the more heterogeneous the subjects, the higher the intraclass correlation coefficients, and the more homogeneous the subjects, the lower the intraclass correlation coefficients. To illustrate, the data in table 13.7 have been modified from the data in table 13.1. Specifically, we added between-subjects variability to the data by adding or subtracting constants (100-200 watts) to all the scores from selected subjects. The ANOVA summary table for the modified data is presented in table 13.8. First, notice that the mean values do not differ for the data in tables 13.7 and 13.1. However, the standard deviations are larger, reflecting the added between-subjects variability. Second, the effect for trials is the same in tables 13.8 and 13.2. This should be expected because, as addressed in chapter 12, in a repeated measures ANOVA the between-subjects variance does not influence the error term. The intraclass correlation coefficients for the data in table 13.7 are ICC (1,1) = .95, ICC (2,1) = .95, and ICC (3,1) = .95 (calculations not shown). By adding between-subjects variability but keeping the mean values the same, the intraclass correlation coefficient values have markedly increased relative to the results from the analysis of the data in table 13.1. The sensitivity of the intraclass correlation coefficient calculations to the between-subjects variability illustrates that intraclass correlation coefficient calculations are useful for quantifying the relative amount of measurement error.

factors power is dependent on

As figure 10.1 shows, power is dependent on four factors:

1. The Zα level set by the researcher (the level set to protect against type I errors; α = .05, α = .01, and so on). It is represented by a Zα value from the normal distribution table (table A.1) [Zα(.10) = 1.65, Zα(.01) = 2.58].
2. The difference (∆) between the two mean values being compared (∆ = X̄1 − X̄2, where X̄1 is the mean of the control group and X̄2 is the mean of the experimental group).
3. The standard deviations of the two groups, which determine the spread of the curves.
4. The sample size, N, of each of the two groups.

Only N and Zα are under the control of the researcher, and Zα usually cannot be radically manipulated because of the need to protect against type I errors. Therefore, the researcher can control power primarily by manipulating the size of N.

interpreting standard error of measurement

As noted previously, the standard error of measurement differs from the intraclass correlation coefficient in that the standard error of measurement is an absolute index of reliability and indicates the precision of a test. The standard error of measurement reflects the consistency of scores within individual subjects. Further, unlike the intraclass correlation coefficient, it is largely independent of the population from which the results are calculated. That is, it is argued to reflect an inherent characteristic of the test, irrespective of the subjects from which the data were derived. The standard error of measurement also has some uses that are especially helpful to practitioners such as clinicians and coaches. First, it can be used to construct a confidence interval about the test score of an individual. This confidence interval allows the practitioner to estimate the boundaries of an individual's true score. The general form of this confidence interval calculation is T = S ± (SEM × Zcrit) (13.09), where T is the subject's true score, S is the subject's observed score on the test, and Zcrit is the critical Z score for a desired level of confidence (e.g., Z = 1.96 for a 95% CI). Suppose that a subject's observed score on the Wingate test is 850 watts. Because all observed scores include some error, we know that 850 watts is not likely the subject's true score. Assume that the data in table 13.7 and the associated ANOVA summary in table 13.8 are applicable, so that the standard error of measurement for the Wingate test is 32.72 watts as shown previously. Therefore, we would infer that the subject's true score is somewhere between approximately 785.9 and 914.1 watts (with a 95% LOC). This process can be repeated for any subsequent individual who performs the test. It should be noted that the process described using equation 13.09 is not strictly correct, and a more complicated procedure can give a more accurate confidence interval. For more information, see Weir (2005). However, for most applications the improved accuracy is not worth the added computational complexity.
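A quick sketch of the true score confidence interval described above, using the quoted 32.72-watt SEM and 850-watt observed score:

```python
# Sketch: true score confidence interval, observed score +/- Z_crit * SEM (equation 13.09).
sem = 32.72          # watts, from the Wingate example
observed = 850.0     # watts
z_crit = 1.96        # 95% level of confidence

lower, upper = observed - z_crit * sem, observed + z_crit * sem
print(round(lower, 1), round(upper, 1))   # ~785.9 to ~914.1 watts
```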

• ANOVA post-hoc Pg. 187, 190

As we discussed earlier, a significant F alone does not specify which groups differ from one another. It indicates only that differences exist somewhere among the groups. To identify the groups that differ significantly from one another, a post hoc test must be performed. A post hoc test is similar to a t test, except post hoc tests have a correction for familywise alpha errors built into them. Some post hoc tests are more conservative than others. Conservative means that the tests are less powerful because they require larger mean differences before significance can be declared. Conservative tests offer greater protection against type I errors, but they are more susceptible to type II errors. Several post hoc tests may be applied to determine the location of group differences after a significant F has been found. Two of the most commonly used tests, Scheffé's confidence interval (I) and Tukey's honestly significant difference (HSD), are described here. Scheffé's permits all possible comparisons, whereas Tukey's permits only pairwise comparisons.

how correlation is useful - correlation background continued

Correlation is a useful tool in many types of research. In evaluating testing procedures, correlation is often used to help determine the validity of the measurement instruments by comparing a test of unknown validity with a test of known validity. It is sometimes used to measure reliability by comparing test-retest measures on a group of subjects to determine consistency of performance. However, it is not sensitive to changes in mean values from pre- to posttest. If all subjects improve from pre- to posttest by exactly the same amount, the mean increases but the correlation remains the same. (Procedures for quantifying reliability are addressed in chapter 13.) It may also be used as a tool for prediction. When the correlation coefficient between two variables is known, scores on the second variable can be predicted based on scores from the first variable.

Although a correlation coefficient indicates the amount of relationship between two variables, it does not indicate the cause of that relationship. Just because one variable is related to another does not mean that changes in one will cause changes in the other. Other variables may be acting on one or both of the related variables and affecting them in the same direction. For example, a positive relationship exists between IQ score and collegiate grade point average, but a high IQ alone will not always result in a high grade point average. Other variables—motivation, study habits, financial support, study time, parental and peer pressure, and instructor skill—affect grades. The IQ score may account for some of the variability in grades (explained variance), but it does not account for all of the variability (unexplained variance). Therefore, although a relationship exists between IQ and grades, it cannot be said that grades are a function of only IQ. Other unmeasured variables may influence the relationship between IQ and grades, and this unexplained variance is not represented by the single correlation coefficient between IQ and grades. Cause and effect may be present, but correlation does not prove causation. Changes in one variable will not automatically result in changes in a correlated variable, although this may happen in some cases. The length of a person's pants and the length of his or her legs are positively correlated. People with longer legs have longer pants. But increasing one's pant length will not lengthen one's legs!

Assumptions in repeated measures ANOVA

Except for independence of samples, the assumptions for simple ANOVA (between-subjects designs) discussed in chapter 11 also hold true for repeated measures ANOVA (within-subjects designs). But with repeated measures designs, we must also consider the relationships among the repeated measures. Specifically, repeated measures ANOVA must meet the additional assumption of sphericity. A detailed explanation of sphericity, and a related assumption referred to as compound symmetry, is beyond the scope of this book. However, a simple way to conceptualize sphericity is to consider a study in which subjects are measured at three time periods: time 1, time 2, and time 3. We can also calculate the difference scores between each time period: time 1 − time 2, time 2 − time 3, and time 3 − time 1. Sphericity requires that the variances of the difference scores are all equal. Violations of sphericity will inflate the type I error rate so that if α is set at, for example, .05, the true risk of committing a type I error will be higher than .05. When only two repeated measures are used (such as pre-post measures for a dependent t test), this assumption is not applicable because only one set of differences can be calculated. Methods of dealing with violations of sphericity are presented later in this chapter.

• Variance inflation factor Pg. 143

How much intercorrelation is too much? Unfortunately, no hard and fast rules exist. However, we can calculate some indices that help us quantify the amount of multicollinearity in the data. One such index is called the variance inflation factor (VIF). The idea behind VIF is to calculate the R² value between an independent variable and the rest of the independent variables in the equation. That is, we treat the independent variable as a dependent variable in a separate multiple regression analysis. Here, when we regress weight on age and fat-free weight, R² = .98. The equation for VIF is

VIF = 1 / (1 − R²)

where R² refers to the squared multiple correlation coefficient from regressing one independent variable against the other independent variables. The larger the VIF, the greater the degree of multicollinearity in the data. Variance inflation factor values that exceed 10 should lead investigators to suspect that multicollinearity may be a problem. For the previous example, VIF = 1/0.02 = 50, which clearly indicates problematic multicollinearity. The denominator of the VIF equation is referred to as tolerance, so that VIF is the reciprocal of tolerance. Some references and software packages evaluate multicollinearity using tolerance and others use VIF, but they convey the same information.
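A sketch of the VIF calculation using statsmodels' helper; the data below are simulated to be nearly collinear and are not the table 9.2 values.

```python
# Sketch: variance inflation factors for the independent variables in a regression design.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
weight = rng.normal(75, 10, size=50)                       # hypothetical predictors
fat_free_weight = 0.8 * weight + rng.normal(0, 1, 50)      # nearly collinear with weight
age = rng.normal(25, 4, size=50)

X = sm.add_constant(np.column_stack([weight, age, fat_free_weight]))
for i, name in enumerate(["weight", "age", "fat_free_weight"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))  # VIF > 10 flags multicollinearity
```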

Heteroscedasticity

If the variance of the residuals is not constant, the data are said to exhibit heteroscedasticity. To illustrate, figure 8.8 shows a simulation where one-repetition-maximum squat performance is regressed against body weight. Notice that the scatter of the data about the regression line appears to increase as body weight increases. In other words, absolute errors in the prediction of squat strength increase with higher body weights. A residuals plot as shown in figure 8.9 shows this more clearly. The plot of the residuals exhibits a fan pattern. The effect of heteroscedasticity is to inflate type I error, so tests of statistical significance may be invalid. However, the calculation of the slope and Y-intercept estimates using regression still result in the best linear estimates for these parameters. Both Kachigan (1986) and Tabachnick and Fidell (1996) discuss this condition in more detail. Several computational approaches can be used to test for heteroscedasticity beyond just a visual examination of the residuals. In many cases, heteroscedasticity can be corrected by transforming the data (e.g., converting the scores to logarithms).
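One such computational approach is the Breusch-Pagan test; a hedged sketch with simulated squat data (not the figure 8.8 data) follows.

```python
# Sketch: Breusch-Pagan test for heteroscedasticity of regression residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
body_weight = rng.uniform(55, 120, size=80)    # kg, hypothetical
noise = rng.normal(0, 0.5 * body_weight)       # error spread grows with body weight
squat_1rm = 1.6 * body_weight + noise

X = sm.add_constant(body_weight)
fit = sm.OLS(squat_1rm, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(round(lm_pvalue, 4))   # small p value -> evidence of heteroscedasticity
```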

• Repeated-measures ANOVA Pg. 200-202

One of the most common research designs in kinesiology involves measuring subjects before treatment (pretest) and then after treatment (posttest). A dependent t test with matched or correlated samples is used to analyze such data because the same subjects are measured twice (repeated measures). In ANOVA, this type of design is referred to as a within-subjects design (in contrast to the simple ANOVA presented in chapter 11, which is sometimes referred to as a between-subjects design or between-subjects ANOVA). When three or more tests are given—for example, if a pre-, mid-, and posttest are given with treatment before and after the midtest—ANOVA with repeated measures is needed to properly analyze the differences among the three tests.

The simple ANOVA described in chapter 11 assumes that the mean values are taken from independent groups that have no relationship to each other. In this independent group design, total variability is the sum of

• variability between people in the different groups (interindividual variability),
• variability within a person's scores (intraindividual variability),
• variability between groups due to treatment effects, and
• variability due to error (variability that is unexplained).

However, in that design we are unable to tease apart how much variability is due to interindividual variability, intraindividual variability, and unexplained variability, so all these sources are lumped together as error. In contrast, when only one group of subjects is measured more than once, the data sets are dependent. The total variability for a single group of subjects measured more than once is expected to be less than if the scores came from different groups of people (that is, if the scores were independent) because interindividual variability has been eliminated by using a single group. This tends to reduce the mean square error term in the denominator of F in a manner similar to the correction made to the standard error of the difference in the dependent t test (equation 10.13, p. 161). In the repeated measures design, the subjects serve as their own control. If all other relevant factors have been controlled, any differences observed between the means must be due to (a) the treatment, (b) variations within the subjects (intraindividual variability), or (c) error (unexplained variability). Variability between subjects (interindividual variability) is no longer a factor.
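For reference, a within-subjects (repeated measures) ANOVA can be run with statsmodels; this sketch uses fabricated pre/mid/post scores, not data from the text.

```python
# Sketch: one-group repeated measures ANOVA (within-subjects design).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(4)
subjects = np.repeat(np.arange(1, 11), 3)                        # 10 subjects, 3 time points
time = np.tile(["pre", "mid", "post"], 10)
scores = rng.normal(100, 10, size=30) + np.tile([0, 5, 8], 10)   # hypothetical training gains

df = pd.DataFrame({"subject": subjects, "time": time, "score": scores})
print(AnovaRM(df, depvar="score", subject="subject", within=["time"]).fit())
```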

positive correlations - correlation background continued

Positive correlations result when subjects who receive high numerical scores on one variable also receive high numerical scores on another variable. For example, a positive correlation exists between the number of free throws taken in practice and the percentage of free throws made in games over the season. Players who spend lots of practice time on free throws tend to make a high percentage of free throws in games.

determining power and sample size

Power is the ability of a test to detect a real effect in a population based on a sample taken from that population. In other words, power is the probability of correctly rejecting the null hypothesis when it is false. Generally speaking, as sample size increases, power will increase. Using the concepts and formulas presented in this section, it is possible to calculate how large N must be in a sample to reach a given level of power (i.e., 80% power, or an 80% probability that a real effect in the population will be detected). See Tran (1997) for an excellent discussion of the importance of calculating power. Because the critical t values are lower for a one-tailed test than for a two-tailed test (see table A.3), the one-tailed test is considered more powerful at a given α value. Also, a value of α = .05 is more powerful than α = .01 because the t value does not need to be as large to reach the critical ratio.

In figure 10.1, the range of the control group represents all the possible values for the mean of the population from which a random sample was taken. The most likely value is at the center of the curve, and the least likely values are at the extremes. The range of the experimental group represents all possible values of the population from which it was taken after treatment has been applied. The alpha point (Zα) on the control curve is the point at which a null hypothesis is rejected for a given mean value in the experimental group. Any value for the mean of the experimental group that lies to the right of Zα (the 1 − β area) will be judged significantly different from the mean of the control group at p < .05 (Z = 1.96). If H0 is really true, this represents a type I error. Conversely, any value for the mean of the experimental group that lies to the left of Zα (the beta area) will be judged to be not significantly different from the mean of the control group. If H0 is false, this represents a type II error.

It then follows that the area of 1 − β in the experimental group is the area of power, the area where a false null hypothesis will be correctly rejected. This area represents all of the possible values for the mean of the experimental population that fall beyond the Zα level of the control population. Power is calculated by determining Zβ, converting it to a percentile using table A.1 in appendix A, and adding this percent of area to the 50% of the curve to the right of the experimental mean. In figure 10.1, 1 − β represents 70.5% of all the possible mean values for the experimental group. A 70.5% chance exists that a false null hypothesis will be rejected; power = 70.5%. Let's consider how power is calculated.
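A minimal sketch of that calculation for a two-group comparison under normal (Z) theory; the numbers are placeholders, not the figure 10.1 values.

```python
# Sketch: power of a two-group comparison, 1 - beta, under normal (Z) theory.
import math
from scipy.stats import norm

def power_two_group(delta: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 given a true mean difference delta (two-tailed)."""
    se_diff = sd * math.sqrt(2 / n)      # standard error of the difference
    z_alpha = norm.ppf(1 - alpha / 2)    # e.g., 1.96 for alpha = .05
    z_beta = delta / se_diff - z_alpha
    return norm.cdf(z_beta)

print(round(power_two_group(delta=5.0, sd=10.0, n=30), 3))   # hypothetical design
```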

• Independent and dependent t-test background Pg. 150

Recall from chapter 7 that a sample mean may be used as an estimate of a population mean. Remember also that we can determine the odds, or probability, that the population mean lies within certain numerical limits by using the sample mean as a predictor. This same technique can be used in reverse to determine whether a given sample is likely to have been randomly selected from a specific population. If the population mean is known or assumed to be a certain value, and if the sample mean is not close enough to the population mean to fall within the limits set by a selected level of confidence, then one of the following conclusions must be true: (a) the sample was not randomly drawn from that population, or (b) the sample was drawn from the population, but it has been modified so that it is no longer representative of the population from which it was originally drawn. Using similar logic, we can make conclusions about two sets of data. If we draw two samples from the same population, and the means of these samples differ by amounts larger than would be expected based on normal distributions, one of the following conclusions must be true: (a) one or both of the samples were not randomly drawn from the population, or (b) some factor has affected one or both samples, causing them to deviate from the population from which they were originally drawn. When a sample is drawn from a population with a known or estimated mean (μ) and standard deviation (σ), the probability (or odds) that the mean of a randomly drawn sample (X̄) will lie within certain limits of μ can be determined. To ascertain the probability that a given sample came from a certain population, the value of the standard error of the mean must be calculated by one of the available formulas.
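Once the standard error of the mean is in hand, the comparison described above can be made with a one-sample test; a sketch with made-up values (the formulas from the text are not reproduced here):

```python
# Sketch: does a sample plausibly come from a population with a known mean?
# Uses the standard error of the mean and a one-sample t test.
import numpy as np
from scipy import stats

population_mean = 100.0
sample = np.array([104, 110, 98, 107, 112, 101, 109, 105, 99, 111], dtype=float)  # hypothetical

sem_mean = sample.std(ddof=1) / np.sqrt(len(sample))      # standard error of the mean
t_stat, p_value = stats.ttest_1samp(sample, population_mean)
print(round(sem_mean, 2), round(t_stat, 2), round(p_value, 4))
```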

• Homoscedasticity

Recall from chapter 7 that an assumption of parametric inferential statistics is that the data are normally distributed. A further assumption of regression analysis is called homoscedasticity. If data exhibit homoscedasticity, this means that the variance of the residuals does not vary (Lewis-Beck, 1980). To illustrate, figure 8.7 shows a plot of the residuals (Y-axis) from table 8.3 versus the number of push-ups. An examination of the residuals in figure 8.7 shows no apparent relationship between the size of the residuals and the magnitude of the independent variable.

Methods of multiple regression

Two approaches exist for generating multiple regression equations. In one approach, the investigator uses a computer algorithm to generate the equation. Common techniques include procedures called forward selection, backward elimination, and stepwise. In all of these approaches, the investigator records the scores on the independent and dependent variables from a representative sample from the population of interest. For example, investigators have developed regression equations to predict percent fat of high school wrestlers from skinfold thickness of various body sites. These are used to determine acceptable weight classes for the competitors (Thorland et al., 1991). That is, the equations are used to put limits on how much weight the wrestlers can lose based on their estimated body fat percentages. The population here is all high school wrestlers. The dependent variable (Y) is the body fat percentage from hydrostatic weighing and the independent variables are the various skinfold thicknesses. All the data are entered into the computer data file and, using the particular algorithm chosen by the investigator, the computer builds the prediction equation. In the other approach, the investigator specifically tells the computer how to construct the regression equation. Hierarchical multiple regression is the process in which we set up a hierarchical order for inclusion of the independent variables. This may be useful if some of the independent variables are easier to measure than others or if some are more acceptable to use than others. Hierarchical approaches are also used if the investigator is examining a specific model. This model building approach is typically used in situations in which the investigator is not interested in prediction per se, but is building a statistical model to test a substantive hypothesis. In the hierarchical approach, the investigator specifies to the computer the order in which to enter variables into the equation.

• Assumptions of t-test Pg. 153

Several assumptions must be met for the t test to be properly applied. If these assumptions are not met, the results may not be valid. When the investigator knows that one or more of these criteria are not met, a more conservative level (i.e., α = .01 rather than α = .05) should be selected to avoid errors. This allows us to be confident of the conclusions and helps to compensate for the fact that all assumptions were not met. The t test is quite robust; it produces reasonably reliable results, even if the assumptions are not met totally. The t test is based on the following assumptions:

• The population from which the samples are drawn is normally distributed. (See chapter 6 for methods of determining the amount of skewness and kurtosis in a data set.)
• The sample or samples are randomly selected from the population. If the samples are not randomly selected, a generalization from the sample to the population cannot be made.
• When two samples are drawn, the samples have approximately equal variance. The variance of one group should not be more than twice as large as the variance of the other. This is called homogeneity of variance.
• The data must be parametric. The differences between parametric and nonparametric data are examined in more detail in chapter 16.

• ANOVA Assumptions Pg. 181

The F test is based on the following assumptions (a quick check of the homogeneity assumption is sketched after this list):

• The population(s) from which the samples are drawn is normally distributed. Violation of this assumption has little effect on the F value among the samples (Keppel, 1991, p. 97). The F test produces valid results even when the population is not normally distributed. For this reason it is considered to be robust.

• The variability of the samples in the experiment is equal or nearly so (homogeneity of variance). As with the assumption of normality, violation of this assumption does not radically change the F value. However, as a general rule, the largest group variance should not be more than two times the smallest group variance.

• The scores in all the groups are independent; that is, the scores in each group are not dependent on, not correlated with, or not taken from the same subjects as the scores in any other group. The samples have been randomly selected from the population and randomly assigned to various conditions for each group. If a known relationship exists among the scores of subjects in the several groups, use repeated measures ANOVA (see chapter 12).

• The data are based on a parametric scale, either interval or ratio. (For nonparametric data analysis, see chapter 16.)

The F test, like the t test, is considered to be robust. It provides dependable answers even when violations of the assumptions exist. Violations are more critical when sample sizes are small or Ns are not equal. If violations are committed that cannot be controlled and that the researcher thinks may increase the possibility of a type I error, a more conservative alpha value should be used to compensate for the violations (i.e., use α = .01 rather than α = .05).
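A minimal sketch of checking homogeneity of variance (Levene's test) alongside a one-way F test, with fabricated group scores.

# Sketch: check homogeneity of variance, then run the one-way ANOVA.
# The three groups below are fabricated scores.
import numpy as np
from scipy import stats

g1 = np.array([23, 25, 27, 22, 26, 24])
g2 = np.array([30, 28, 33, 31, 29, 32])
g3 = np.array([26, 27, 25, 28, 24, 27])

lev_stat, lev_p = stats.levene(g1, g2, g3)     # homogeneity of variance
f_stat, f_p = stats.f_oneway(g1, g2, g3)       # the F test itself
print("Levene p =", round(lev_p, 3), "| F =", round(f_stat, 2), "p =", round(f_p, 4))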

• Greenhouse-Geisser Pg. 205-206

The Greenhouse-Geisser (GG) adjustment consists of dividing the dftime and dferror values by T − 1 (the number of repeated measures minus 1). Note that T − 1 is the same as the value of dftime (equation 12.06); hence, in the GG adjustment, the adjusted dftime = dftime/(T − 1) = 1. Degrees of freedom for error are adjusted by the same method, so that dferror becomes dferror/(T − 1). In the bicycle research example, the adjusted dferror = 36/(5 − 1) = 9. We then reevaluate F using tables A.4, A.5, and A.6 and the adjusted df (1, 9). For df (1, 9), table A.6 indicates that F must equal 10.56 to be significant at α = .01. Because our obtained value (18.36) exceeds 10.56, we may still conclude that the means of two or more trials are significantly different at p < .01. This application of the GG adjustment assumes maximum violation of the assumption of sphericity. Consequently, when the violation is minimal, this adjustment of degrees of freedom may be too severe, possibly resulting in a type II error (accepting H0 when it is false).
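The adjusted degrees of freedom and critical F from the bicycle example can be verified directly; the sketch below uses scipy in place of table A.6.

# Arithmetic from the bicycle example: the GG adjustment (assuming maximum sphericity
# violation) divides both df by (T - 1). Critical F taken from scipy rather than table A.6.
from scipy import stats

T = 5                                  # repeated measures (trials)
df_time, df_error = T - 1, 36
adj_df_time = df_time / (T - 1)        # = 1
adj_df_error = df_error / (T - 1)      # = 9
crit_f = stats.f.ppf(1 - 0.01, adj_df_time, adj_df_error)
print(adj_df_time, adj_df_error, round(crit_f, 2))   # (1, 9) and F_crit ≈ 10.56

obtained_f = 18.36
print("still significant at .01:", obtained_f > crit_f)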

Tukey's HSD

Tukey's HSD test is named after John Tukey (1915-2000), who first developed the procedure. Tukey's test, like Scheffé's, calculates the minimum raw score mean difference that must be attained to declare significance between any two groups. However, Tukey's test does not permit all possible comparisons; it permits only pairwise comparisons. Any single group mean may be compared with any other group mean. The formula for HSD is

HSD = q √(MSE / n), (11.16)

where q is a value from the Studentized range distribution (see tables A.7, A.8, and A.9 in appendix A) for k and dfE at a given level of confidence (note that k, the number of groups, is used, not degrees of freedom between), MSE is the mean square error value from the ANOVA, and n is the size of the groups. Equation 11.16 assumes that the n values in each group are equal; it may be modified to compare any two groups with unequal values of n. The resulting values (2.40 and 1.95, rounded to the nearest 100th) represent the minimum raw score differences between any two means that may be declared significant. Tukey, a more liberal test, confirms Scheffé but also finds that groups 2 and 1 differ at p < .01 (see table 11.6). The values at p = .01 and p = .05 are both lower in Tukey's HSD test than in Scheffé's I. This makes HSD more powerful (i.e., more likely to reject H0) than Scheffé's I. Because we started with the null hypothesis and are not making any comparisons other than pairwise (i.e., we are not interested in the combined mean of two or more groups compared with other combined means), Tukey's test is appropriate. Scheffé may be too conservative for this research design. Based on the analysis by Tukey, group 2 is significantly different from groups 1, 3, and 5 at p < .01. Figure 11.1 presents the results in bar graph form. The T symbol above each bar represents standard deviation.
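A minimal sketch of the equal-n HSD computation; k, dfE, MSE, and n are hypothetical stand-ins, not the values behind the 2.40 and 1.95 reported above.

# Sketch of the HSD computation: q comes from the Studentized range distribution
# for k groups and dfE error df. MS_error, n, and k below are hypothetical stand-ins.
import math
from scipy.stats import studentized_range

k, df_error = 5, 45          # number of groups and error df (hypothetical)
ms_error = 3.2               # mean square error from the ANOVA (hypothetical)
n = 10                       # per-group size (equal-n form of equation 11.16)

for alpha in (0.05, 0.01):
    q = studentized_range.ppf(1 - alpha, k, df_error)
    hsd = q * math.sqrt(ms_error / n)
    print(f"alpha={alpha}: q={q:.2f}, HSD={hsd:.2f}")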

• Curvilinear

Two variables may be related in a curvilinear fashion. Pearson's correlation coefficient is a measure of linear relationships. If it is applied to curvilinear data, it may produce a spuriously low value with a corresponding interpretation that the relationship is weak or nonexistent when in fact the relationship is strong but not linear. Figure 8.4 shows a scatter plot for data that have a curvilinear relationship. The curved line represents the true relationship between the X and Y variables. The straight line represents the relationship assumed by Pearson's coefficient. The true relationship is curvilinear and strong. The spurious linear relationship is weak and incorrect. It is important to examine the scatter plot of the data in addition to calculating the correlation coefficient. The pattern of scores on the scatter plot
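A small fabricated demonstration of the point: a strong inverted-U relationship yields a Pearson r near zero.

# Demonstration that Pearson's r understates a strong but curvilinear relationship.
# The inverted-U data are fabricated purely for illustration.
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 50)
y = -(x - 5) ** 2 + 25 + np.random.default_rng(0).normal(0, 1, 50)  # strong inverted-U

r, _ = stats.pearsonr(x, y)
print("Pearson r:", round(r, 3))   # near zero despite the strong curved relationship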

calculating power

The following process is used to calculate power, that is, to determine the 1 − β area in figure 10.1. The researcher sets Zα based on the α value set to reject the null hypothesis. The values for the means, standard deviations, N for each group, the standard error of the difference, and t are calculated. Figure 10.1 demonstrates that t is the sum of Zα and Zβ. In figure 10.1 the analysis was made with X̄1 < X̄2, so the t value is negative (−2.50); Zα proceeds to the right of the control mean (X̄1), and the Zβ value (−0.54) is also negative, proceeding to the left of the experimental mean (X̄2). If the analysis had been made with X̄1 > X̄2, the signs would reverse. In order for the following formulas to be applied toward either tail of the curve, the values of t, Zα, and Zβ will all be considered absolute. Then t is equal to the sum of Zα and Zβ:

t = Zα + Zβ. (10.20)

Conversely,

Zβ = t − Zα. (10.21)

Let us assume the following data apply to figure 10.1: X̄1 = 30, X̄2 = 32.5 (∆ = 2.5), SD = 5 for each group, and N = 50 for each group. The standard error of the difference is then 1.0, so t = 2.50, and with Zα = 1.96 (two-tailed α = .05), Zβ = 2.50 − 1.96 = 0.54. To determine the power area of the experimental curve, we must find the value of Zβ, which marks off the area on the experimental curve between the critical value (Zα) and the experimental mean. We convert the Zβ value of 0.54 to a percentile from table A.1 (Zβ = 0.54 = 20.5% of the area under the normal curve) and compute the area of 1 − β (20.5% + 50% = 70.5%). Therefore 70.5% of all possible values of the experimental population mean lie to the right of Zα, and we will correctly reject the null hypothesis with this probability if the values given in the previous data section are true; power = 70.5%.
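The arithmetic of this example can be reproduced as follows; the Zα = 1.96 value assumes a two-tailed test at α = .05, which is consistent with Zβ = 0.54.

# Reworking the power example: SED for two equal groups, then t, then power as the area
# of the experimental curve beyond the critical value. Z_alpha = 1.96 assumes a
# two-tailed test at alpha = .05.
import math
from scipy.stats import norm

x1, x2, sd, n = 30.0, 32.5, 5.0, 50
sed = math.sqrt(sd**2 / n + sd**2 / n)      # standard error of the difference = 1.0
t = abs(x2 - x1) / sed                      # 2.50
z_alpha = norm.ppf(1 - 0.05 / 2)            # 1.96
z_beta = t - z_alpha                        # 0.54  (equation 10.21)
power = norm.cdf(z_beta)                    # 50% + 20.5% ≈ .705
print(round(t, 2), round(z_beta, 2), round(power, 3))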

• Multiple regression background Pg. 137

The general formula for multiple regression is represented as

YP = a + b1X1 + b2X2 + b3X3 + . . . + bkXk, (9.01)

where b1, b2, b3, . . . , bk are slope coefficients that give weight to the independent variables according to their relative contributions to the prediction of Y. The number of predictor, or independent, variables is represented by k, and a is a constant that is similar to the Y-intercept. When raw data are used, the b values are in raw score units. Sometimes multiple regression is performed on standardized scores (Z) rather than on raw data. In such cases, all raw scores are converted to the same scale, and the b values are then referred to as beta (β) weights. Beta weights perform the same function as b values; that is, they give relative weight to the independent variables in the prediction of Y. In common usage, b values are sometimes called beta weights, but the word beta is only properly used when the equation is in a Z score format. Multiple regression is used to find the most satisfactory solution to the prediction of Y, which is the solution that produces the lowest standard error of the estimate (SEE). Each predictor variable is weighted so that the b values maximize the influence of each predictor variable in the overall equation.
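A minimal sketch contrasting raw b values with beta weights (the same model fit on Z scores); the data and variable names are fabricated.

# Sketch contrasting raw-score b values with beta weights (the regression rerun
# on Z scores). Data are fabricated; only the b/beta relationship is the point.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame({"x1": rng.normal(50, 10, 60), "x2": rng.normal(100, 20, 60)})
y = 3 + 0.8 * X["x1"] + 0.1 * X["x2"] + rng.normal(0, 5, 60)

raw = sm.OLS(y, sm.add_constant(X)).fit()                     # b values (raw units)
z = lambda s: (s - s.mean()) / s.std(ddof=1)
std = sm.OLS(z(y), sm.add_constant(X.apply(z))).fit()         # beta weights (Z scores)
print("b:", raw.params.round(3).to_dict())
print("beta:", std.params.round(3).to_dict())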

Calculating sample size

The only factor in these equations that is easily manipulated by the researcher is N. We could increase our power by increasing N, but how large does N need to be to produce a given power level? Equation 10.20 may be solved for N (equation 10.22); for two independent groups with equal n and SD this gives

N = 2(SD)²(Zα + Zβ)² / ∆². (10.22)

With equation 10.22, we can determine the N needed for a given power level if we know the other values. Suppose we want power of .80 at two-tailed α = .05. This corresponds to β = .20, so from table A.1 we must find the Z value with 30% of the area of the normal curve between the mean and Z and 20% of the area beyond Z. We look up 30% in table A.1 to find that Zβ = 0.84. With SD = 6 for each group and the expected mean difference from the example, we conclude that to achieve 80% power under these conditions, we must use a sample size of approximately 23 in each group. The calculation of power is a major factor in experimental design. It is important to know what the odds are that real differences between group means may be detected before we conduct expensive and time-consuming research. Research performed with insufficient power (i.e., N is too small) may result in a type II error (failure to reject a false null hypothesis) or may waste valuable resources on a study that has little chance of rejecting the null. In a power calculation, the values for the means and standard deviations are not usually known beforehand. To calculate power before the data are collected, these values must be estimated from pilot data or from prior research on similar subjects. The previous power calculation example is applicable only to a t test of independent means, with equal values of both N and SD. This is the simplest application of the concept of power. Similar calculations may be made for unequal values of N or for dependent tests. A variety of software programs can be used to perform power calculations for simple research designs. Additional discussions of power and sample size may be found in Kachigan (1986, p. 185) and Thomas, Nelson, and Silverman (2005, pp. 116-119).
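A sketch of the sample-size arithmetic under the equal-n, equal-SD form given above; the mean difference ∆ = 5 is an assumed illustrative value, not taken from the note.

# Sketch of the sample-size rearrangement: N per group = 2*SD^2*(Z_alpha + Z_beta)^2 / delta^2
# for two independent groups with equal n and SD. The mean difference delta = 5 is an
# assumed illustrative value, not taken from the text.
import math
from scipy.stats import norm

z_alpha = norm.ppf(1 - 0.05 / 2)   # 1.96 for two-tailed alpha = .05
z_beta = norm.ppf(0.80)            # 0.84 for power = .80 (beta = .20)
sd, delta = 6.0, 5.0               # delta is an assumed value for illustration
n = 2 * sd**2 * (z_alpha + z_beta) ** 2 / delta**2
print(math.ceil(n))                # ≈ 23 per group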

• Intraclass correlation Pg. 216-222

The reliability coefficient is formally quantified using what is called an intraclass correlation coefficient (ICC). The Pearson r from chapter 8 is sometimes called an interclass correlation coefficient because two variables are correlated (e.g., height and weight). Here, the intraclass correlation coefficient is "intraclass" because we are calculating an index that comprises the same variable measured on multiple occasions. The intraclass correlation coefficient should look somewhat familiar because we discussed variance ratios in earlier chapters with r squared (chapter 8), R² (chapters 9 and 11), and ω² (chapter 11). Table 13.1 contains example data for a small reliability trial of the Wingate anaerobic power test. Ten subjects performed three Wingate tests from which peak anaerobic power (watts) was determined. Each test was performed 1 week apart. Our first step is to perform a repeated measures ANOVA on the data in table 13.1. We examine the effects for trials to determine whether systematic differences appear across the trials. That is, systematic error may be of concern if the means of the trials are significantly different from each other. The ANOVA summary table for the analysis is presented in table 13.2. From table 13.2, we see that the effect for trials is not significant [F with 2 df in the numerator and 18 df in the denominator: F(2,18) = 0.26, p = .77]. Further, examination of the means across trials at the bottom of table 13.1 shows relatively small mean differences. Therefore, no systematic error is apparent across the three trial periods. We now have the necessary variance components from table 13.2 to calculate intraclass correlation coefficients. We use the plural here because intraclass correlation coefficients can be calculated in several ways, depending on the situation. However, we must first introduce some terminology to be consistent with what is reported in the literature. For some intraclass correlation coefficients, it has become common to refer to the information in the "subjects" source as the "between" subjects variability because it reflects how subjects differ from each other. Therefore, for some intraclass correlation coefficients, the mean square for subjects is referred to as mean square between (MSB). We also need to create a new term called mean square within (MSW) that is the composite of the mean square for trials and the mean square for error. The sum of squares for trials is 557.9 and the sum of squares for error is 19,265.1, the sum of which is 19,823 (sum of squares within). The degrees of freedom for trials is 2 and the degrees of freedom for error is 18, the sum of which is 20 (degrees of freedom within). Therefore, MSW = 19,823/20 = 991.15. With these terms in mind, we have modified the ANOVA summary table as shown in table 13.3. Shrout and Fleiss (1979) described six versions of the intraclass correlation coefficient, depending on the situation. The computational formula for each of these models is shown in table 13.4. The relationship between the computational formulas presented in table 13.4 and the conceptual equations of 13.03 and 13.04 is not intuitively obvious and is beyond the scope of this chapter. The interested reader is referred to in-depth reviews by Shrout and Fleiss (1979), McGraw and Wong (1996), Looney (2000), and Weir (2005).
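A rough sketch of two single-measure ICC forms computed from ANOVA mean squares (in the Shrout and Fleiss style as I understand it); MS for trials, error, and within follow from the sums of squares quoted above, while the subjects mean square is a hypothetical placeholder because it is not given in this note.

# Sketch of two single-measure ICC forms from the ANOVA mean squares.
# MS_trials and MS_error follow from table 13.2 (557.9/2 and 19,265.1/18);
# MS_between (subjects) is a HYPOTHETICAL placeholder, not a value from the note.
k = 3                                       # trials
ms_trials = 557.9 / 2                       # 278.95
ms_error = 19265.1 / 18                     # ≈ 1070.3
ms_within = (557.9 + 19265.1) / (2 + 18)    # 991.15, as in table 13.3
ms_between = 6500.0                         # hypothetical subjects mean square

icc_1_1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)   # one-way form
icc_3_1 = (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)     # consistency form
print(round(ms_within, 2), round(icc_1_1, 3), round(icc_3_1, 3))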

• Independent t-test Pg. 156

The same concepts used to compare one sample mean to a population mean may be applied to compare the means between two samples. In essence, we are testing whether the two samples were drawn from the same population. If the t ratio exceeds the critical ratio from table A.3a, the null hypothesis H0 is rejected and we infer that the two samples were not drawn from the same population. If the t ratio does not exceed the critical ratio, H0 is accepted and we infer that the samples were drawn from the same population. The t test in this situation is referred to as an independent t test. As an example, say an athletic trainer wishes to examine whether a new treatment procedure for ankle sprains results in less ankle swelling than standard treatment 24 hours following the injury. Over several months, 30 athletes with ankle sprains are randomly assigned to receive either the standard care or the new treatment procedure. Ankle swelling is measured using water displacement, and the difference in foot and ankle volume (ml) between the injured and noninjured limbs is calculated. Table 10.1 shows example data from this study. Note that, in order to minimize the influence of rounding error on the calculations, we are reporting X̄ and SD to decimal values beyond the level of precision of the original measurements.
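A minimal sketch of the independent t test for this design; the swelling differences (ml) are fabricated, not the table 10.1 values.

# Sketch of the independent t test for the ankle-swelling design; the volume
# differences (ml) below are fabricated, not the values from table 10.1.
import numpy as np
from scipy import stats

standard_care = np.array([58, 63, 55, 70, 61, 66, 59, 64, 62, 68, 60, 65, 57, 69, 63])
new_treatment = np.array([49, 52, 47, 55, 50, 53, 48, 54, 51, 56, 46, 52, 50, 55, 49])

t, p = stats.ttest_ind(standard_care, new_treatment)   # assumes equal variances
print(round(t, 2), round(p, 4))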

• ETA squared Pg. 192

The simplest measure of the effects of treatment in ANOVA is eta squared, symbolized by R²:

R² = SSB / SST. (11.18)

R squared (R²) is the ratio of the variance due to treatment to the total variance. Note that this is the same R² (and r²) that we address in chapters 8 and 9. It produces a rough estimate of the size of the effect from a study comparing group means, just as it does when assessing the magnitude of effect (variance accounted for) from regression. In the data from the strength training experiment (table 11.5), R² = 50.82/98.17 = .52. This means that 52% of the total variance can be explained by the treatment effects. The remaining 48% is unexplained. The value of R² is a measure of the proportion of the total variance that is explained by the treatment effects. In this case, a little more than half of the total variance is explained. This is a fairly large proportion and confirms the conclusion that at least one treatment was effective in improving strength. In some statistics books (see Tabachnick and Fidell, 1996, p. 53), R² is called eta squared (η²). A variant of eta squared is called partial eta squared (partial η²). Some popular statistical software routinely reports partial η². As the name suggests, partial η² is similar to η²; in fact, in simple ANOVA partial η² and η² are exactly the same. However, in more complicated ANOVA designs (i.e., factorial ANOVA in chapter 14), the values differ. These values are ways of assessing how much variance in the dependent variable is accounted for, or explained by, the independent variable.
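The R² computation for the strength-training example is just the ratio of the quoted sums of squares:

# Eta squared (R^2) for the strength-training example: SS_between over SS_total.
ss_between, ss_total = 50.82, 98.17
eta_sq = ss_between / ss_total
print(round(eta_sq, 2))   # .52 -> 52% of the variance explained by treatment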

Calculating repeated measures ANOVA

To demonstrate how to calculate ANOVA with repeated measures, we analyze a hypothetical study. A graduate student studying motor behavior was interested in the decrease in balance ability that bicycle racers experience as their fatigue increases. To measure this, the researcher placed a racing bicycle on a roller ergometer. A 4-inch-wide stripe was painted in the middle of the front roller, and the rider was required to keep the front wheel on the stripe. Balance was indicated by wobble in the front wheel and was measured by counting the number of times per minute that the front wheel of the bike strayed off the 4-inch stripe over a 15-minute test period. As the test progressed, physiological fatigue increased and it became more and more difficult to maintain the front wheel on the stripe. The 15-minute test period was divided into five 3-minute periods for the purpose of collecting data. Data were collected on the number of balance errors during the last minute of each 3-minute period. In this design, the dependent variable is balance errors and the independent variable is time period (we call this variable "time"), which reflects the increase in fatigue. Table 12.1 presents the raw data in columns and rows. The data (in errors per minute) for the subjects (N = 10) are in the rows, and the data for time (k = 5 repeated measures) are in the columns. The sum of rows (∑R) is the total score for each subject over all five time periods, and X̄ subjects at the right denotes the mean across time for each subject. The sum of each column (∑C) is the total for all 10 subjects on a given trial; ∑XT is presented at the bottom of the table. Remember that ANOVA stands for analysis of variance. We analyze the variance by breaking the total variance in the data set into the relevant pieces. For the repeated measures ANOVA, we partition the total variance into pieces attributed to (1) differences between measurement periods [for the example in table 12.1, these are the differences between time periods (columns) at minute 3 and minute 6 and so on], which is represented by how the means across time differ; (2) differences between subjects; and (3) unexplained variance (the error or residual). Notice that in contrast to the between-subjects ANOVA presented in chapter 11, where we could partition the total variance into only two pieces [between groups and within groups (error)], in repeated measures ANOVA we have added a third piece. The third piece is the component attributable to differences between subjects. Because each subject provides a score for each time period, we can estimate how much of the total variance is due simply to different abilities of different subjects. As noted previously, in between-subjects
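A minimal sketch of a repeated measures ANOVA on long-format data in the spirit of this design, using statsmodels; the error counts are fabricated.

# Sketch of a repeated measures ANOVA on long-format data (subject x time), in the
# spirit of the bicycle-balance design. The error counts are fabricated.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(3)
subjects = np.repeat(np.arange(1, 11), 5)                # 10 riders, 5 periods each
time = np.tile([3, 6, 9, 12, 15], 10)                    # minute marking each period
errors = rng.poisson(lam=np.tile([4, 5, 6, 8, 10], 10))  # balance errors, fabricated

df = pd.DataFrame({"subject": subjects, "time": time, "errors": errors})
res = AnovaRM(df, depvar="errors", subject="subject", within=["time"]).fit()
print(res.anova_table)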

Standard deviation of the residuals

To demonstrate how to calculate the error in prediction we use the push-up example from figure 8.6. Each data point has an error factor (the distance from the point to the best fit line), called a residual. These distances represent the part left over between each predicted Y value and the actual value. Sometimes they are called residual errors because they represent the error in the prediction of each Y value. One would predict that the wrestler who did 58 push-ups (see figure 8.6) would last about 55 seconds (YP). If he actually lasted 51 seconds (Y), the residual error would be −4 seconds, the difference between the actual (Y = 51 seconds) and the predicted (YP = 55 seconds) values. The best fit line represents the best prediction of Y for any X value. Some residuals are large, and some are quite small; indeed, some points fall right on, or very close to, the line. By using the algebraic solution for the best fit line and the residuals, we can calculate the predicted value for Y and the amount of error in the prediction. For example, what is the prediction of Y and the error of prediction for a different subject (not one of the original 15) who did 60 push-ups? By substitution into the prediction equation we get YP = 12.878 seconds + 0.726 seconds per push-up × 60 push-ups = 56.44 seconds. Thus, the predicted Y value is 56.44 seconds. But this prediction has some degree of error. To determine the error in prediction, we could use the generalized equation to predict a Y value for each subject and then compare the predicted Y value (YP) with the actual Y value to determine the amount of error in prediction for each subject (Y − YP). The result of such an analysis is presented in table 8.3. The values for Y − YP represent all the errors, or residuals, around the best fit line. Notice that the sum of the residuals is zero. Also, note that the sum of squares of the residuals is 477.825. We can assume that the residuals are randomly distributed around the line of best fit and that they would fall into a normal curve when plotted. In chapter 5, we defined standard deviation as the square root of the average of the squared deviations from the mean. Therefore, the standard deviation of the residuals can be calculated with a modification of equation 5.03 as follows:
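The prediction for 60 push-ups and a residual standard deviation can be reproduced as follows; the (N − 2) divisor is my assumption about the modified equation, since the note does not show it.

# Reproducing the prediction for 60 push-ups, then a generic residual-SD computation.
# The divisor (N - 2) is an assumption about the modified equation 5.03/8.06; the note
# does not show it explicitly.
import math

a, b = 12.878, 0.726
yp_60 = a + b * 60
print(round(yp_60, 2))           # 56.44 seconds

ss_residuals = 477.825           # sum of squared residuals quoted from table 8.3
n = 15                           # wrestlers in the sample
sd_residuals = math.sqrt(ss_residuals / (n - 2))
print(round(sd_residuals, 2))    # ≈ 6.06 under the N - 2 assumption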

• Friedman's test Pg. 284-286

When subjects are measured three or more times using ranked data, or if interval or ordinal data are converted to ranks, a nonparametric procedure similar to repeated measures ANOVA is used. Friedman's two-way ANOVA by ranks computes a chi-square value for the differences between the sums of the ranks for each repeated measure. This test was developed by statistician and economist Milton Friedman (Friedman, 1937), who received the Nobel Prize in economics in 1976. This test is analogous to the single-factor repeated measures ANOVA but is called two-way because "subjects" is a factor in repeated measures ANOVA. An Example From Physical Education: A researcher wanted to know whether physical education was judged by students to be more or less popular than selected academic classes. Using middle school students as subjects, the researcher asked 10 students to rank physical education, math, and English according to how well they liked the classes; 1 represented the most-liked class, 2 the class in the middle, and 3 the least-liked class. Table 16.10 presents the fabricated results of the hypothetical survey. Friedman's formula computes chi-square among the sums of the ranks; in its standard form, χ² = [12 / (nk(k + 1))] ΣR² − 3n(k + 1), where n is the number of subjects, k is the number of ranked conditions, and R is the sum of ranks for each condition. Using df = k − 1 = 3 − 1 = 2 and α = .01, we note from table A.11 that the critical chi-square = 9.210. Because the obtained value (11.4) is greater than the critical value, we reject the null hypothesis and conclude at p < .01 that students like physical education best. (Note: the data in table 16.10 are fabricated.)
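A minimal sketch of Friedman's test with scipy; the rank columns are fabricated stand-ins for table 16.10.

# Friedman's test on class-preference ranks; these rankings are fabricated
# stand-ins for table 16.10, not the actual table values.
from scipy import stats

pe      = [1, 1, 2, 1, 1, 1, 2, 1, 1, 1]   # ranks given to physical education
math_   = [2, 3, 1, 3, 2, 3, 3, 2, 3, 2]   # ranks given to math
english = [3, 2, 3, 2, 3, 2, 1, 3, 2, 3]   # ranks given to English

chi2, p = stats.friedmanchisquare(pe, math_, english)
print(round(chi2, 2), round(p, 4))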

• Bivariate Regression background Pg. 117-118

When the correlation between two variables is sufficiently high, we can predict how an individual will score on variable Y if we know his or her score on variable X. This is particularly helpful if measurement is easy on variable X but difficult on variable Y. For example, it is easy to measure the time it takes a person to run 1.5 miles and difficult to measure VO2max on a treadmill or bicycle ergometer. If the correlation between these two variables is known, we can predict VO2max from 1.5-mile-run time. Of course, this prediction is not perfect; it contains some error. But we may be willing to accept the error to avoid the difficult and expensive direct measure of VO2max. The following example illustrates how an individual's score on one variable can be used to predict his or her score on a related variable. Suppose we gave two tests of muscular endurance to a group of 15 high school wrestlers. One (X) tests the number of push-ups they can perform in two minutes. The other (Y) tests the number of seconds they can maintain a hand grip isometric force of 50% of maximum until failure. Table 8.3 shows the data. The correlation coefficient between the push-up and isometric fatigue scores is r = .845 (we use three decimal places to add precision to calculations that use r) and the 95% CI = 0.59 to 0.95. Next, we plot the values for each subject on a scatter plot, as shown in figure 8.6. The push-up scores (independent variable) are plotted on the X-axis and the isometric time to failure scores (dependent variable) are plotted on the Y-axis. Note the example of a data point in figure 8.6 for a person who did 43 push-ups and held 50% isometric force for 60 seconds. As discussed earlier in this chapter, a scatter plot of the data clusters around the best fit line. When the best fit line is known, any value of X can be projected vertically to the line, then horizontally to the Y-axis, where the corresponding value for Y may be read. The best fit line is graphically created by drawing it in such a way that it balances the data points. That is, the total vertical distance from each point below the line up to the line is balanced by the total vertical distance from each point above the line down to the line. The average distance of the points above the line is the same as the average distance of the points below the line. The vertical distance from any point to the line is called a residual. Residuals will be positive and negative, and the sum of the residuals will equal zero (in the same way that deviation scores from the mean sum to zero). Using the best fit line, we can predict the time to failure based on the number of push-ups performed. We simply find the push-up score on the X-axis, for example, 58 push-ups, and proceed vertically up to the line and then horizontally over to the Y-axis to read the estimated time to failure. The estimated time to failure is about halfway between 50 and 60, so we estimate that the wrestler would have reached failure in about 55 seconds.
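A minimal sketch of fitting the best fit line and reading off a prediction, with fabricated push-up and time-to-failure pairs (not the table 8.3 data).

# Sketch of the best fit line and a prediction, using scipy's linregress on
# fabricated push-up / time-to-failure pairs (not the table 8.3 data).
import numpy as np
from scipy import stats

pushups = np.array([22, 28, 35, 40, 43, 47, 50, 54, 58, 61, 64, 67, 70, 73, 76])
hold_s  = np.array([30, 33, 38, 41, 44, 47, 49, 52, 55, 57, 60, 61, 64, 66, 68])

fit = stats.linregress(pushups, hold_s)
print("r =", round(fit.rvalue, 3))
print("predicted time for 58 push-ups:", round(fit.intercept + fit.slope * 58, 1), "s")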

more on paired t test

However, in situations in which a researcher tests a group of subjects twice, such as in a pre-post comparison, the groups are no longer independent. Dependent samples assume that a relationship, or correlation, exists between the scores and that a person's score on the posttest is partially dependent on his or her pretest score. The type of t test used in this situation is called a dependent t test (also called a paired t test). This is always the case when the same subjects are measured twice. A group of subjects is given a pretest, subsequently treated in some way, and then given a posttest. The difference between the pretest and posttest means is computed to determine the effects of the treatment. This arrangement is often referred to as a repeated measures design or within comparison. Because both sets of scores are made up of the same subjects, a relationship, or correlation, exists between the scores of each subject on the pre- and posttests. The differences between the pre- and posttest scores are usually smaller than they would be if we were testing two different groups of people. Two test scores of a single person are more likely to be similar than are the scores of two different people. If a positive correlation exists between the two sets of scores, a high pretest score is associated with a high posttest score. The same is true of low scores. Consequently, we have removed a source of noise from our data; specifically, differences between subjects. That is, some of the noise (reflected by the denominator of the t ratio) is due to the fact that different people make up the means in an independent t test. In a dependent t test, these between-subjects differences are eliminated. This same argument holds true for studies using matched pairs, pairs of subjects who are intentionally chosen because they have similar characteristics on the variable of interest. These matched pairs (sometimes called research twins) are then divided between two groups so that the means of the groups on the pretest are essentially equal. One group is treated, and the other group acts as control; then the posttest means are compared. We expect the matched group data to have less noise than if the two groups were not matched on the pretest. In effect, we have forced the groups to be equal on the pretest so that posttest comparisons may be made with more clarity. The matched twins in each group may be considered to be the same person, and the correlation between them on the dependent variable can be calculated. To accomplish this matching process, all the subjects are given a pretest and then ranked according to score. Using a technique sometimes referred to as the ABBA assignment procedure, the researcher places the first (highest scoring) subject in group A, the second and third subjects in group B, the fourth and fifth in group A, the sixth and seventh in group B, and so forth until all subjects have been assigned. The alternation of subjects into groups ensures that for each pair of subjects (1 and 2, 3 and 4, 5 and 6, and so on) one group does not always get the higher score of the pair. This technique usually results in a correlation between the groups on the dependent variable and in smaller mean differences on the posttest.
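A minimal sketch of the ABBA assignment procedure described above, with fabricated pretest scores.

# Sketch of the ABBA assignment procedure: rank subjects by pretest score, then
# alternate A, B, B, A, A, B, B, A, ... so neither group always gets the better
# member of each pair. Pretest scores are fabricated.
pretest = {"s1": 91, "s2": 88, "s3": 84, "s4": 80, "s5": 77, "s6": 74, "s7": 70, "s8": 66}

ranked = sorted(pretest, key=pretest.get, reverse=True)   # highest score first
pattern = ["A", "B", "B", "A"]                            # repeats every 4 subjects
groups = {"A": [], "B": []}
for i, subject in enumerate(ranked):
    groups[pattern[i % 4]].append(subject)
print(groups)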

• Paired t-test Pg. 159

Equation 10.11 produces the same answer as equation 10.07 when N1 = N2, so it could be used in every case. Computers use this equation so that SED can be computed in any case. When the values of N are large and only slightly unequal, the error introduced by using the simpler equation 10.07 to compute the standard error of the difference is probably not critical. But when the values of N1 and N2 differ by a ratio of 2:1, the error introduced by equation 10.07 is considerable. If any doubt exists about which equation is appropriate, equation 10.11 should be used so that maximum confidence can be placed in the result. An Example From Biomechanics: The following example applies equation 10.11 to the mean values obtained in a laboratory test comparing hip and low-back flexibility of randomly selected males and females. The following measurements in centimeters were obtained using the sit-and-reach test: The degrees of freedom are determined using df = (10 − 1) + (8 − 1) = 16. Table A.3a indicates that for df = 16, a t ratio of 2.40 is significant at p < .05. So H0 is rejected and H1 is accepted. The researcher concludes with better than 95% confidence that females are more flexible than males in the hip and low-back joints as measured by the sit-and-reach test. The t value is negative because a larger value was subtracted from a smaller value in the numerator. The sign of the t ratio is not important in the interpretation of t because it may be positive or negative depending on which group is listed first in the numerator. Only the absolute value of t is considered when determining significance, and it is considered bad form to report negative t values (Streiner, 2007). Repeated Measures Design (a Within Comparison): The standard formulas for calculating t assume no correlation between the groups. Both groups must be randomly selected from the population and independent of each other.
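A sketch of the unequal-n computation; the pooled-variance form used here is what I take equation 10.11 to be, and the sit-and-reach values (10 males, 8 females) are fabricated.

# Sketch of the standard error of the difference for unequal group sizes using a
# pooled-variance form (assumed to correspond to equation 10.11). The sit-and-reach
# values (cm) for 10 males and 8 females are fabricated.
import math
import numpy as np

males   = np.array([22.0, 25.5, 20.0, 28.0, 24.5, 23.0, 26.0, 21.5, 27.0, 24.0])
females = np.array([30.0, 33.5, 29.0, 35.0, 31.5, 32.0, 34.0, 30.5])

n1, n2 = len(males), len(females)
s1, s2 = males.var(ddof=1), females.var(ddof=1)
pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
sed = math.sqrt(pooled * (1 / n1 + 1 / n2))                # standard error of difference
t = (males.mean() - females.mean()) / sed
print("df =", n1 + n2 - 2, " t =", round(t, 2))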

• Effect size Pg. 164-165

It is common to report the probability of error (p value) reached by the t ratio. Declaring t to be significant at p = .05 or some similar level only indicates the odds that the differences are real and that they did not occur by chance. This is often termed statistical significance. But we must also consider practical significance. If the values of N are large enough, if standard deviations are small enough, and especially if the design is repeated measures, statistically significant differences may be found between means that are quite close together in value. This small but statistically significant difference may not be large enough to be of much use in a practical application. How important is the size of the mean difference? Thomas and Nelson (2001, p. 139) suggest the use of omega squared (ω²) to determine the importance, or usefulness, of the mean difference. Omega squared is an estimate of the percentage of the total variance that can be explained by the influence of the independent variable (the treatment). For a t test, the formula for omega squared is ω² = (t² − 1) / (t² + N1 + N2 − 1). Note that this is analogous to the coefficient of determination (r²) from chapter 8. In the ankle swelling example, about 11% of the variance is explained by the treatment; the remaining 89% of the variance is due to individual differences among subjects, other unidentified factors, and errors of measurement. How large must omega squared be before it is considered important? The answer to that question is not statistically based. Each investigator or consumer of the research must determine the importance of omega squared. In this example, it is meaningful to know that 11% of the variance can be explained. But other factors, such as severity of injury and individual differences in swelling, also contribute to the variance in ankle swelling. Another method of determining the importance of the mean difference is the effect size (ES), which may be estimated by the ratio of the mean difference over the standard deviation of the control group, or the pooled standard deviation of the treatment groups if no control group exists. The control group is normally used as the estimate of variability because it has not been contaminated by the treatment effect. In the ankle sprain example, the effect size is 0.72. Jacob Cohen (1988, p. 21), as quoted in the work of Winer and colleagues (1991, p. 122), proposes that effect size values of 0.2 represent small differences, 0.5 represent moderate differences, and 0.8 or greater represent large differences. Winer and colleagues (1991) also suggest that effect size may be interpreted as a Z (standardized) score of mean differences. In the ankle swelling example, ES = 0.72 indicates that the effect of the treatment was moderate. Although these guidelines are helpful, do not slavishly follow these standards because the magnitude of effect is best judged in the context of the research question. An effect size of 0.5 may be extremely important in one situation but trivially small in another. The amount of improvement from the pretest to the posttest in repeated measures designs can be determined by assessing the percent of change. The following formula determines the percent of change (improvement) between two repeated measures
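A sketch of the two indices; the omega-squared form (t² − 1)/(t² + N1 + N2 − 1) is as reconstructed above, and the t, N, mean difference, and control SD are hypothetical values chosen only to land near the 11% and 0.72 quoted in the note.

# Sketch of omega squared and effect size for a two-group t test.
# All numeric inputs below are hypothetical, chosen to reproduce roughly 11% and 0.72.
t = 2.17
n1 = n2 = 15
omega_sq = (t**2 - 1) / (t**2 + n1 + n2 - 1)
print(round(omega_sq, 2))             # ≈ .11 -> about 11% of variance explained

mean_diff, sd_control = 8.6, 12.0     # hypothetical ml of swelling
es = mean_diff / sd_control
print(round(es, 2))                   # ≈ 0.72, a moderate effect by Cohen's guidelines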

• Standard error of the estimate

The standard error of the estimate may be interpreted as the standard deviation of all the errors, or residuals, made when predicting Y from X. It is calculated using the sample data that were used to generate the prediction equation, and it has the same units as the dependent variable (Y). Because the standard error of the estimate is the standard deviation of a set of scores (the residuals) that are normally distributed, it can be interpreted like a Z score: 68% of all errors of prediction will be between ±1.0 × SEE, 90% will be between ±1.65 × SEE, 95% will be between ±1.96 × SEE, and 99% will fall between ±2.58 × SEE. (See table 7.1 on page 89 for a review of the relationships between Z, level of confidence, and p.) Equation 8.06 is generally not used with hand calculations because of the tedious calculations involved. Another formula for the standard error of the estimate is easier to use; it requires only the standard deviation of the Y variable and r. Equation 8.07 underestimates the standard error of the estimate when sample sizes are small (Hinkle et al., 1988), which explains the numerical difference in the standard error of the estimate calculated from equations 8.06 and 8.07. Equation 8.07 is more commonly used in statistics texts, but equation 8.06 is more accurate and reflects the calculation used in most statistical software. We are now prepared to estimate error for the predicted Y value for a subject who performed 60 push-ups. Remember that we are using sample data to predict population values, so the standard error of the estimate allows us to predict an
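A sketch of the two SEE computations; the exact forms of equations 8.06 and 8.07 are my assumptions (residual-based with N − 2, and SDY√(1 − r²)), and the data are fabricated.

# Sketch of two SEE computations. The equation forms are assumptions:
# "8.06" as sqrt(sum of squared residuals / (N - 2)), "8.07" as SD_Y * sqrt(1 - r^2).
# The actual and predicted values, and r, are fabricated.
import math
import numpy as np

y  = np.array([30, 35, 41, 44, 47, 50, 52, 55, 58, 60, 62, 64, 66, 68, 70], float)
yp = np.array([32, 36, 40, 43, 48, 49, 53, 54, 57, 61, 61, 65, 67, 67, 71], float)
r = 0.98                                                     # fabricated correlation

see_806 = math.sqrt(((y - yp) ** 2).sum() / (len(y) - 2))    # from the residuals
see_807 = y.std(ddof=1) * math.sqrt(1 - r ** 2)              # from SD_Y and r
print(round(see_806, 2), round(see_807, 2))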

