EVAL CH. 6
Define validity.
Does the test measure what it purports to measure? The degree of truthfulness (accuracy) of a measure; validity = RELIABILITY + RELEVANCE. A test may be valid in one set of circumstances but not in another!
Explain what we mean when we say that R&V are population dependent. Is a test that is determined to be valid in DI athletes necessarily valid in high school athletes?
No. It is necessary to have subjects who are similar to the target population; a test may be valid or reliable only for a certain population. The reliability or validity obtained is specific to the group being tested, the testing environment, and the testing procedures.
Define objectivity.
Rater reliability: scores are consistent from one grader to another because of a well-defined scoring system for the correct/best answer. A test can be objective but not reliable or valid.
How can you correct systematic error?
re-calibrating the system
The focus of reliability is on ________ error.
RANDOM. We want to decrease random error; variance is a good indicator of error.
Methods of Obtaining a Criterion Measure:
- Actual participation/performance/outcome: e.g., golf, archery, doing the job (e.g., a UPS worker, military), NFL performance, graduate school performance, heart disease developed later in life
- Known valid test/gold-standard test/criterion measure: e.g., lab-measured VO2max, DEXA-determined % body fat
- Expert judges: a panel of judges/coaches determines who is best, 2nd best, 3rd best
- Tournament participation: round robin (whoever wins when everyone plays everyone else in a tennis tournament is the "best")
Recognize (don't memorize) 6 ways to reduce measurement error.
- Choose reliable and valid tests
- Train testers
- Give clear instructions
- Demonstrate or provide an example
- Provide warm-up and practice
- Ensure an optimal environment
A reliability coefficient will be between __ and __.
0-1
Know three limitations of using the PPMCC for reliability. Know that these are reasons why the intraclass correlation coefficient (ICC) is superior to the interclass correlation coefficient (PPMCC) in the determination of reliability.
1. It can only be used for two trials, not more. 2. The PPMCC is technically a bivariate statistic, whereas reliability involves testing the same thing twice. 3. The PPMCC can be misleading because it does not detect a substantial change in the mean from trial to trial; it only assesses relative reliability (i.e., whether Ss maintain their rank or position in the group from trial to trial), so systematic differences between trials go undetected.
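A minimal sketch (Python, made-up data) of limitation 3: if every subject improves by exactly 5 units from trial 1 to trial 2, the PPMCC is still 1.0 because rank order is perfectly preserved, whereas an absolute-agreement ICC(2,1) computed from the repeated-measures ANOVA mean squares is pulled down by the systematic shift.

```python
import numpy as np
from scipy.stats import pearsonr

trial1 = np.array([50., 55., 60., 65., 70., 75.])   # made-up trial 1 scores
trial2 = trial1 + 5.0                                # every subject improves by exactly 5 units

r, _ = pearsonr(trial1, trial2)                      # PPMCC = 1.0: rank order is preserved

# Absolute-agreement ICC(2,1) built from the two-way repeated-measures ANOVA mean squares
scores = np.column_stack([trial1, trial2])
n, k = scores.shape
grand = scores.mean()
msr = k * np.sum((scores.mean(axis=1) - grand) ** 2) / (n - 1)    # between-subjects MS
msc = n * np.sum((scores.mean(axis=0) - grand) ** 2) / (k - 1)    # between-trials MS
sse = np.sum((scores - scores.mean(axis=1, keepdims=True)
                     - scores.mean(axis=0, keepdims=True) + grand) ** 2)
mse = sse / ((n - 1) * (k - 1))                                   # residual MS
icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

print(f"PPMCC = {r:.3f}, ICC(2,1) = {icc:.3f}")                   # 1.000 vs. roughly 0.88
```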
Know how you determine test-retest reliability. That is, what are the steps you take? How might you show this data on a scatterplot? (e.g., what are the axes? Hint: trial 1 on x, trial 2 on y). When you have perfect reliability, know that all points fall on the "line of identity".
1. Select a defined group of >30 Ss; ideally you want lots and lots of Ss.
2. Test under controlled conditions to minimize error.
3. Retest under the same controlled conditions to minimize error.
4. Calculate a reliability coefficient (PPMCC or repeated-measures ANOVA).
5. Determine whether the reliability coefficient is "acceptable" or not; interpret it.
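A minimal sketch (Python, made-up trial 1 / trial 2 scores) of the scatterplot described above: trial 1 on the x-axis, trial 2 on the y-axis, with the line of identity (y = x) that every point would fall on under perfect reliability.

```python
import numpy as np
import matplotlib.pyplot as plt

trial1 = np.array([48., 52., 55., 61., 64., 70.])   # made-up test scores
trial2 = np.array([50., 51., 57., 60., 66., 69.])   # retest under the same conditions

plt.scatter(trial1, trial2)
lims = [trial1.min() - 2, trial1.max() + 2]
plt.plot(lims, lims, linestyle="--", label="line of identity (y = x)")
plt.xlabel("Trial 1 score")
plt.ylabel("Trial 2 score")
plt.legend()
plt.show()
```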
How do you determine validity? That is, what are the steps you take?
1. Select a defined group of >50 Ss; ideally you want lots and lots of Ss.
2. Establish reliability (using the previously described methods).
3. For criterion-related validity, test all Ss with the new test and the criterion measure and relate the scores (e.g., PPMCC); you want a strong correlation.
4. For convergent or discriminant validity, test Ss with the new test and determine whether there is a strong relationship (PPMCC) to other measures of the same construct (and no relationship to measures it should not be related to).
5. For the known-group differences method, test two or three groups that theoretically should differ on the test and determine whether there are mean differences between/among groups (t-test or ANOVA).
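A minimal sketch (Python, made-up scores) of steps 3 and 5: correlate the new test with the criterion measure, and compare groups that should theoretically differ with an independent t-test.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

new_test  = np.array([31., 35., 40., 44., 47., 52., 55., 60.])   # made-up field-test scores
criterion = np.array([33., 36., 41., 43., 49., 50., 57., 61.])   # made-up criterion scores

r, p = pearsonr(new_test, criterion)        # a strong r supports criterion-related validity

trained   = np.array([58., 61., 63., 66., 70.])                  # made-up group scores
untrained = np.array([40., 43., 45., 47., 50.])
t, p_grp = ttest_ind(trained, untrained)    # a mean difference supports known-group validity

print(f"validity coefficient r = {r:.2f}; known-group t = {t:.2f} (p = {p_grp:.3f})")
```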
Make sure you understand what face/logical validity is and how it is established.
= degree to which a measure obviously involves the performance being measured.
• Sometimes referred to as "face validity."
• Examples: a static balance test in which the subject balances on one foot; a timed running event for speed.
• Determined by Ss or experts.
• Evaluated logically, not statistically.
Make sure you understand what is content validity and how it is established.
= degree to which a test adequately samples what was covered in a course/unit.
• Usually used in educational settings.
• Also used in attitude assessments: do the items represent the categories for which they are written? This is checked by multiple-expert review and % agreement (with the original categorization reported).
• Example: a basketball skills test should theoretically include items that constitute the game of basketball (shooting, dribbling, passing).
• Evaluated logically, not statistically.
Make sure you understand what is criterion validity.
= degree to which scores on a test are related to some recognized standard or criterion. Are those who scored well on the test generally scoring well on the gold standard/criterion? Are those who scored poorly on the test generally scoring poorly on the criterion? If so, the criterion-related validity of your test is supported.
Be able to differentiate relative from absolute reliability. Give examples of relative vs. absolute reliability. See first two pages of the OPTIONAL Brunton et al. reading on reliability posted on Canvas if you don't understand the concepts from the slides.
Absolute reliability is the degree to which repeated measurements vary for individuals (the less they vary, the higher the reliability); e.g., the standard error of measurement (SEM) and the coefficient of variation (CV). Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements; e.g., PPMCC or ANOVA-based ICC.
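For reference, the usual formulas for the two absolute-reliability statistics named above, in standard notation (not from the slides): $s_x$ is the between-subjects standard deviation, $r_{xx}$ the test-retest reliability coefficient, and $\bar{x}$ the mean score.

\[
\mathrm{SEM} = s_x\sqrt{1 - r_{xx}}
\qquad\qquad
\mathrm{CV} = \frac{s_x}{\bar{x}} \times 100\%
\]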
What are two types of criterion-related validity? How are they distinguished? What is a good criterion? Study the bottom of page 104 and the top of page 105 to see what criterion measures are used to validate other tests.
Concurrent validity = the measuring instrument is correlated with some criterion (e.g., a lab technique, judges' ratings) administered concurrently, or at about the same time.
• Examples: validate a step test (with HRs taken in recovery) as a way of estimating CRE (VO2max) by relating it to lab-measured VO2max (= criterion); assess whether a skills test is valid for sport performance (see if the skills test relates to a judged criterion); assess whether a 10-item test of anxiety can be given in a shorter version vs. the lengthier criterion version.
• Determine concurrent (and predictive) validity by correlating scores on the new test vs. the criterion to obtain a validity coefficient (PPMCC, or the multiple correlation coefficient in the case of multiple predictors). Often used when the researcher wants a shorter, more easily administered test for a criterion that is more difficult, risky, or expensive to measure.
Predictive validity = degree to which scores on predictor variables can accurately predict ("estimate") criterion scores (i.e., the criterion is some later behavior).
• Example: use college GPA and GRE scores to predict grad school success.
• Try to ascertain the known "base rate" of occurrence of something to know whether your predictor will add useful information in explaining the variance (i.e., if the base rate is very high or low, a predictive measure may have little practical value because the increase in predictability is negligible).
Define reliability.
Consistency/repeatability of an observation; the degree to which repeated measurements of the same trait are reproducible under the same conditions (precision).
To increase reliability you need to decrease _________.
ERROR
True or False? For a test to be reliable it must be valid. YOU MUST KNOW THIS.
FALSE
Know 4 main types of validity inside out and upside down!!!!!!
Face/Logical: does it make sense that this test measures what it says it measures? (determined by Ss or experts)
Content-related: truth based on logical decision making; expert-determined; e.g., does the test represent the knowledge content presented during the unit? Does the test cover the relevant component abilities in appropriate proportion?
Criterion: do scores on the test relate to a "gold standard"/criterion measure?
- Concurrent: the criterion is measured at the same time as the alternative measure (e.g., criterion = VO2max via open-circuit spirometry)
- Predictive: the criterion is assessed in the future (e.g., criterion = NFL success, college success)
Construct: does the test measure the underlying theoretical construct? Is the construct (concept) related to what it theoretically "should" be related to and NOT related to what it "shouldn't" be related to (i.e., convergent and discriminant validity), and is known-group differences validity supported?
True or False? A single predictor usually explains more of the variance in the criterion than do multiple predictors.
FALSE. Multiple regression (R) is often used to determine criterion validity (concurrent and predictive); several predictors are likely to yield a greater validity coefficient than the correlation between any one test and the criterion.
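A minimal sketch (Python, made-up GPA/GRE numbers) of this point: the multiple correlation R from two predictors together is always at least as large as the correlation of either predictor alone with the criterion.

```python
import numpy as np
from scipy.stats import pearsonr

gpa  = np.array([3.1, 3.4, 3.0, 3.8, 3.6, 2.9, 3.7, 3.3])          # made-up predictor 1
gre  = np.array([152., 158., 150., 166., 161., 148., 163., 155.])  # made-up predictor 2
grad = np.array([3.2, 3.6, 3.0, 3.9, 3.5, 3.1, 3.8, 3.3])          # made-up criterion (grad GPA)

r_gpa, _ = pearsonr(gpa, grad)
r_gre, _ = pearsonr(gre, grad)

X = np.column_stack([np.ones_like(gpa), gpa, gre])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, grad, rcond=None)      # least-squares regression weights
R = np.corrcoef(X @ beta, grad)[0, 1]                # multiple correlation coefficient

print(f"r(GPA) = {r_gpa:.2f}, r(GRE) = {r_gre:.2f}, multiple R = {R:.2f}")
```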
Recognize all and be able to list some of the sources of error and factors affecting reliability...and think about how you would control for these. I WILL ask, so please review slides #29-34.
- Appropriate difficulty for the subjects (is the test too easy or too difficult?)
- Time of day
- Tester experience
- Precision of measurement
- Type of instrument used
- Fatigue
- Practice
- Subject variability
- Time between testing
- Circumstances surrounding the testing periods
- Environmental conditions
List the six purposes of evaluation (this is to remind us all why we woke up for a 7:30 class all semester... to help serve our clients, patients, and athletes in these ways).
Placement, diagnosis, prediction, motivation, achievement, program evaluation
The standard error of measurement (SEM) is the ____________ for the distribution of measures for an individual repeatedly measured. It is a measure of _____ reliability.
STANDARD DEVIATION; a measure of ABSOLUTE reliability.
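A worked example with the SEM formula given earlier (made-up numbers): if the between-subjects SD is 10 and the test-retest reliability is $r_{xx} = 0.91$, then

\[
\mathrm{SEM} = s_x\sqrt{1 - r_{xx}} = 10\sqrt{1 - 0.91} = 10\sqrt{0.09} = 3.
\]

On the usual normal-error interpretation, roughly 68% of an individual's repeated scores would fall within ±1 SEM (here, ±3 units) of that person's true score.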
Define and give an example of systematic error.
Systematic error = predictable errors of measurement; e.g., a timer has a constant delay.
True or False? Error can be positive or negative and the mean error is zero.
TRUE
True or False? For a test to be valid it must be reliable. YOU MUST KNOW THIS.
TRUE
True or False? PPMCC or ICC can be used to assess test-retest reliability. Both are measures of relative reliability.
TRUE
True or False? Reliability is the ratio of the true score variance to the observed score variance.
TRUE
True or False? Reliability is thought of as the proportion of observed score variance that is true score variance. A test is reliable to the extent that observed score variation is made up of true score variation!
TRUE
True or False? Reliability is the ratio of the true score variance to the true score variance + error variance.
True
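In standard classical test theory notation (consistent with the three true/false items above and with "observed score = true score + error score"):

\[
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
\qquad\Longrightarrow\qquad
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\]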
What is the validity coefficient? What is the SEE? How do you INTERPRET the SEE?
The validity coefficient represents the degree to which a measure correlates with the criterion or construct (range from -1 to +1; the closer the absolute value is to 1, the greater the validity). The SEE (standard error of estimate, a validity statistic) indicates the degree to which a person's predicted/estimated score will vary from the criterion score.
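A common textbook form of the SEE, in standard regression notation (not taken from the slides): $s_y$ is the SD of the criterion scores and $r_{xy}$ is the validity coefficient.

\[
\mathrm{SEE} = s_y\sqrt{1 - r_{xy}^2}
\]

To interpret it: on the usual normal-error assumption, roughly 68% of actual criterion scores fall within ±1 SEE of the score predicted from the test.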
When do you use the "mean" score in a multiple-trial test? When do you use the "best" score in a multiple-trial test?
When the criterion score is used as an indicator of MAXIMAL POSSIBLE PERFORMANCE, the BEST score should be used (just remember that the best score may not be typical; it could be inflated by measurement error). When it is used as an indicator of TRUE ABILITY (typical performance), the MEAN score is used.
Define and give an example of random error.
Errors due to chance that can affect a subject's score in an unpredictable way from trial to trial; e.g., the subject doesn't stand up straight when height is measured.
Distinguish inter- vs. intra-rater reliability. Which is likely to be higher? Aren't we talking about objectivity here?
Intra-rater reliability = degree of agreement among repeated administrations of a test performed by a single rater (repeated-measures ANOVA). Inter-rater reliability (agreement, or concordance) = degree of agreement among/between raters (PPMCC). Intra-rater reliability is generally higher than inter-rater reliability, and inter-rater (rater) reliability is what we mean by objectivity.
Generally, the longer the time between test administrations the __________ the reliability. (Fill in the blank with "higher" or "lower".)
lower
What type of reliability is used most often in the health and fitness domain?
test-retest
Why does the ACSM give so many details on measurement of circumferences?
To lessen error... to increase reliability and rater reliability (objectivity). Note: your job as a tester is to minimize error! Maybe you need a defined scoring system, standardized procedures, etc.
True or False? Correlation and multiple regression are the main statistics used to determine criterion-related validity.
TRUE. These statistics indicate whether two or more measures are related.
What two components make up an observed score?
true score + error score
What other test characteristics other than R&V must you consider when choosing a test? REMEMBER: tests need to be practical!!!
• Discrimination: ability to differentiate between people
• Practicality: feasibility within the constraints of the tester and situation (cost, time, etc.)
• Mass testability: ability to test many people at the same time
• Documentation: test manual and other information
What is construct validity?
• Used to validate measures that are unobservable yet exist in theory.
• = degree to which a test measures a hypothetical construct (e.g., anxiety, sportsmanship); usually established by relating the test results to some behavior.
• If, in theory, the construct is valid, then such-and-such should occur; e.g., self-esteem should be related to self-worth and confidence, and should not be related to mood.
• Convergent and discriminant evidence and/or known-group differences are ways of establishing construct validity.
• How established: (usually) relate the test to some behavior using correlational techniques (Pearson r; factor analysis). A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity:
- Known-group differences method (also called divergent-groups validity) = the test scores of groups that should differ on a trait or ability are compared; e.g., a test of anaerobic power should show higher scores in sprinters than in distance runners.
- Convergent validity = the test should correlate with similar traits; that is, the degree to which an operation is similar to (converges on) other operations it theoretically should be similar to. For instance, to show the convergent validity of a test of mathematics skills, scores on the test can be correlated with scores on other tests designed to measure basic mathematics ability; high correlations between the test scores would be evidence of convergent validity. Convergent validity shows that the assessment is related to what it should theoretically be related to.
- Discriminant validity (also called divergent validity) = the test should NOT correlate with different traits.