Topic 1: Basic Measurement Issues
convergent validity
Do different measures of same trait correlate with each other even when they use different methods of measurement?
Factor analysis (way to measure Construct V)
provides analytical method for estimating correlation b/w specific variable (test score) and scores on factor
Issues with using correction for attenuation / Spearman's true score formula
-Different researchers may have different definitions of error and may correct for different things. -Different types of reliability measures are sometimes mixed, making it difficult to evaluate corrected correlations.
Cronbach's theory of generalizability
-alternate method of measuring consistency of test scores -focuses on generalizing from one set of measures to another rather than on the amount of measurement error (which is the focus of reliability theory) -central question: under what conditions can we generalize results? -conceptual rather than statistical -think of reliability as situational (dependent on the use of test scores) rather than a property of the test scores themselves -more practical in real life situations
Inter rater reliability
-consistency across scores assigned by different judges or raters -used for clinical interview, job interview, performance rating, thematic apperception test -compute avg correlation b/w scores provided by different judges/raters
What is the definition of reliability?
-consistency/stability of test scores -proportion of variance in test scores that is systematic and attributable to the construct of interest (if 80% is systematic, 20% is random error) -reliability theory helps you figure out what proportion is systematic vs. random error -Observed test score = true score + error of measurement
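In classical test theory notation (standard formula, consistent with the observed = true + error equation above), reliability is the proportion of observed-score variance that is true-score variance:

$$ r_{xx} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} $$

So a reliability of .80 means 80% of the variance in observed scores is systematic (true-score) variance and 20% is error variance.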
validity of measurement
-content, face, construct validity -extent to which test measures what they seem to measure
validity of decisions
-criterion related validity (predictive and concurrent) -extent to which test scores contribute to making correct predictions or decisions about individuals
What level of reliability is sufficient?
-depends on how important the decisions that will be made on basis of test scores -depends on how many different categories individuals will be sorted into (fewer categories = lower reliability is acceptable)
Face validity What is it? Why do we care about it?
-extent to which test appears to measure what it should be measuring -based on perspective of test takers or non-experts -can influence test takers'/test users' opinions & reactions to test -may increase test-taking motivation -should not be used as substitute for other methods of assessing validity
What is the definition of validity?
-extent to which test measures what it is supposed to measure
Content related evidence of validity (content validity) What is it? Methods of assessing?
-extent to which test's items provide representative sample of what test is supposed to measure -should be built into the test by: a. define trait you wish to measure in specific terms b. create test items that are consistent with definition -judgement of panels of experts -usually used for concrete domains like classroom tests
participant attrition
-i.e., drop out of the study -attrition may be totally random, or the same sort of ppl may drop out (non-random) -if attrition is non-random, estimate of correlation won't be valid
carry over effects
-make test scores look more stable -if you remember the questions and what you answered, and you think you should answer the same way again, this artificially increases the reliability estimate
thematic apperception test
-measure aspects of personality -given pictures, write story about pictures -tells scorer something about your personality
Criterion related validity a. What is it? b. Predictive validity design-description and problems/issues c. Concurrent validity design-description and problems/issues d. Range restriction e. Should you use predictive or concurrent design? f. How strong should criterion related validity correlation be?
a. measures relationship b/w test scores and criterion -criterion: measure of the outcomes that specific decisions are designed to produce -method of estimating: compute correlation b/w test scores & criterion b. predictive design: administer test at time A (before hiring), measure criterion at time B (after hiring), compute correlation b/w test scores and criterion scores -problems: passage of time increases participant attrition; practical objections (impractical because decisions are made without test scores; for the design to work, applicants must either be randomly accepted for the job or all must be accepted); if you hire ppl with the full range of scores, you lower overall productivity; ethical objection: not ethical to hire someone who scores low just for the sake of measuring the correlation c. concurrent design: administer test and measure criterion at roughly the same point in time --> size of validity often similar to predictive validity -compute correlation b/w test scores and criterion scores -uses incumbents (ppl who already have the job) -quicker and easier than predictive design -problems: test-taking motivation, job experience, range restriction (reduces correlation b/w test scores and criterion measures), current employees may be systematically different from the applicant population (so not a useful estimate of validity) e. key question: are differences b/w incumbents and applicants likely to be important? f. .6 is good but rare; .1 or .2 might be acceptable -depends on a variety of factors -utility analysis takes into account other factors -ex: if you had 1000 ppl apply to a job and you only need to hire 50, even a test with low validity is useful; you can set the cut-off score for that test really high (only hire people in the top 90th percentile)
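A minimal sketch (made-up numbers, not from the course) of how a criterion-related validity coefficient is computed: correlate test scores collected at hiring with a criterion measured later.

```python
import numpy as np

# hypothetical data: selection-test scores at time A (hiring)
# and job-performance ratings at time B (a year later)
test_scores = np.array([62, 75, 58, 90, 70, 84, 55, 68, 77, 81])
performance = np.array([3.1, 3.8, 2.9, 4.5, 3.4, 4.1, 2.7, 3.3, 3.9, 4.0])

# criterion-related (predictive) validity = correlation b/w test and criterion
validity = np.corrcoef(test_scores, performance)[0, 1]
print(round(validity, 2))
```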
coefficient alpha
-most widely used, most general form of internal consistency -represents mean reliability coefficient one would obtain from all possible split halves -used to estimate how much scores might fluctuate due to measurement error; estimates effects of unreliability on correlation b/w tests and criteria of interest
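A minimal sketch of the standard coefficient alpha formula applied to a made-up item-score matrix (the data and function name are illustrative, not from the course):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_people x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# hypothetical 5-person, 4-item test (stronger inter-item correlation --> higher alpha)
scores = [[4, 5, 4, 5],
          [2, 3, 2, 2],
          [5, 5, 4, 5],
          [3, 3, 3, 4],
          [1, 2, 2, 1]]
print(round(cronbach_alpha(scores), 2))
```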
Multitrait multimethod approach (way to measure Construct V)
-use a number of methods to measure more than one trait or construct -investigate and compare the correlations -solid triangle: different traits measured using the same method -broken triangle: different traits measured by different methods --> best estimate of true relations since there is no shared method bias -underlined values: convergent validity --> multiple measures of the same construct converge to yield similar results (same trait, different methods) -convergent validity is the first link to construct validity (illustrative matrix below)
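A toy illustration (hypothetical values, not from the course) with two traits (A, B) each measured by two methods (1, 2). The underlined same-trait/different-method values are the convergent validities and should be the largest off-diagonal entries; the different-trait/same-method values (.30, .28) reflect shared method variance; the different-trait/different-method values (.25, .20) should be the smallest:

$$
\begin{array}{c|cccc}
 & A_1 & B_1 & A_2 & B_2 \\
\hline
A_1 & 1 & & & \\
B_1 & .30 & 1 & & \\
A_2 & \underline{.60} & .25 & 1 & \\
B_2 & .20 & \underline{.65} & .28 & 1 \\
\end{array}
$$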
Construct validity What is it? Methods of assessing (multitrait multimethod matrix and other methods)
-overall judgment as to whether or not a test measures what it purports to measure -based on all available evidence: -content -criterion-related -experiments -factor analysis (tells you how many dimensions there are) -multitrait multimethod matrix analysis (tells you if different measures of same trait are relatively highly related to each other) -a test shows high CV if the pattern of relationships b/w test scores and behaviour measures is similar to the relationships expected from a perfect measure of the construct -the stronger the match b/w expected and actual correlations of measure & behaviour, the stronger the evidence for CV
Which estimate(s) of reliability should I focus on?
-think about intended application(s) of test -test has many items and they are all measuring same thing --> high internal consistency -clinical interview: asked to respond to interviewer 1 at 9 am Monday and interviewer 2 at 11 pm Tuesday --> inter-rater reliability
Alternate forms reliability = coefficient of equivalence
-time frame doesn't matter -common method of estimating: a. develop/acquire alternate (equivalent or parallel) versions of same test b. administer both forms to same ppl c. compute correlation b/w ppl's scores on 2 forms of test PROS: a. practical, use same ppl more than once b. backup copy of test is available if original is compromised c. no carry over effects since the two tests are different d. reduction of reactivity CONS: e. difficult, expensive to develop alternate/parallel form f. not guaranteed that the 2 tests will actually be equivalent g. lack of equivalence = low reliability
Internal consistency-most common
-to assess homogeneity/consistency of test items -affected by: a. correlation b/w items -if all items correlate you'll get higher internal consistency b. number of items -with more items, you'll have higher internal consistency -if you only have 2 items you might get a high score on both by luck, but with 20 items there's less chance of getting lucky (so a high score is more indicative of the actual true score) -methods of estimating: a. split half (not good) -apply Spearman-Brown statistical correction to the correlation (formula below) -^tells you how strongly test would correlate with itself if it were full length instead of half -easiest way to create 2 alternate forms is to split the test in half -carry over effects, reactivity effects are minimized -but there are many unequal ways to split in half (a common approach is to split odd vs. even items) b. Cronbach's alpha (more common) -higher alpha if there are stronger correlations b/w scores on items c. Kuder-Richardson 20 (KR20) -same as alpha but a computational shortcut if items on test are dichotomous
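The split-half correction mentioned above is the standard Spearman-Brown formula for doubling test length (the half-test correlation is stepped up to an estimate for the full-length test):

$$ r_{full} = \frac{2\, r_{half}}{1 + r_{half}} $$

e.g., if the two halves correlate .60, the estimated full-length reliability is 1.2 / 1.6 = .75.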
What are the different ways of assessing validity?
1. Content related evidence of validity (content validity) 2. Face validity 3. Criterion-related 4. Construct
What are the different ways of assessing reliability?
1. Test-retest= coefficient of stability 2. Alternate forms reliability= coefficient of equivalence 3. Internal Consistency 4. Inter-rater reliability
factors that affect reliability of test
1. mean inter item correlation (r) 2. length of test (K) 3. variance of true scores in sample
Pearson Product-Moment Correlation Coefficients aka correlation coefficient -5 points
1. r (sample) or ρ (rho; population) 2. ranges from -1 to 1 3. - = negative relationship; + = positive relationship 4. absolute value = strength of relationship; closer to 0 is weak, closer to 1 is strong 5. correlation doesn't imply causality
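For reference, the standard sample formula (x̄ and ȳ are the sample means):

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$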
Problems with test retest aka coefficient of stability
1. reactivity 2. impractical bc expensive & time consuming 3. carry over effects 4. test taking motivation 5. participant attrition
Multitrait multimethod suggests good test of construct if:
1. scores on test are consistent with scores obtained from other measures of same construct 2. test yields scores not correlated w/ measures that are theoretically unrelated to construct 3. method of measurement employed by test shows little evidence of bias
example of concurrent vs predictive criterion validity strategy
Concurrent: uses scores of screened sample to evaluate relationship b/w test and criterion Predictive: uses score of representative or unscreened sample to evaluate relationship b/w test and criterion
What's the goal of reliability theory?
Determine how much of variability in test scores is due to E and T -suggest ways to improve test so that E is minimized
discriminant validity
Is your test relatively uncorrelated with tests that it should be relatively uncorrelated with?
Problems for each test test retest alternate forms split half internal consistency
Test-retest: reactivity, carryover, true changes over time. Alternate forms: non-parallel forms, inconsistencies of test content, reactivity, carryover, true changes over time. Split half: non-parallel halves, inconsistencies of test content. Internal consistency: inconsistencies of test content.
method bias
When two different constructs are measured using the same method, expect some correlation solely as a result of the common method of measurement. The multitrait-multimethod study allows us to estimate the effects of method bias.
Equation for classical test theory *from here down are from reading only
X = T + E -X = score obtained by individual on test (observed measurement) -T = individual's true score on attribute (true amount of attribute in individual) -E = error score associated w/ measurement (features of the individual or situation that have nothing to do with the attribute being measured) -across a large number of individuals, E is assumed to be random -no correlation b/w T & E
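A minimal simulation sketch (made-up parameters) of X = T + E, showing that reliability is the proportion of observed-score variance due to true scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true = rng.normal(50, 10, n)        # T: true scores (variance ~ 100)
error = rng.normal(0, 5, n)         # E: random error, uncorrelated with T (variance ~ 25)
observed = true + error             # X = T + E

# reliability = true-score variance / observed-score variance
print(true.var() / observed.var())  # ~ 100 / 125 = 0.80
```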
What is the difference between reliability and validity? Is it possible for test to be reliable but not valid Is it possible for test to be valid but not reliable
Yes --> a measure of matching colours would produce consistent scores but isn't a valid measure of intelligence. No --> there must be some level of consistency for a test to be valid.
construct
abstract attribute that scientists construct to describe concrete, observable entities or events -although they are hypothetical abstractions, all constructs are related to real/observable things -ex: happiness, honesty -you cannot hold a construct in your hand, but you can study the link of the construct to something observable -how do you do this? 1. identify behaviours related to construct 2. identify other constructs and decide if the two are related 3. identify behaviours related to additional constructs
The bandwidth-fidelity dilemma
bandwidth = amount of info -similar to concept of content coverage of test fidelity = accuracy or precision of info -similar to concept of reliability (lower reliability means score on test might be quite different from the true trait level) -there is a trade-off b/w bandwidth and fidelity, and likewise b/w content coverage and reliability *you can make a test look highly reliable just by narrowing its content, so high reliability doesn't mean it is an excellent test (it may just be very narrow) -think of the example about 4 facets of risk taking
What levels of reliability are typical?
ex: Table 7-6 in the reading, "levels of reliability typically reported for different types of tests and measurement devices" (reading for Oct 4)
Test retest= coefficient of stability
i.Administer test to same group of ppl at two separate points in time ii.Compute the correlation b/w scores from two administrations of the test How long should time lag be? -depends on if you expect scores on construct to be stable ex: GRE (fairly long time lag) vs. anxiety test (may change quickly)--> pick a time lag short enough that true level of item hasn't changed, if it does change you'll get a low correlation
range restriction
if you restrict the range of scores (ex: keep only ppl who score 60 or above), you lower the correlation (it looks weaker) -can correct for this (mentioned in Tett article)
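A small simulation sketch (made-up numbers) showing that restricting the range of test scores shrinks the observed test-criterion correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
test = rng.normal(50, 10, 5000)
criterion = 0.6 * test + rng.normal(0, 8, 5000)    # built-in relationship (r ~ .6)

full_r = np.corrcoef(test, criterion)[0, 1]

keep = test >= 60                                   # e.g., only ppl who score 60+
restricted_r = np.corrcoef(test[keep], criterion[keep])[0, 1]

print(round(full_r, 2), round(restricted_r, 2))     # restricted r is noticeably smaller
```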
standardized alpha
measure of internal consistency -accounts for number of test items and mean inter-correlation among test items
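The standardized alpha formula (standard form; k = number of items, r̄ = mean inter-item correlation):

$$ \alpha_{std} = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}} $$

This makes explicit that internal consistency rises with both more items and stronger inter-item correlations.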
length of test influence on reliability
-the more items you have, the more the random errors cancel each other out (errors are sometimes + and sometimes -) -true score variance doesn't cancel out as the number of items increases -a typical strategy for increasing reliability is to increase test length
What is correction for attenuation? What's the formula, how do we use it?
$r'_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$ -estimate of the correlation you would expect between two imperfect measures if the effects of measurement error could be removed -measurement errors decrease the correlation b/w 2 tests, thus validity is also decreased -provides estimate of correlation b/w perfectly reliable measures of X and Y -the formula can also be used to estimate the correlation b/w two tests if the reliability of either or both tests were improved by a specific amount -allows us to estimate effects of raising or lowering reliability of X and Y on the correlation b/w X and Y Corrected r > 1.0? -means the reliability plugged into the formula was not accurate -lower reliability means you're correcting the correlation more strongly (not a good estimate if reliability is too low) -used in meta-analyses (worked example below)
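A worked example with hypothetical numbers: if the observed correlation is $r_{xy} = .40$ and both measures have reliability .80, then

$$ r'_{xy} = \frac{.40}{\sqrt{.80 \times .80}} = \frac{.40}{.80} = .50 $$

i.e., the estimated correlation between perfectly reliable versions of the two measures is .50.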
mean inter item correlation influence on reliability
-reflects the reliability of single items on the test; each item can be considered a parallel mini-test -used in the calculation of standardized alpha (internal consistency)
criterion
something that is relevant to purpose of test, e.g. a measure of psychological adjustment; a measure of job performance, etc.
SEM
standard error of measurement -measure of precision of observed test score -used to estimate extent to which observed score deviates from true score -standard deviation of theoretically normal distribution of test scores -higher reliability of test = lower SEM score -used to form confidence intervals to tell you how accurate test scores are function of 2 factors: -reliability of test and variability of test scores
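The standard SEM formula (s_x = standard deviation of the test scores, r_xx = reliability of the test):

$$ SEM = s_x \sqrt{1 - r_{xx}} $$

e.g., with s_x = 15 and r_xx = .89, SEM ≈ 5, so a rough 95% confidence interval is the observed score ± 2(5) = ± 10 points.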
reactivity --> makes test scores look unstable; an effect of the testing experience itself
the first time you take a test, something about the experience changes you, so you react differently during the second test; now it looks like the test isn't stable -ex: by thinking about the questions (if they are about anxiety) you may come to think you're more anxious than you actually are
utility theory
theory for combining information about criterion - related validity and the decision context (eg. base rate, selection ratio) to estimate the gains or losses associated with using a test
Spearman brown prophecy formula
used to predict effect of lengthening test (what will new reliability estimate be once we make test longer) -can also use this to determine how much longer test needs to be to obtain desired reliability
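The standard Spearman-Brown prophecy formula, where k is the factor by which test length is multiplied and r_old is the current reliability:

$$ r_{new} = \frac{k\, r_{old}}{1 + (k - 1)\, r_{old}} $$

e.g., tripling (k = 3) a test with reliability .60 predicts a new reliability of 1.8 / 2.2 ≈ .82.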
Two aspects of Validity (validity is a fx of what scores mean)
validity of measurement validity of decisions