Topic 1: Basic Measurement Issues

convergent validity

Do different measures of same trait correlate with each other even when they use different methods of measurement?

Factor analysis (a way to assess construct validity)

provides analytical method for estimating correlation b/w specific variable (test score) and scores on factor

Issues with using the correction for attenuation / Spearman's true-score correction

-Different researchers may have different definitions of error and may correct for different things. -Different types of reliability measures are sometimes mixed, making it difficult to evaluate corrected correlations.

Cronbach's theory of generalizability

-alternate method of assessing consistency of test scores -focuses on generalizing from one set of measures to another, rather than on the amount of measurement error (which is the focus of reliability theory) -central question: under what conditions can we generalize results? -conceptual, not statistical -think of reliability as situational (dependent on the use of test scores) rather than a property of the test scores themselves -more practical in real-life situations

Inter rater reliability

-consistency across scores assigned by different judges or raters -used for clinical interview, job interview, performance rating, thematic apperception test -compute avg correlation b/w scores provided by different judges/raters

What is the definition of reliability?

-consistency/stability of test scores -proportion of variance in test scores that is systematic, i.e., attributable to the construct of interest (if 80% is systematic, 20% is random error) -_______ helps you figure out what proportion is systematic vs. random error Observed test score = true score + error of measurement

validity of measurement

-content, face, construct validity -extent to which test measures what they seem to measure

validity of decisions

-criterion related validity (predictive and concurrent) -extent to which test scores contribute to making correct predictions or decisions about individuals

What level of reliability is sufficient?

-depends on how important the decisions that will be made on basis of test scores -depends on how many different categories individuals will be sorted into (fewer categories = lower reliability is acceptable)

Face validity What is it? Why do we care about it?

-extent to which test appears to measure what it should be measuring -judged from the perspective of the test taker or non-experts -can influence test takers'/test users' opinions of and reactions to the test -may increase test-taking motivation -should not be used as a substitute for other methods of assessing validity

What is the definition of validity?

-extent to which test measures what it is supposed to measure

Content related evidence of validity (content validity) What is it? Methods of assessing?

-extent to which test's items provide representative sample of what test is supposed to measure -should be built into the test by: a. define trait you wish to measure in specific terms b. create test items that are consistent with definition -judgement of panels of experts -usually used for concrete domains like classroom tests

participant attrition

-drop-out -attrition may be non-random (the same sort of ppl drop out) or completely random -if attrition is non-random, the estimate of the correlation won't be valid

carry over effects

-makes test look more stable -if you remember the questions and what you answered, and you think you should answer the same way, this artificially increases reliability

thematic apperception test

-measure aspects of personality -given pictures, write story about pictures -tells scorer something about your personality

Criterion related validity a. What is it? b. Predictive validity design-description and problems/issues c. Concurrent validity design-description and problems/issues d. Range restriction e. Should you use predictive or concurrent design? f. How strong should criterion related validity correlation be?

a. -measures relationship b/w test scores and a criterion -criterion: a measure of the outcomes that specific decisions are designed to produce -method of estimating: compute correlation b/w test scores & criterion
b. Predictive design: administer test at time A (before hiring), measure criterion at time B (after hiring), compute correlation b/w test scores and criterion scores -problems: passage of time increases participant attrition -practical objections (impractical; decisions are made without test scores, but for this design to work, applicants must either be accepted at random or all be accepted) -if you hire ppl with the full range of scores, you lower overall productivity -it's also not ethical to hire someone who scores low just for the sake of measuring the correlation (ethical objection)
c. Concurrent design: administer test and measure criterion at roughly the same point in time --> size of validity often similar to predictive validity -compute correlation b/w test scores and criterion scores -uses incumbents (ppl who already have the job) -quicker and easier than predictive design -problems: test-taking motivation, job experience, range restriction (reduces correlation b/w test scores and criterion measures), and current employees may be systematically different from the applicant population (so not a useful estimate of validity)
e. ask whether differences b/w incumbents and applicants are likely to be important
f. .6 is good but rare; .1 or .2 might be acceptable -this depends on a variety of factors -utility analysis takes other factors into account -Ex: if 1000 ppl apply for a job and you only need to hire 50, even a test with low validity is useful: you can set the cut-off score really high and only hire people above the 90th percentile
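
To make the range-restriction problem concrete, here is a small simulation sketch (hypothetical numbers, not part of the original notes; Python with numpy assumed). It compares the test-criterion correlation in an unscreened applicant pool with the correlation among only the top scorers who were "hired".

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical applicants: test scores plus a criterion (e.g., job performance)
# constructed so its true correlation with the test is about .50
test = rng.normal(100, 15, n)
criterion = 0.5 * (test - 100) / 15 + rng.normal(0, np.sqrt(1 - 0.5 ** 2), n)

# Predictive-style estimate: unscreened sample, everyone included
r_full = np.corrcoef(test, criterion)[0, 1]

# Concurrent-style estimate: only "incumbents" who scored in the top 20%
hired = test >= np.percentile(test, 80)
r_restricted = np.corrcoef(test[hired], criterion[hired])[0, 1]

print(f"full-range r:       {r_full:.2f}")        # close to .50
print(f"range-restricted r: {r_restricted:.2f}")  # noticeably smaller
```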

coefficient alpha

-most widely used, most general form of internal consistency -represents the mean reliability coefficient one would obtain from all possible split halves -used to estimate how much scores might fluctuate due to measurement error; estimates the effects of unreliability on the correlation b/w tests and criteria of interest
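
As a concrete illustration (not from the original notes), coefficient alpha can be computed directly from a people-by-items score matrix; the sketch below uses hypothetical data and assumes numpy is available.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = test items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item test taken by 6 people (1-5 rating scale)
scores = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [3, 4, 4, 3, 3],
    [5, 5, 5, 4, 5],
    [1, 2, 1, 2, 1],
    [3, 3, 4, 3, 4],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```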

Multitrait-multimethod approach (a way to assess construct validity)

-use a number of methods to measure more than one trait or construct -investigate and compare the resulting correlations -solid triangles: different traits measured by the same method -broken triangles: different traits measured by different methods -best estimate of true relations, since no method bias is shared -convergent validity (underlined entries): multiple measures of the same construct converge to yield similar results -convergent validity is the first link to construct validity -same traits, different measures

Construct validity What is it? Methods of assessing (multitrait multimethod matrix and other methods)

-overall judgment as to whether or not a test measures what it purports to measure -based on all available evidence: -content -criterion-related -experiments -factor analysis (tells you how many dimensions there are) -multitrait-multimethod matrix analysis (tells you whether different measures of the same trait are relatively highly related to each other) -a test shows high CV if the pattern of relationships b/w test scores and behaviour measures is similar to the relationships expected from a perfect measure of the construct -the stronger the match b/w expected and actual correlations of the measure & behaviour, the stronger the evidence for CV

Which estimate(s) of reliability should I focus on?

-think about intended application(s) of test -if a test has many items that are all measuring the same thing --> internal consistency -if ppl respond to a clinical interview with rater 1 at 9 am Monday and rater 2 at 11 pm Tuesday --> inter-rater reliability

Alternate forms reliability = coefficient of equivalence

-time frame doesn't matter -common method of estimating: a. develop/acquire alternate (equivalent or parallel) versions of the same test b. administer both forms to the same ppl c. compute correlation b/w ppl's scores on the 2 forms of the test PROS: a. practical, can use the same ppl more than once b. backup copy of test is available if the original is compromised c. no carry-over effects since the two tests are different d. reduction of reactivity CONS: e. difficult, expensive to develop an alternate/parallel form f. not guaranteed that the 2 tests will actually be equivalent g. lack of equivalence = low reliability estimate

Internal consistency-most common

-to assess homogeneity/consistency of test items -affected by: a. correlation b/w items -if all items correlate, you'll get higher internal consistency b. number of items -with more items, you'll have higher internal consistency -if you only have 2 items you might get a high score on both by luck, but with 20 items there is less chance of getting lucky (so a high score is more indicative of the actual true score) -methods of estimating: a. split half (not ideal) -apply the Spearman-Brown statistical correction to the correlation -^tells you how strongly the test would correlate with itself if it were full length instead of half -easiest way to create 2 alternate forms is to split the test in half; carry-over and reactivity effects are minimized -but there are many unequal ways to split in half (best is usually to split into odd and even items) b. Cronbach's alpha (more common) -higher alpha if there are stronger correlations b/w scores on the items c. Kuder-Richardson 20 (KR20) -same as alpha, but a computational shortcut when the items are dichotomous
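
A minimal sketch of the odd-even split-half procedure with the Spearman-Brown correction mentioned above (hypothetical usage; numpy assumed, same people-by-items matrix shape as the alpha example):

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    """Odd-even split-half reliability with the Spearman-Brown correction.
    items: rows = respondents, columns = test items."""
    odd = items[:, 0::2].sum(axis=1)   # total score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # total score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown: estimate reliability of the full-length test
    return 2 * r_half / (1 + r_half)
```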

What are the different ways of assessing validity?

1. Content related evidence of validity (content validity) 2. Face validity 3. Criterion-related 4. Construct

What are the different ways of assessing reliability?

1. Test-retest= coefficient of stability 2. Alternate forms reliability= coefficient of equivalence 3. Internal Consistency 4. Inter-rater reliability

factors that affect reliability of test

1. mean inter item correlation (r) 2. length of test (K) 3. variance of true scores in sample

Pearson product-moment correlation coefficient (aka correlation coefficient) - 5 points

1. symbol: r for a sample, ρ (rho) for a population 2. ranges from -1 to 1 3. - = negative relationship; + = positive relationship 4. absolute value = strength of relationship; closer to 0 is weak, closer to 1 is strong 5. correlation doesn't imply causality
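
For reference (not from the original notes), the coefficient is the covariance of the two variables divided by the product of their standard deviations; a quick sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 1.9, 3.5, 4.0, 4.8])

# r = cov(x, y) / (sd_x * sd_y)
r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(f"r = {r:.2f}")  # matches np.corrcoef(x, y)[0, 1]
```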

Problems with test retest aka coefficient of stability

1. reactivity 2. impractical bc expensive & time consuming 3. carry over effects 4. test taking motivation 5. participant attrition

Multitrait multimethod suggests good test of construct if:

1. scores on the test are consistent with scores obtained from other measures of the same construct 2. the test yields scores that are not correlated w/ measures that are theoretically unrelated to the construct 3. the method of measurement employed by the test shows little evidence of bias

example of concurrent vs predictive criterion validity strategy

Concurrent: uses scores of screened sample to evaluate relationship b/w test and criterion Predictive: uses score of representative or unscreened sample to evaluate relationship b/w test and criterion

What's the goal of reliability theory?

Determine how much of variability in test scores is due to E and T -suggest ways to improve test so that E is minimized

discriminant validity

Is your test relatively uncorrelated with tests that it should be relatively uncorrelated with?

Problems for each method: test-retest, alternate forms, split half, internal consistency

Test-retest: reactivity, carry-over, true changes over time. Alternate forms: non-parallel forms, inconsistencies of test content, reactivity, carry-over, true changes over time. Split half: non-parallel halves, inconsistencies of test content. Internal consistency: inconsistencies of test content.

method bias

When two different constructs are measured using the same method, expect some correlation solely as a result of the common method of measurement. The multitrait-multimethod study allows us to estimate the effects of method bias.

Equation for classical test theory *from here down are from reading only

X = T + E -X = score obtained by an individual on the test (observed measurement) -T = individual's true score on the attribute (the true amount of the attribute in the individual) -E = an error score associated with the measurement (features of the individual or situation that have nothing to do with the attribute being measured) -across a large number of individuals, E is assumed to be random -no correlation b/w T & E
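
To tie this back to the earlier definition of reliability as the proportion of systematic variance, a small simulation under the X = T + E assumptions (hypothetical numbers, not from the notes) shows that var(T)/var(X) comes out at the intended reliability:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_scores = rng.normal(50, 10, n)  # T: true amount of the attribute
errors = rng.normal(0, 5, n)         # E: random error, uncorrelated with T
observed = true_scores + errors      # X = T + E

reliability = true_scores.var() / observed.var()
print(f"var(T)/var(X) = {reliability:.2f}")  # ~ 100 / (100 + 25) = .80
```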

What is the difference between reliability and validity? Is it possible for a test to be reliable but not valid? Is it possible for a test to be valid but not reliable?

Yes --> a measure of matching colours would produce consistent scores, but they aren't a valid measurement of intelligence No --> a valid test must have some level of consistency

construct

abstract attribute that scientists construct to describe concrete, observable entities or events -although they are hypothetical abstractions, all constructs are related to real/observable things ex: happiness, honesty -you cannot hold a construct in your hand but you can study its link to something observable how do you do this? 1. identify behaviours related to the construct 2. identify other constructs and decide if the two are related 3. identify behaviours related to the additional constructs

The bandwidth-fidelity dilemma

bandwidth = amount of info -similar to the concept of a test's content coverage fidelity = accuracy or precision of info -similar to the concept of reliability (lower reliability means that the score on the test might be quite different from the true trait level) there is a trade-off b/w bandwidth and fidelity, and likewise a trade-off b/w content coverage and reliability *you can make a test look highly reliable just by narrowing its content, so high reliability doesn't mean it is an excellent test (it may just be very narrow) think of the example about the 4 facets of risk taking

What levels of reliability are typical?

ex: Table 7-6 in the article, "Levels of reliability typically reported for different types of tests and measurement devices" -in the reading for Oct 4

Test retest= coefficient of stability

i. Administer the test to the same group of ppl at two separate points in time ii. Compute the correlation b/w scores from the two administrations of the test How long should the time lag be? -depends on whether you expect scores on the construct to be stable ex: GRE (fairly long time lag is fine) vs. an anxiety test (may change quickly) --> pick a time lag short enough that the true level of the trait hasn't changed; if it does change, you'll get a low correlation

range restriction

if you restrict the range of scores by only including, e.g., ppl who score 60 or above, then you lower the observed correlation (it looks weaker) -can correct for this (mentioned in the Tett article)

standardized alpha

measure of internal consistency -accounts for number of test items and mean inter-correlation among test items
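
The standardized alpha described here depends only on the number of items k and the mean inter-item correlation; a sketch of that relationship (formula assumed from standard psychometrics, example numbers hypothetical):

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized alpha from the number of items (k) and mean inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

print(f"{standardized_alpha(10, 0.25):.2f}")  # 10 items, mean r = .25 -> about .77
```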

length of test influence on reliability

the more items you have, the more the random errors cancel each other out (because errors are sometimes + and sometimes -) -true-score variance does not cancel out as the number of items increases -a typical strategy for increasing reliability is to increase test length

What is correction for attenuation? What's the formula, how do we use it?

r'xy = rxy / sqrt(rxx * ryy) (= .447 in the worked example) -estimate of the correlation you would expect between two imperfect measures if the effects of measurement error could be removed -measurement error decreases the correlation b/w the 2 tests, and thus validity is also decreased -provides an estimate of the correlation b/w perfectly reliable measures of X and Y -the formula can also be used to estimate the correlation b/w two tests if the reliability of either or both tests could be improved by a specific amount -allows us to estimate the effects of raising and lowering the reliability of X and Y on the correlation b/w X and Y Corrected r > 1.0? -the reliability plugged into the formula was not accurate -lower reliability means you're correcting the correlation more strongly (the correction isn't a good estimate if reliability is too low) -used in meta-analyses
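
The same formula as code, with hypothetical input values (the .447 in the notes comes from an unspecified class example, so different numbers are used here):

```python
from math import sqrt

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between X and Y if both were measured without error."""
    return r_xy / sqrt(r_xx * r_yy)

# Hypothetical values: observed r = .30, reliabilities .70 and .80
print(f"{correct_for_attenuation(0.30, 0.70, 0.80):.2f}")  # about .40
```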

mean inter item correlation influence on reliability

-reflects the reliability of single items on the test; each item can be considered a parallel mini-test -used in the calculation of standardized alpha (a measure of internal consistency)

criterion

something that is relevant to purpose of test, e.g. a measure of psychological adjustment; a measure of job performance, etc.

SEM

standard error of measurement -a measure of the precision of an observed test score -used to estimate the extent to which an observed score deviates from the true score -the standard deviation of a theoretically normal distribution of test scores around the true score -higher test reliability = lower SEM -used to form confidence intervals that tell you how accurate test scores are -SEM is a function of 2 factors: the reliability of the test and the variability of the test scores
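
Under the usual formula SEM = SD * sqrt(1 - reliability) (standard psychometrics, not spelled out in the notes), a sketch with hypothetical numbers showing a rough 95% confidence interval around an observed score:

```python
from math import sqrt

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: score SD times sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability = .90, observed score = 110
s = sem(15, 0.90)                                   # about 4.7
print(f"95% CI: {110 - 1.96 * s:.1f} to {110 + 1.96 * s:.1f}")
```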

reactivity --> makes test scores look unstable (an effect of the testing experience itself)

the first time you take a test, something about the experience changes you, so you react differently during the second test; now it looks like the test isn't stable -ex: by thinking about the question (if the question is about anxiety) you come to think you're more anxious than you actually are

utility theory

theory for combining information about criterion-related validity and the decision context (e.g., base rate, selection ratio) to estimate the gains or losses associated with using a test

Spearman brown prophecy formula

used to predict the effect of lengthening a test (what the new reliability estimate will be once we make the test longer) -can also be used to determine how much longer a test needs to be to obtain a desired reliability
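
The prophecy formula itself is r_new = k * r_old / (1 + (k - 1) * r_old), where k is the factor by which the test is lengthened; a minimal sketch with hypothetical values:

```python
def spearman_brown(r_old: float, k: float) -> float:
    """Predicted reliability after making the test k times as long."""
    return (k * r_old) / (1 + (k - 1) * r_old)

# Doubling a test whose current reliability is .60
print(f"{spearman_brown(0.60, 2):.2f}")  # about .75
```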

Two aspects of validity (validity is a function of what scores mean)

validity of measurement validity of decisions

