PSYCH 440- Exam 2
criterion referenced tests
designed to measure mastery of a set of concepts, not individual differences; items are organized hierarchically, so scores have limited/unequal variances and are not distributed normally
step 3: test tryout
determine which items are the best to use in the test; rule of thumb: initial tryout studies should use 5+ participants for each item in the test
standard error of difference (SED)
used to make 3 comparisons: same individual, same test; different individuals, same test; different individuals, different tests
assessing criterion validity
validity coefficient- typically a correlation coefficient between scores on the test and some criterion measure (rxy)
interpreting content validity ratio
values range from -1 to 1; negative value: fewer than half of the panel rate the item "essential"; 0: exactly half rate it "essential" (no agreement); positive value: more than half rate it "essential"
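Lawshe's formula behind these interpretations (standard form, not written out on this card): CVR = (ne − N/2) / (N/2), where ne = number of panelists rating the item "essential" and N = total number of panelists. Worked example: if 8 of 10 panelists rate an item essential, CVR = (8 − 5) / 5 = .60.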
true variance
variability between scores due to actual differences in standings on the construct that we are trying to measure
what is the fundamental idea of classical test theory?
we are getting observed scores, not an exact measurement, so we are estimating
reliability
stability or consistency of measurement
Thurstone scaling method
process designed for developing a 'true' interval scale for psychological constructs
5 stages of test development
1. conceptualization 2. construction 3. tryout 4. analysis 5. revision (as needed, with tryouts)
steps to content validity
1. content validity requires a precise definition of the construct being measured 2. domain sampling is used to determine behaviors that might represent that construct 3. determine the adequacy of domain sampling using Lawshe's Content Validity Ratio
test construction: scoring items
3 scoring options: cumulative, class/categorical, and ipsative (responses are compared with the person's own other responses)
item analysis: item indices
4 indices: item difficulty, reliability, validity, and discrimination
expectancy data
Additional information that can be used to help establish the criterion-related validity of a test. Usually displayed in an expectancy table that shows the likelihood that a test-taker who scores within some interval on the test will score within some interval on a criterion measure
true score confidence intervals
CI = observed score ± (z-score)(SEM); example: SEM = 15, a student earns a score of 55; the 95% CI is 55 − (1.96)(15) = 25.6 to 55 + (1.96)(15) = 84.4
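A minimal Python sketch of the same calculation, assuming SciPy is available for the z-value (1.96 could also simply be hard-coded):

```python
from scipy.stats import norm

def true_score_ci(observed, sem, confidence=0.95):
    """Confidence interval around an observed score: observed +/- z * SEM."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # 1.96 for a 95% interval
    return observed - z * sem, observed + z * sem

print(true_score_ci(55, 15))  # -> approximately (25.6, 84.4)
```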
interpreting validity coefficients
Can be severely impacted by restriction-of-range effects, which underestimate the validity of the measure; the whole range of performance is needed, because without it the test will not be effective in making predictions
Most test-takers score high on a particular test. What effect is most likely occurring?
Ceiling effect
homogeneity evidence
Do subscales correlate with the total score? Do individual items correlate with subscale or total scale scores? Do all of the items load onto a single factor using a factor analysis?
incremental validity
Does this test predict any additional variance beyond what has already been predicted by some other measure?
change evidence
If a construct is hypothesized to change over time (e.g., reading rate) or to stay stable (e.g., personality), test scores should show the corresponding stability or lack of stability (depending on your theory). Should the construct change after an intervention (e.g., psychotherapy, training)?
types of scale methods
Likert-type rating, Guttman
examples of concurrent validity
scores on a new depression inventory correlating with Beck Depression Inventory scores; SAT scores compared with high school GPA measured at the same time
example of incremental validity
a new predictor (IV) explains variance that does not overlap with what a previous predictor (IV) already explains
reliability coefficients
Numerical values obtained by statistical methods that describe reliability; they range from 0 to 1, and negative values are unlikely
reliability coefficient formula
r = σ²true / σ²total; total variance = true variance + error variance
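Worked example (hypothetical variances): if true variance = 80 and error variance = 20, total variance = 100 and r = 80 / 100 = .80.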
examples of predictive validity
SAT scores predicting college GPA; interests and values predicting job satisfaction; news reporters catching a car accident live on TV while talking about a dangerous intersection
SEM formula
SEM = σ × √(1 − rxx), where σ = standard deviation of test scores and rxx = reliability
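Worked example (hypothetical values): with σ = 15 and rxx = .91, SEM = 15 × √(1 − .91) = 15 × .30 = 4.5.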
inter-rater reliability
degree of agreement or consistency between 2+ scorers/judges/raters with regard to a particular measure
evidence for construct validity
The test is homogeneous, measuring a single construct; test scores increase or decrease as theoretically predicted; test scores vary by group as predicted by theory; test scores correlate with other measures as predicted by theory
classical test theory
X = T + E Observed score = True score + Error
criterion-related validity
a good criterion is generally: relevant (measures something we care about) and uncontaminated (not itself influenced by the predictor measure)
power test
a test that varies in item difficulty; most test-takers will complete the test but not get all the answers correct; can use all regular reliability coefficients
speed test
a test where the items are equal in difficulty and there is a time limit; most people will not finish the test, but will get the answers they attempt correct; reliability must be based on multiple administrations: test-retest, alternate forms, or split-half (with separately timed halves)
what are the ways to adjust group scores based on a biased test?
addition of points; differential scoring of items; elimination of items based on differential item functioning; differential cutoffs; separate lists; within-group norming; banding; preference policies
what is reliability not concerned with?
all things validity: whether we are measuring what we intended to measure, the appropriateness of how we use the information, and test bias
generalizability theory
alternative view of classical test theory that suggests a test's reliability is a function of circumstances in which the test is developed, administered, and interpreted
administration error
anything that occurs during the administration of the test that could affect performance; different types: environmental, test-taker, and examiner factors
if 5% of the population has depression, then 5% would be the _____ rate of depression for that population.
base rate
Why might ability test scores among test takers most typically vary?
because of the true ability of the test-taker and irrelevant, unwanted influences
What makes a biased test different than an unfair test?
a biased test has something to do with how the test was constructed; fairness has more to do with personal and cultural values
Cronbach's alpha
calculated as the mean of all possible split-half correlations; ranges from 0 to 1, where values closer to 1 indicate greater reliability
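A minimal sketch of how alpha is typically computed from raw item scores, using the equivalent variance-based formula α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores); assumes NumPy and a hypothetical respondents-by-items array:

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array of scores, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```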
coefficient of equivalence
component of parallel or alternate forms reliability, calculated by correlating two forms of a test; sources of error that impact this: motivation and fatigue, events that happen between the two administrations, item selection, and constructs that are highly influenced by practice effects
coefficient of stability
component of test-retest reliability, calculated by correlating the two sets of test results; sources of error that impact this: stability of the construct, time between administrations, practice effects, and fatigue effects
construct validity
process of determining the appropriateness of inferences drawn from test scores as measures of a construct; serves as the umbrella concept connecting the other types of validity
item difficulty index
used when developing a test that has "correct" and "incorrect" answers; "correct" items should be based on differential standings on the attribute (those who are high on it); the index is the proportion of the total number of test-takers who got the item right (p = .80 means 80% got the item correct); item-total correlation: the simple correlation between the score on an item and total test scores; advantage: can test statistical significance and interpret the % of variability the item accounts for
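A minimal sketch of computing item difficulty (p) and item-total correlations from a hypothetical 0/1 response matrix, assuming NumPy (rows = test-takers, columns = items):

```python
import numpy as np

responses = np.array([[1, 1, 0, 1],   # hypothetical scored responses, 1 = correct
                      [1, 0, 0, 1],
                      [1, 1, 1, 1],
                      [0, 1, 0, 1],
                      [1, 1, 0, 0]])

p = responses.mean(axis=0)            # item difficulty: proportion who got each item right
totals = responses.sum(axis=1)        # total test score for each test-taker
item_total_r = [np.corrcoef(responses[:, j], totals)[0, 1]
                for j in range(responses.shape[1])]
```

In practice the item is often removed from the total before correlating (a corrected item-total correlation); the simple version above matches the definition on the card.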
Kuder-Richardson
two types were developed: KR-20 and KR-21; the statistic of choice with dichotomous items; KR-20 yields a lower estimate of r when items are heterogeneous
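For reference, the standard KR-20 formula (not shown on the card): KR-20 = (k / (k − 1)) × (1 − Σ pq / σ²total), where k = number of items, p = proportion passing each item, q = 1 − p, and σ²total = variance of total scores; KR-21 is a shortcut that assumes all items are equally difficult.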
item validity index
does item measure what it is purported to measure? evaluated using latent trait theory and confirmatory factor analysis
convergent validity
does our measure highly correlate with other tests designed for, or assumed to measure, the same construct? the other tests don't have to measure the exact same construct; similar constructs are okay; example: the measures are supposed to be correlated, and they are correlated
face validity
does the test look like it measures what it's supposed to? has to do with judgments of the TEST TAKER, not the test user; should consider how people will respond if they know the content of the test, since high face validity can allow deception; the biggest issue is with credibility
Messick's validity
evidential: face, content, criterion, construct, relevance and utility; consequential: appropriateness of use, determined by the consequences of use
base rate
extent to which a particular trait, behavior, characteristic, or attribute exists in the population; the lower the base rate, the harder it is to predict; ex: when the next meteor hits earth
test tryout: faking
faking is an issue with attitude measures; faking good = positive self-presentation; faking bad = trying to create a less favorable impression; corrections: lie scales, social desirability scales, fake good/bad scales, infrequent-response items, total score corrections based on scores obtained from measures of faking, and using measures with low face validity
generalizability vs true score theory
generalizability theory attempts to describe the facets across which scores generalize; true score theory doesn't identify the effects of different characteristics (facets) on the observed score
step 4: item analysis
good test items are reliable and valid and help discriminate between test-takers on the attribute being measured; item analysis is used to differentiate good items from bad ones
test tryout: guessing
guessing- only an issue for tests where a "correct answer" exists methods: -verbal or written instructions that discourage guessing -penalties for incorrect answers -not counting omitted answers as incorrect -ignoring the issue
SEM and reliability
has an inverse relationship with reliability; observed scores can be assumed to be normally distributed around the true score with a standard deviation equal to the SEM; low SEM = more precision (and vice versa); a reliable test = low SEM
validity
how well a test measures what it is supposed to measure
homogeneous test construction
if a test measures only one construct
impact of facets on test scores
if all facets are the same across administrations, we would expect the same score each time; if facets vary, we would expect scores to vary
adding items to a test will do what to the reliability of the test?
increase the reliability
item reliability index
indication of the internal consistency of the scale; uses factor analysis so that weak items can be discarded or revised
test construction error
item or content sampling: differences in wording or how content is selected; can be produced by variation in items within a test or between different tests; may have to do with how and what behaviors are sampled
Guttman scale
items range from weaker to stronger expressions of the variable being measured; when one item is endorsed, the less extreme positions are also endorsed; produces an ordinal scale
content validity
judgment of how adequately a test samples behavior representative of the behavior that it was designed to sample
examples of environmental factors
lighting, noise, temperature, how comfortable the chair is
For a heterogeneous test, measures of internal-consistency reliability will tend to be ________ compared with other methods of estimating reliability.
lower, because items measuring different constructs will not correlate as highly with each other, producing a lower internal-consistency estimate
how can you improve a test's reliability to get the "truest scores"?
maximize true variance and minimize error variance
internal consistency
measure of consistency within a test (how well do the items "hang together" or correlate with each other?); 3 ways to measure: split-half, Kuder-Richardson, and Cronbach's alpha
psychological traits
measures of things that are context dependent or change quickly from one moment to another
discriminant validity
measures shouldn't correlate with measures of dissimilar constructs; it is a problem when a measure accidentally correlates highly with other measures/variables it shouldn't; ex: the measures shouldn't correlate, and they don't
false positive
a type of miss: the tool of assessment indicates that the test-taker possesses or exhibits a particular trait, ability, behavior, or attribute when in fact the test-taker does not; ex: the test comes back positive, but the test-taker does not actually have the condition
step 5: test revision
modifying test stimuli or administration on the basis of quantitative/qualitative item analysis; the goal is to balance strengths and weaknesses for the intended use and population of the test; cross-validation: re-establishing the reliability and validity of the test with other samples; conducted by a developer with an interest in that area, using a different sample than the one used in development; item fairness: an item is unfair when it favors one particular group of examinees in relation to another
examples of test-taker variables
mood, alertness, errors in entering answers
what are the 3 types of selected-response format items?
multiple-choice, matching, and binary choice (T/F)
item discrimination explanations
no discrimination: experts and novices perform the same (discrimination value of 0); discriminates in the expected direction: experts do better than novices (positive discrimination value); discriminates in the expected direction when the intention is for novices to do better than experts (negative discrimination value)
examples of facets
number of test items, training of test scorers, purpose of the test administration
examples of examiner-related variables
physical appearance, demeanor, nonverbal cues
true variance and reliability
positive, linear relationship; 0 reliability means you are measuring only error; perfect reliability = 1.0; as reliability increases, you measure more of what you want to measure
true score estimates
the observed score is the best estimate of the true score; e.g., if the observed score is 100, the best estimate of the true score is also 100
SED formulas
SED = √(SEM1² + SEM2²); equivalently, SED = σ × √(2 − r1 − r2); SED > SEM because it accounts for two sources of error
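Worked example (hypothetical values): with σ = 15 and r1 = r2 = .91, each SEM = 4.5, so SED = √(4.5² + 4.5²) = 15 × √(2 − .91 − .91) ≈ 6.4, noticeably larger than either SEM.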
what type of reliability estimate is the most expensive to obtain?
parallel forms, because you would need to develop and validate two complete versions of the test
error variance
part of the observed score that is due to irrelevant, random sources
error
part of the score that deviates from the true score
hit rate
proportion of people accurately identified as possessing or exhibiting some characteristic; want a high hit rate (correct outcomes)
miss rate
proportion of people the test fails to correctly identify as having or not having a particular characteristic; this is what you are concerned about, so you want a low miss rate; 2 types: false negatives and false positives
selection ratio
proportion of the population that is selected; as validity increases, selection decisions improve over the base rate; with a small selection ratio, even a small increase in validity will help
intentional admin error
responding to a question without listening to it (example: obscure answers on Family Feud)
test construction: writing items
rule of thumb: write 2x as many items for each construct as will be needed for the final test version; the item pool is a reservoir of potential items that may or may not be used on the test; good content validity = items that represent that item pool
step 2: test construction
scaling is the process of selecting rules for assigning numbers to measurements of varying amounts of some trait, attribute, or characteristic; includes rating scales
test construction: choosing item type
selected-response items: less time to answer, breadth of knowledge; constructed-response items: time consuming to answer, depth of knowledge; scoring of selected-response items is more objective and reliable than scoring of constructed-response items
error and reliability
sources of error help us determine which reliability estimate to choose
step 1: test conceptualization
stimulated by societal trends, personal experience, a gap in the literature, or the need for a new tool to assess a construct; for a norm-referenced test, a good item is one that people who scored high on the test tended to get right and people who scored low tended to get wrong
scoring and interpretation error
subjectivity of scoring is a source of error variance; more likely a problem with non-objective tests: essay tests, behavioral observations, and computer scoring errors
method of paired comparisons
the test-taker is presented with 2 test stimuli and asked to make some sort of comparison
Likert-type rating scale
the test-taker is presented with 5 alternative responses on some continuum, with the extremes assigned scores of 1 and 5; Likert scales are generally reliable, result in ordinal-level data, and are summative
sorting tasks
test-takers are asked to order stimuli on the basis of some rule; categorical: stimuli are placed into categories; comparative: stimuli are placed in an order
construct validation
takes place when an investigator believes that their instrument reflects a particular construct, to which certain meanings are attached; the proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim
sources of error
test construction, test administration, test scoring and interpretation (error can come from all 3)
split-half reliability
test items are split in half and then the scores are correlated; uses Spearman-Brown
false negative
test predicted that the test-taker did not possess the particular characteristic or attribute being measured when the test-taker actually did; ex: a COVID/pregnancy test is negative, but symptoms persist
types of reliability
test-retest; parallel or alternate forms; inter-rater; split-half and other internal consistency measures
concurrent validity
the degree to which a test score is related to a criterion measure obtained at the same time; compare your measure to an already established measure
predictive validity
the degree to which a test score predicts scores on some future criterion; the strongest form of validity
test-retest reliability
the same test is given twice to the same group at different times
Spearman-Brown
used to estimate the reliability of a test that is shortened or lengthened; reliability increases as the length of a test increases; rSB = n × rxy / [1 + (n − 1) × rxy], where n is the factor by which the test length is changed
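Worked example (hypothetical values): doubling a test (n = 2) whose current reliability is rxy = .70 gives rSB = (2 × .70) / [1 + (2 − 1) × .70] = 1.40 / 1.70 ≈ .82.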
true score
the true standing on some construct
multitrait-multimethod matrix
tool used to evaluate both convergent and discriminant evidence, along with factor analysis; multitrait: must include two or more traits in the analysis; multimethod: must include two or more methods of measuring each construct
criterion vs norm referenced
traditional reliability methods are used with norm-referenced tests; criterion-referenced tests tend to reflect material that is mastered hierarchically, which reduces the variability of scores and lowers reliability estimates
dynamic traits
traits that change slowly over time; examples: developmental changes, effects of experience
static traits
traits that do not change over time
parallel or alternate forms
two different versions of a test that measure the same construct; the two differ in that parallel forms have equal means and variances, while alternate forms need not
item discrimination index
used to compare performance on a particular item with performance in the upper and lower regions of the distribution of continuous test scores; symbolized by 'd': compares the proportion of high scorers getting the item "correct" with the proportion of low scorers getting the item "correct"; method 2: choose items on the basis of discrimination between groups (contrasted groups, empirically keyed scales, and actuarial designs); items can be ordered from most difficult to least difficult (lowest to highest index number)
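Worked example (hypothetical values): if 90% of the upper-scoring group and 40% of the lower-scoring group answer an item correctly, d = .90 − .40 = .50; d = 0 would mean no discrimination, and a negative d would mean low scorers outperform high scorers on that item.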
standard error of measurement (SEM)
used to estimate the extent to which an observed score deviates from the true score
heterogeneous test construction
when a test measures multiple constructs
restriction of range
when there is not a full range of scores, reliability coefficients may not reflect the true population coefficient; influenced by sampling procedures and test difficulty