Ch. 4 (Reliability) & Ch. 10 (Validity)
Internal Consistency
1 test, multiple items; split-half or internal-consistency reliability; Cronbach's Alpha
Standard Error of Measurement answers
"How close is this test result to the true result"? (How accurate is it?)
Observer Differences
1 test w/ 2+ observers; inter-observer (inter-rater) reliability; Cohen's Kappa
Confidence Interval answers
"how likely is a true score to fall within a range"? (e.g., SAT test scores)
Cohen's Kappa ranges:
Range: -1 to 1. "Poor" < .40; "good" .40 to .75; "excellent" > .75
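A minimal Python sketch (not from the original notes; the ratings are made-up) of how Cohen's kappa is computed for two observers coding the same behaviors:

# Cohen's kappa for two observers rating the same cases (illustrative, made-up data).
from collections import Counter

rater_a = ["hit", "hit", "punch", "hit", "punch", "hit"]
rater_b = ["hit", "punch", "punch", "hit", "punch", "hit"]

n = len(rater_a)
observed_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: for each category, multiply the proportion of times each rater used it.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
chance_agreement = sum((counts_a[c] / n) * (counts_b[c] / n)
                       for c in set(rater_a) | set(rater_b))

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(kappa, 2))  # 1 = perfect agreement, 0 = chance level, < 0 = worse than chance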
Split-half method
The results of one half of the test are compared w/ the results of the other half (e.g., correlating the subscore for the odd-numbered items with the subscore for the even-numbered items)
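A minimal Python sketch of the split-half idea, using a made-up matrix of 0/1-scored items; the odd-item and even-item subscores are correlated to give the split-half estimate:

# Split-half reliability estimate from an item-score matrix (illustrative, made-up data).
import numpy as np

items = np.array([  # rows = examinees, columns = items 1..8 (1 = correct, 0 = incorrect)
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 1],
])

odd_score = items[:, 0::2].sum(axis=1)   # subscore for items 1, 3, 5, 7
even_score = items[:, 1::2].sum(axis=1)  # subscore for items 2, 4, 6, 8

split_half_r = np.corrcoef(odd_score, even_score)[0, 1]
print(round(split_half_r, 2))  # underestimates full-test reliability (see the split-half "Problem" card)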
Time Sampling
1 test given twice; test-retest reliability; correlation between the two sets of scores
4 approaches to Construct Validation Studies
1) Correlation between a measure of the construct and a designated criterion measure (e.g., the correlation between an aptitude test and a measure of job performance, the construct) 2) Differentiation between groups 3) Factor Analysis: take n measurements on the same examinee and run an n x n F.A. 4) Multitrait-Multimethod Matrix
Three main types of validation studies (Validity Taxonomy)
1. Content Validation 2. Criterion-Related Validation 3. Construct Validation
Steps to Content Validation
1. Define the performance domain of interest 2. Select a panel of qualified experts in the content domain 3. Provide a structured framework for the process of matching items to the performance domain 4. Collect and summarize the data from the matching process
Item Sampling
2 different tests, each given once; parallel (alternate) forms reliability; correlation between scores on the two forms
20% of participants selected and 90% success rate from those selected
1.0 minus the reliability coefficient
= the proportion of random error (e.g., r = .40 means 40% of the variation in scores is due to variation in the "true" score; subtract this from 1.0 and the remaining 60% of the variation is due to random error)
Focus Test methods:
Ad-hoc (informal): judge the face validity of items and remove items that don't fit. Statistical: Factor Analysis and Discriminability Analysis.
Reliability can be improved through ________, _______ and _______, and _______.
Ad-hoc methods; factor analysis and discriminant analysis; and statistically (e.g., correcting for attenuation)
Test-retest reliability
Administer the same test across some time period and compute the correlation between the 2 administrations (only appropriate for measuring stable traits - e.g., NOT the Rorschach inkblot test)
Estimates of Alpha:
Alpha of .70-.80 = borderline; alpha of .80 = OK; alpha of .90 and higher = good
Charles Spearman (1904)
Article - "The proof of measurement of association between two things" & invented Rho (correlation for ordinal variables)
Pearson, Spearman, Thorndike (1900-1907)
Basic reliability theory
Evidence based on relations to other variables
Can the test usefully predict a certain type of outcome, the Criterion? (e.g., can a GRE score predict GPA as a grad student?)
Test-retest problem:
Carryover effect - when 1st testing session influences scores from 2nd session and changes the True score (e.g., participant remembers their answers from the 1st test)
Multi-trait, multi-method evidence
Combination of convergent and discriminant evidence
Reliability of difference scores
It is common to take the difference between 2 scores. Problem: taking the difference dramatically reduces reliability (e.g., for 2 tests w/ reliabilities .90 and .70 that correlate .70 with each other, the difference score has a reliability of only .33).
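A hedged Python sketch using the standard classical-test-theory formula for the reliability of a difference score (the formula itself is not spelled out in these notes); plugging in the card's numbers reproduces the .33 value:

# Reliability of a difference score: (mean of the two reliabilities minus their correlation) / (1 - correlation).
def difference_score_reliability(r_xx, r_yy, r_xy):
    # r_xx, r_yy = reliabilities of the two tests; r_xy = correlation between the tests
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)

# Example from the card: reliabilities .90 and .70, tests correlated .70.
print(round(difference_score_reliability(0.90, 0.70, 0.70), 2))  # 0.33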
Confidence interval formula
Confidence Interval = Z * SEM
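A worked Python sketch with made-up numbers; the SEM value here is assumed (the SEM formula itself appears later in these notes):

# 95% confidence interval around an observed score (illustrative numbers).
Z = 1.96          # Z score for a 95% confidence level
SEM = 30          # assumed standard error of measurement for this test
observed = 520    # examinee's observed score

half_width = Z * SEM
print(f"95% CI: {observed - half_width:.0f} to {observed + half_width:.0f}")  # 461 to 579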
Inter-rater reliability
Consistency among different judges who are evaluating the same behavior. Observer (rater) errors occur (e.g., was that a "hit" or a "punch"?)
Reliability
Consistency of measurement
Reporting Criterion-related studies
Continuous variables: Pearson product-moment correlation (PPMC) coefficient, called a validity coefficient. Categorical variables: Phi coefficient.
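A Python sketch with made-up data showing both statistics: a Pearson validity coefficient for a continuous criterion and a phi coefficient for a 2x2 categorical table:

# Validity coefficients (illustrative, made-up data).
import numpy as np

# Continuous case: Pearson product-moment correlation between test scores and a criterion.
test_scores = np.array([45, 52, 38, 60, 49, 55])
job_ratings = np.array([3.1, 3.8, 2.9, 4.5, 3.5, 4.0])
validity_coefficient = np.corrcoef(test_scores, job_ratings)[0, 1]

# Categorical case: phi from a 2x2 table of pass/fail test vs. success/failure on the criterion.
a, b, c, d = 30, 10, 15, 45   # cell counts of the 2x2 table [[a, b], [c, d]]
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(round(validity_coefficient, 2), round(phi, 2))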
Most reliability measures are...
Correlation Coefficients
Split-halves are supplanted by...
Cronbach's Alpha
Test Content Evidence
Define the performance domain; item review by an expert panel; reliability/validity study (absolute vs. relative comparison)
Measurement Error (E)
Difference between the observed score and the true score: E = X - T. Equivalently, Observed score = True score + Error: X = T + E
Evidence based on internal structure
Discovery, definition, and discrimination of the psychological construct theorized to underlie the observed variation among a sample of observed test scores. (linear and non-linear factor analysis)
Errors of measurement
Discrepancy between true ability and measurement of ability
Classical test score theory
Each person has a True score that would be obtained if there were no errors in measurement
Assumption in classical test theory
Errors are random; a person's True score does not vary across measurements; and the distribution of errors is normal with a Mean of zero
Cronbach's Alpha
Estimates a lower bound for reliability; the actual reliability may still be higher (a high alpha confirms that a test is reliable, but a low alpha can't confirm that a test is unreliable)
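A minimal Python sketch of the usual alpha formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), using a made-up item matrix:

# Cronbach's alpha from an item-score matrix (illustrative, made-up data).
import numpy as np

items = np.array([  # rows = examinees, columns = items (e.g., 1-5 Likert responses)
    [4, 5, 4, 5],
    [2, 3, 3, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))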
Specific forms of validity evidence
Evidence based on test content or consequences of test Evidence regarding response processes, internal structure, or relationships with other variables
Construct
Something that exists but can't be directly measured (e.g., IQ)
How much to increase N?
Formulas used to determine how much to increase N (test length) to reach a certain level of reliability: Nd = rd(1 - ro) / [ro(1 - rd)], where Nd = the factor by which the test must be lengthened (the new N as a multiple of the old), rd = desired level of reliability, ro = observed level of reliability
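A Python sketch applying the formula; the 20-item test length and the target reliability are made-up numbers:

# How many times longer must a test be to reach a desired reliability?
def lengthening_factor(r_desired, r_observed):
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

factor = lengthening_factor(0.90, 0.70)          # observed r = .70, desired r = .90
print(round(factor, 1), "x longer ->", round(20 * factor), "items")  # 3.9 x longer -> 77 items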
Internal Consistency Reliability
Give single test, calculate internal consistency of various subsets of items
Steps to designing a criterion-related validation study
1. Identify a suitable criterion behavior and a method for measuring it 2. Identify an appropriate sample 3. Administer the test and record scores 4. When criterion data are available, obtain a measure of performance for each examinee 5. Determine the strength of the relationship between test scores and criterion performance
Ways to increase reliability
Increase N (# of items, tests); focus on a single characteristic; use Factor Analysis to determine the sub-characteristics of a single test; use Item Analysis to find items that measure a single characteristic; statistically correct for attenuation
Drasgow et al (late 1990s)
Item Response Theory (IRT)
Bartholomew & Knott (1990s)
Latent variable theory
Sources of Error (discrepancy between True score and Observed score):
Measurement errors; changes in the True score (e.g., situational factors: the room is too hot, the test taker was depressed, etc.)
Kappa definition
Method for assessing the level of agreement among several observers. Values range from -1 (less agreement than expected by chance alone) to 1 (perfect agreement).
Evidence based on response processes
More than one type of response (Art, showing work in math) or developmental/theoretical change (age differentiation, intervention based)
Problem of Domain Sampling:
There is no way to measure the True score because there is no way to administer every possible item from the domain
Result of carryover effect
Overestimates the reliability; only a concern when the carryover changes are random rather than systematic (a systematic change, e.g., everyone's score improving by 5 points, does not affect the correlation)
Samuel George Morton
Polygenism (the view that humans are composed of different species); craniometry; biological determinism; "scientific racism"; ~50 yrs. before Spearman's work
Convergent Evidence
Positive correlations with similar constructs (variables)
Karl Pearson (1896)
Product-moment correlation (for continuous variables)
r =
Reliability Coefficient of test
Kuder & Richardson (1937), Cronbach (1989)
Reliability coefficients
Focus Test
Reliability increases the more a test focuses on a single concept or characteristic; capturing multiple concepts in a single test reduces reliability
When reliability is known, we get _____, and from _____ we get ________.
SEM; SEM; Confidence Intervals
Standard Error of Measurement formula:
SEM = S * sqrt(1 - r), where S = the standard deviation of observed scores and r = the reliability coefficient
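A worked Python sketch with made-up values; the same S and r yield the SEM assumed in the confidence-interval sketch earlier in these notes:

# Standard error of measurement from the test SD and reliability (illustrative numbers).
import math

S = 100    # standard deviation of observed scores on the test
r = 0.91   # reliability coefficient of the test

SEM = S * math.sqrt(1 - r)
print(round(SEM, 1))  # 30.0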
Solution of Domain Sampling:
Sample a limited subset of items by creating 1 or more tests; scores from such tests will distribute normally around the true score. The correlation between 2 such tests will be lower than the correlation between 1 test and the True score.
Concurrent Validity
The relationship between test scores and criterion measurements made at the time the test was given (e.g., a pilot's written test and a concurrent performance test)
Observed score (X)
Score as measured by a test (different from a person's true ability)
Two pre-requisites for ethical test use
Scores need to be reliable and inferences valid
Observational data differs from...
Self-report data
Standard error of measurement for Classical test theory
The standard deviation of the distribution of errors (of observed scores around the true score)
3 ways to estimate test reliability
Test-retest method, Parallel Forms, & Internal Consistency
True score (T)
The actual score that exists
Predictive Validity
The degree to which test scores predict criterion measurements that will be made at some point in the future (GRE test and Grad School GPA)
Reliability (Correlation) Coefficient definition
The ratio of the variance of the True scores (T) to the variance of the Observed scores (X); interpreted as the percentage of observed-score variance that is due to true-score variance
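A small simulation sketch (made-up parameters) illustrating this definition under the classical model X = T + E:

# Reliability as var(T) / var(X), checked by simulation (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=10_000)   # T: true scores
errors = rng.normal(0, 10, size=10_000)          # E: random errors, mean zero
observed = true_scores + errors                  # X = T + E

reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # ~ 15**2 / (15**2 + 10**2) = 225 / 325 = 0.69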
Interpreting validity coefficients
The square of the correlation coefficient is called the coefficient of determination (e.g., r = .60 gives CoD = .36, so 36% of the variance in job performance is related to variance in performance on the predictor test)
How to conceptualize error?
To understand complex traits, psychologists use reliable tests.
Practice effect
Type of carryover effect; some skills improve w/ practice
Domain Sampling: Correlation of any 2 random sample tests
Unbiased estimate of "true" reliability
Problem with Split-half method
Underestimates reliability b/c each subtest is only 1/2 as long as the full test; tests with fewer items are less reliable than tests with more items
Discriminant Evidence
Zero (or near zero) correlations with dissimilar constructs (variables)
SEM
Used to build a confidence interval around our observed score within which the true score is likely to fall
Standard deviation of the distribution of error tells us...
about the magnitude of measurement error
Parallel forms reliability
administer 2 versions of the test to same subjects (often on the same day), compute correlation between 2 administrations
As the sample gets larger...
becomes a more accurate and reliable representation of the domain
SEE (Standard Error of Estimate)
Given the reliability of our test, how well does a score predict performance on another test?
The Standard Error of Measurement tells us (on average)...
how much a score varies from the True score
Dispersion around the true score tells us...
how much error there is in the measure
Test validation is a process that includes
integration into assessment design and the gathering of other evidence, including convergent and discriminant evidence
Z score associated with...
the percentage (confidence level) chosen for the range (e.g., Z = 1.96 for a 95% confidence interval)
Pros & Cons of Parallel Forms
Pro: the most rigorous method of determining reliability. Con: difficult to do, so it is not often done.
Reliability r =
r = .70 or .80 is "good enough" for most research; r > .90 indicates high reliability
There is more than one type of _________, to measure it we need to specify _____________ comes from
reliability; where the measurement error
Domain Sampling Model
taking a sample from a domain to determine your "true" score
Use of test scores
The test user must be able to justify the inferences drawn by having a cogent rationale for using the test score for that purpose as opposed to other available testing procedures
Criterion-related validation
test user desires to draw an inference from examinee's test score to performance on some real behavioral variable of practical importance
Content Validation
test user desires to draw an inference from the examinee's test score to a larger domain of items similar to those on the test
Construct validation
test user desires to draw an inference from the test score to performance that can be grouped under the label of a particular psychological construct "No criterion or universe of content is accepted as truly adequate to define the quality to be measured" (Cronbach and Meehl, 1955)
Reliability is measured by correlating...
the Observed score with the True score
Error implies...
the component of the observed test score that does not have to do with the test taker's ability
Brook's definition of Validity
the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests
Cronbach's definition of validation
the process by which a test developer or user collects evidence to support the types of inferences that are drawn from the test scores
If a test is ______, it is irrelevant whether or not it is ______. _______ is a foundation
unreliable; valid; reliability
If we know the Reliability (r) of the test...
we can estimate the likely range of true values