Tests and Measures - Test1
Galton
"Applied Darwinist" -"Founder of mental tests and measurements" responsible for launching the testing movement -He believed that tests of sensory discrimination could serve as a means of gauging a person's intellect -Anthropometric laboratory - 1st large scale systematic collection of data on individual differences
Reliability coefficient
(Rxx) Ratio of true-score variance to the total variance of test scores: Rxx = σ²T / (σ²T + σ²E), where σ²T = variance of the true scores and σ²E = variance of error -(variance of true score) / (variance of true score + variance of error) -Minimal measurement error produces more consistent, reliable scores (reliability coefficient closer to 1) -More error = less reliability (coefficient closer to 0)
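A quick Python sketch of the ratio (the variance values are hypothetical, for illustration only):

```python
# Reliability as the ratio of true-score variance to total score variance.
var_true = 80.0   # hypothetical σ²T: variance of true scores
var_error = 20.0  # hypothetical σ²E: variance of measurement error

r_xx = var_true / (var_true + var_error)
print(r_xx)  # 0.8 -> 80% of score variance reflects true differences
```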
Multitrait-multimethod matrix
(see pg. 171 in book) Examines three different methods and three different traits - ex. methods: self-report, parent-report, teacher-report; traits: depression, anxiety, ADHD. The same trait should correlate with itself across methods but should not correlate with the other traits.
How is construct-related evidence evaluated?
*Convergent validity - correlate the test with existing tests measuring the same or similar constructs *Discriminant validity - correlate the test with existing tests measuring dissimilar constructs *Multitrait-multimethod matrix *Evaluation of internal structure such as factor analysis *Evaluation of response processes engaged in by examinees
Construct-related evidence
- Evidence based on response processes - Evidence based on internal structure (factor analysis) - Evidence based on consequences of testing - intended and unintended - Convergent & discriminant validity - Multitrait-multimethod matrix
Ordinal scale
- rank individuals or objects -but can't say anything about the size of the difference between ranks -ex. Likert scale *Yes magnitude, no equal intervals, no abs zero
Normal curve
-68% of scores fall within 1 standard deviation of the mean (34% on each side) -approximately 95% within 2 SD of the mean (add 13.5% to each side)
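These percentages can be checked with Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, SD 1
for k in (1, 2, 3):
    coverage = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {coverage:.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```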
Contributions to measurement error
-Time sampling error - error that comes from giving the test at different times -Item (or content) sampling error - comes from including items that don't correctly measure the construct -Test administration - not following guidelines perfectly -Test scoring -Systematic errors of measurement - the test consistently measures something other than what it is supposed to measure (not random)
Ratio scale
-absolute zero -ex. speedometer, height, weight *Yes magnitude, yes equal intervals, yes abs zero
Interval scale
-differences make sense, but ratios do not (there is no ratio without an absolute zero) -ex. temperatures, dates, etc. *Yes magnitude, yes equal intervals, no abs zero
Evidence based on response processes
-Do respondents view and understand the items, instructions, and instrument in the same way, and respond using the anticipated processes? -Analysis of the fit between the actions examinees actually engage in and the construct being assessed
Nominal scale
-labeling -naming objects -often arbitrary - no meaning *No magnitude, no equal intervals, and no abs zero
How is criterion-related evidence evaluated?
-review the subject population in the validity study -examine what the criterion really means (is it valid and reliable?) -never confuse the criterion with the predictor (criterion contamination) - ex. GRE scores are used to predict success in grad school; some schools admit low scorers and then require a higher GRE score before graduation. Problem: low GRE scorers succeed -watch for third-variable effects -review evidence for validity generalization -consider differential prediction - ex. a test may be a good predictor for PhD students but not master's students
Factors that can affect reliability
-test characteristics (ex. true/false items are more reliable than Likert items because there is less variability) -sample characteristics (ex. sample size) -extent of the test's clarity
Evidence based on consequences of testing
-What are the expected and unexpected results, or consequences, of measurement?
Assumptions of classical test theory
1. Each person has a true score that would be obtained if there were no errors in measurement 2. Measurement errors are random (some will have positive effects and some will have negative, so over time the mean will be zero) 3. The mean error of measurement equals zero 4. True scores and error scores are uncorrelated 5. Errors on different tests are uncorrelated -Main point: if errors are unsystematic they are random. If they are systematic, we have a problem.
Psychological test
An objective and standardized measure of a sample of behavior
Validity
Appropriateness or accuracy of the interpretation of test scores - validity is a characteristic of test scores, not of the test itself. Does the test measure what it was designed to measure?
Difference between test and assessment
Assessment - judgmental process that includes integration of multiple sources of information, including tests Test - single device or procedure
Interscorer reliability and major source of error
Consistency of rater judgment -Problem: subjectivity Error: differences due to raters/scorers
Spearman-brown
Estimates the reliability of the whole test when only half of it was administered (e.g., correcting a split-half coefficient). More generally, estimates the effect that changing the number of test items (adding or removing items) will have on reliability; see the sketch below.
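A sketch of the prophecy formula, where n is the factor by which test length changes (n = 2 projects half-test reliability to full length):

```python
def spearman_brown(r, n):
    """Projected reliability when test length is multiplied by n."""
    return (n * r) / (1 + (n - 1) * r)

# Split-half reliability of .70, projected to the full (doubled) test:
print(round(spearman_brown(0.70, 2), 3))    # 0.824
# Halving that full-length test (n = 0.5) takes it back down:
print(round(spearman_brown(0.824, 0.5), 3)) # ~0.701
```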
Test-retest reliability and major sources of error
Evaluates error associated with administering a test at two different times -generally used to evaluate stable traits Test-retest error: time sampling
Alternate-forms reliability and major sources of error
Evaluates the error associated with selecting a particular set of items Error in simultaneous administration: content sampling Error in delayed administration: time and content sampling
Criterion-related evidence
Evidence based on relations to other variables • Concurrent validity evidence • Predictive validity evidence
Evidence based on internal structure
Factor analysis: statistical method used to determine the number of conceptually distinct factors or dimensions underlying a test or battery of tests -Identifies latent constructs (factors) that correlate with the items
Cell C
False negatives (miss)
Cell B
False positives (false alarm)
Decision-theory models
Help the test user determine how much information a predictor test can contribute when making classification decisions Must consider: -base rate - the general rate of the condition in your population of interest -selection ratio - ex. the ratio of open positions to the number of applicants
Percentile Rank
Indicates percentage of scores in the distribution that fall at or below that score
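A sketch of the computation (conventions for handling ties vary):

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution at or below `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical distribution of 8 scores:
print(percentile_rank(85, [60, 70, 75, 80, 85, 90, 95, 100]))  # 62.5
```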
Concurrent validity
Indicates the extent to which test scores accurately estimate an individual's present position on the relevant criterion - test scores and criterion information are obtained simultaneously (ex. a written driving test correlated with an actual driving test taken at the same time)
Predictive validity
Indicates the extent to which test scores predict a future criterion - the test is administered, a time interval passes, then the criterion is measured
Content-related evidence
Involves how adequately the test samples the content area of the identified construct *Item relevance - does each individual test item reflect essential content in the specified domain? (no questions that don't measure the construct) *Content coverage - degree to which all items cover the specified domain
Z-scores
Mean 0, SD 1 Z = (X - M) / SD
IQ scores
Mean 100 SD 15
T-scores
Mean 50, SD 10 T = 50 + [10(X - M) / SD]
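All three of these standard scores are linear transformations of the same z-score; a sketch with a hypothetical raw-score distribution:

```python
raw, mean, sd = 65, 50, 10    # hypothetical raw score, group mean, group SD

z = (raw - mean) / sd         # z-score: mean 0, SD 1
t = 50 + 10 * z               # T-score: mean 50, SD 10
iq = 100 + 15 * z             # deviation IQ: mean 100, SD 15

print(z, t, iq)               # 1.5 65.0 122.5
```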
Standard scores based on nonlinear transformations:
Normalized standard scores - Standard scores based on underlying distributions that were not originally normal, but were transformed into normal distributions - nonlinear transformations Examples 1. Stanine scores 2. Wechsler scaled scores 3. Normal curve equivalent (NCE)
Standard error of measurement
Particularly useful in the interpretation of individual scores. Computed as SEM = SD√(1 − Rxx). Allows us to easily compute a confidence interval within which scores should fall. -68% CI: score +/- 1 SEM -95% CI: score +/- 2 SEM -99.7% CI: score +/- 3 SEM
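A sketch using that formula (test SD, reliability, and obtained score are hypothetical):

```python
import math

sd, r_xx = 15, 0.91     # hypothetical test SD and reliability coefficient
obtained = 110          # hypothetical obtained score

sem = sd * math.sqrt(1 - r_xx)                    # 4.5
ci_68 = (obtained - sem, obtained + sem)          # ~68% confidence interval
ci_95 = (obtained - 2 * sem, obtained + 2 * sem)  # ~95% confidence interval
print(sem, ci_68, ci_95)
```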
Skewness
Positively skewed - tail to the right; from left to right: mode, median, mean. Negatively skewed - tail to the left; from left to right: mean, median, mode.
Split-half reliability and major source of error
Provides a measure of internal consistency -By itself it estimates the reliability of only half the test (the Spearman-Brown formula corrects the estimate to full length) -The longer the test, the more reliable it tends to be Error: content sampling
Reliability
Refers to the CONSISTENCY of scores obtained by the same person on equivalent tests, on different occasions, under other variable examining conditions, or on equivalent sets of items
The relationship between reliability and validity
Reliability is a necessary but insufficient precursor to validity
Limitations of norms
Restricted by the standardization sample's representativeness, specificity, and size -Norms don't always generalize
Percentile
Score at which a specified percentage of scores in a distribution fall below it
Classical theory of measurement
Test scores result from both: 1) factors that contribute to consistency, or true score 2) factors that contribute to inconsistency, or error (things we pick up on that we aren't trying to measure) X = T + E, where X = obtained score, T = true score, E = error In theory the true score is constant, but the error and obtained score are variable
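A small simulation of X = T + E under the classical assumptions - with random (unsystematic) error, the mean error approaches zero and the mean obtained score converges on the true score:

```python
import random

random.seed(0)
true_score = 100
errors = [random.gauss(0, 5) for _ in range(10_000)]  # random error, SD 5
observed = [true_score + e for e in errors]

print(sum(errors) / len(errors))      # ~0: random errors cancel out
print(sum(observed) / len(observed))  # ~100: converges on the true score
```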
Specificity
The ability of an instrument to detect the absence of a disorder when it is not present (i.e., detect normality) - high specificity = many true negatives and few false positives D/(B + D) True negatives / number of cases with a negative outcome
Sensitivity
The ability of the instrument to detect a disorder when it's present - high sensitivity = many true positives and few false negatives A/(A + C) True positives / number of cases with a positive outcome
Negative Predictive Value (NPV)
The proportion of "normal" cases that will be detected accurately. D/(C + D) True negatives / number of predicted negative cases
Positive Predictive Value (PPV)
The proportion of predicted positive cases that are truly positive. Conditions on predicted positives, whereas sensitivity conditions on actual positives. A/(A + B) True positives / number of predicted positive cases
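All four indices computed from one 2x2 decision table (cell counts are hypothetical):

```python
# A = true positives, B = false positives,
# C = false negatives, D = true negatives (hypothetical counts)
A, B, C, D = 40, 10, 5, 45

sensitivity = A / (A + C)  # detects the disorder when it is present
specificity = D / (B + D)  # detects normality when the disorder is absent
ppv = A / (A + B)          # accuracy of positive predictions
npv = D / (C + D)          # accuracy of negative predictions

print(sensitivity, specificity, ppv, npv)  # 0.888..., 0.818..., 0.8, 0.9
```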
objective
Theoretically, most aspects of a test should be based on objective criteria. -test construction -scoring -interpretation
Cell D
True negatives (correct prediction - hit)
standardization
Uniformity of procedure in administration, scoring, and interpretation of tests. -reduces between subjects variability
Domain sampling theory
Where do reliability coefficients come from? -Domain - population or universe of all possible items measuring a single trait or construct (theoretically infinite) -Test - sample of items drawn from that domain (the universe of items) -Reliability is the proportion of variance in the "universe" explained by test variance
WWI & WWII Contributions to testing
Yerkes developed group tests of human abilities for classification and assignment - the Army Alpha and Beta, used to screen recruits. Led to the development of tests that could be administered to large numbers of people at the same time
Criterion
A measure of some attribute or outcome that is of primary interest - ex. academic performance (as reflected by GPA), job performance (as measured by supervisor ratings)
Measures of Internal consistency and error
coefficient-alpha and KR-20: Use a single administration and a single form to provide an estimation of the consistency of responses to all items in the test Error: content sampling and item heterogeneity
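A sketch of coefficient alpha from a small item-response matrix (the ratings are hypothetical; population variances are used for simplicity):

```python
from statistics import pvariance

# Rows = examinees, columns = items (hypothetical 0-5 ratings)
data = [
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 1, 2],
]
k = len(data[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*data)]  # variance of each item
total_var = pvariance([sum(row) for row in data])   # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))  # high alpha -> items respond consistently
```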
Measures of central tendency
mean - most preferred when there is a normal distribution mode - most often used with nominal data median - most preferred when there is a skewed distribution
Pearson Product-Moment Correlation
Most common correlation coefficient because it accounts for: 1. the person's position in the group 2. the individual's deviation above or below the group mean
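A sketch computing r as the mean cross-product of paired z-scores, which makes both points explicit (data are hypothetical):

```python
from statistics import mean, pstdev

# Hypothetical paired scores for the same five examinees
x = [2, 4, 5, 7, 9]
y = [1, 3, 6, 6, 8]

zx = [(v - mean(x)) / pstdev(x) for v in x]  # position relative to group
zy = [(v - mean(y)) / pstdev(y) for v in y]  # deviation from group mean

r = sum(a * b for a, b in zip(zx, zy)) / len(x)
print(round(r, 3))  # 0.947
```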
B + D
number of cases with a negative outcome
A + C
number of cases with a positive outcome
C + D
number of predicted negative cases
A + B
number of predicted positive cases
What are the main threats to content-related evidence?
o Construct underrepresentation - failure to capture important components of a construct. Ex. a test that is supposed to measure comprehensive math skills contains only division problems (it measures less of the construct than it is supposed to measure) o Construct-irrelevant variance - occurs when the test measures something unrelated to the target construct. Ex. if a math test has complex written instructions, it may be measuring reading comprehension as well as math skills
What forms of evidence are used to evaluate validity?
o Content-related o Criterion-related o Construct-related
How is content-related evidence evaluated?
o Define the construct o Identify domains of interest (determine from the literature what the components of the construct are) o Develop an item pool to fit the domains, with adequate sampling of each (the number of questions from different domains should be balanced) o Expert item analysis and expert review of the entire instrument o Pilot test with feedback
Norms
Performance by a defined group (i.e., the standardization sample) on a particular test (the majority of psychological test scores are interpreted relative to norms)
Cell A
True positives (correct prediction - hit)
How is criterion-related evidence evaluated statistically?
• validity coefficient: correlation between test and criterion • standard error of the estimate: margin of error to be expected in the predicted criterion score
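A sketch of the standard error of the estimate, SEE = SDy × √(1 − r²), where r is the validity coefficient (values are hypothetical):

```python
import math

r_xy = 0.60   # hypothetical validity coefficient (test vs. criterion)
sd_y = 0.50   # hypothetical SD of the criterion (e.g., GPA points)

see = sd_y * math.sqrt(1 - r_xy ** 2)
print(see)  # 0.4 -> expected margin of error in the predicted GPA
```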