Tests and Measures - Test1


Galton

"Applied Darwinist" -"Founder of mental tests and measurements" responsible for launching the testing movement -He believed that tests of sensory discrimination could serve as a means of gauging a person's intellect -Anthropometric laboratory - 1st large scale systematic collection of data on individual differences

Reliability coefficient

(Rxx) The ratio of true score variance to the total variance of test scores: Rxx = σ²T / (σ²T + σ²E), where σ²T = variance of the true score and σ²E = variance of error; that is, (variance of true score) / (variance of true score + variance of error)
-Minimal measurement error produces more consistent, reliable scores (reliability coefficient closer to 1)
-More error = less reliability (coefficient closer to zero)
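
A minimal numeric sketch of this ratio (the variance figures are hypothetical):

```python
# Hypothetical variance components for a test
true_score_var = 16.0   # variance of true scores
error_var = 4.0         # variance of measurement error

r_xx = true_score_var / (true_score_var + error_var)
print(r_xx)  # 0.8 -- small error relative to true variance keeps Rxx near 1
```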

Multitrait-multimethod matrix

(see pg. 171 in book) We look at three different methods and three different traits, e.g., methods: self-report, parent-report, teacher-report; traits: depression, anxiety, ADHD. Each trait should correlate with itself across methods but should not correlate with the other traits.

How is construct-related evidence evaluated?

*Convergent validity: correlate the test with existing tests measuring the same or similar constructs
*Discriminant validity: correlate the test with existing tests measuring dissimilar constructs
*Multitrait-multimethod matrix
*Evaluation of internal structure, such as factor analysis
*Evaluation of response processes engaged in by examinees

Construct-related evidence

- Evidence based on response processes
- Evidence based on internal structure (factor analysis)
- Evidence based on consequences of testing (intended and unintended)
- Convergent & discriminant validity
- Multitrait-multimethod matrix

Ordinal scale

-Ranks individuals or objects, but can't say anything about the size of the difference between those ranks
-Ex. Likert scale
*Yes magnitude, no equal intervals, no absolute zero

Normal curve

-68% of scores fall within 1 standard deviation of the mean (34% on each side)
-About 95% fall within 2 SD of the mean (add about 13.5% on each side)
-About 99.7% fall within 3 SD of the mean
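
These areas can be checked directly against the standard normal distribution; a quick sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# Proportion of scores within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))
# 1 -> ~0.6827, 2 -> ~0.9545, 3 -> ~0.9973
```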

Contributions to measurement error

-Time sampling error: error that comes from giving the test at different times
-Item (or content) sampling error: comes from including items that don't adequately measure the construct
-Test administration: not following guidelines perfectly
-Test scoring
-Systematic errors of measurement: the test consistently measures something other than what it is supposed to measure (not random)

Ratio scale

-Absolute zero
-Ex. speedometer, height, weight
*Yes magnitude, yes equal intervals, yes absolute zero

Interval scale

-Differences make sense, but ratios do not (there is no meaningful ratio without an absolute zero)
-Ex. temperatures, dates, etc.
*Yes magnitude, yes equal intervals, no absolute zero

Evidence based on response processes

-Do respondents view and understand the items, instructions, and instrument in the same way, and respond using the anticipated methods?
-Analysis of the fit between the performance and actions examinees actually engage in and the construct being assessed

Nominal scale

-Labeling/naming objects
-Categories are often arbitrary, with no quantitative meaning
*No magnitude, no equal intervals, no absolute zero

How is criterion-related evidence evaluated?

-Review the subject population in the validity study
-Examine what the criterion really means (is it valid and reliable?)
-Never confuse the criterion with the predictor (criterion contamination). Ex. GRE scores are used to predict success in grad school, but some schools admit students with low scores and then require a higher GRE score before graduation; the problem is that low GRE scorers then appear to succeed
-Consider third-variable effects
-Review evidence for validity generalization
-Consider differential prediction: e.g., the test may be a good predictor for PhD programs but not for master's programs

Factors that can affect reliability

-Test characteristics (ex. true/false items are more reliable than Likert items because there is less variability)
-Sample characteristics (ex. sample size)
-Extent of the test's clarity

Evidence based on consequences of testing

-What are the expected and unexpected results, or consequences, of measurement?

Assumptions of classical test theory

1. Each person has a true score that would be obtained if there were no errors in measurement
2. Measurement errors are random (some push scores up and some push them down, so across many measurements the mean error is zero)
3. The mean error of measurement equals zero
4. True scores and error scores are uncorrelated
5. Errors on different tests are uncorrelated
-Main point: unsystematic errors are random and cancel out; systematic errors violate these assumptions and are a problem.

Psychological test

An objective and standardized measure of a sample of behavior

Validity

Appropriateness or accuracy of the interpretation of test scores. Validity is a characteristic of test scores, not of the test itself. Does the test measure what it was designed to measure?

Difference between test and assessment

Assessment: a judgmental process that includes the integration of multiple sources of information, including tests
Test: a single device or procedure

Interscorer reliability and major source of error

Consistency of rater judgment
-Problem: subjectivity
-Error: differences due to raters/scorers

Spearman-Brown

Estimates the reliability of the whole test when only half was tested. It can also be used to estimate the effect that changing the number of test items (adding or removing items) will have on reliability.
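
A small sketch of the prophecy formula, r* = n·r / (1 + (n - 1)·r), where n is the factor by which test length changes (the reliability values here are hypothetical):

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by n."""
    return n * r / (1 + (n - 1) * r)

# Half-test reliability of .70, projected to the full (doubled) test:
print(spearman_brown(0.70, 2))  # ~0.824
# Effect of tripling the number of items:
print(spearman_brown(0.70, 3))  # ~0.875
```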

Test-retest reliability and major sources of error

Evaluates error associated with administering a test at two different times
-Generally used to evaluate constant traits
-Test-retest error: time sampling

Alternate-forms reliability and major sources of error

Evaluates the error associated with selecting a particular set of items
-Error in simultaneous administration: content sampling
-Error in delayed administration: time and content sampling

Criterion-related evidence

Evidence based on relations to other variables
• Concurrent validity evidence
• Predictive validity evidence

Evidence based on internal structure

Factor analysis: a statistical method used to determine the number of conceptually distinct factors or dimensions underlying a test or battery of tests
-Looks at latent constructs (factors) that correlate with the observed questions
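
A minimal sketch of this idea, assuming scikit-learn and NumPy are available (the data are simulated, not from any real test):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))   # 200 examinees, 2 latent factors
loadings = rng.normal(size=(2, 8))   # how 8 observed items load on them
items = latent @ loadings + rng.normal(scale=0.5, size=(200, 8))

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(fa.components_)  # estimated loadings: each item's mapping to a factor
```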

Cell C

False negatives (miss)

Cell B

False positives (false alarm)

Decision-theory models

Help the test user determine how much information a predictor test can contribute when making classification decisions. Must consider:
-Base rate: the general rate of the condition in your population of interest
-Selection ratio: ex. the number of positions open relative to the number of people applying
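
One way to see why base rates matter: the same test yields very different positive predictive values at different base rates. A hypothetical illustration:

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value from sensitivity, specificity, and base rate."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# The same test (90% sensitive, 90% specific) at two base rates:
print(ppv(0.90, 0.90, 0.50))  # ~0.90 when the condition is common
print(ppv(0.90, 0.90, 0.05))  # ~0.32 when the condition is rare
```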

Percentile Rank

Indicates percentage of scores in the distribution that fall at or below that score
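
The "at or below" definition matches SciPy's weak percentile rank; a small check with made-up scores:

```python
from scipy.stats import percentileofscore

scores = [2, 3, 5, 5, 7, 8, 9, 10]
# kind='weak' counts scores at or below the given value
print(percentileofscore(scores, 5, kind='weak'))  # 50.0
```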

Concurrent validity

Indicates the extent to which test scores accurately estimate an individual's present position on the relevant criterion; test scores and criterion information are obtained simultaneously (ex. written driving test vs. actual driving test)

Predictive validity

Indicates the extent to which test scores predict future criterion performance: the test is administered, a time interval passes, then the criterion is measured

Content-related evidence

Involves how adequately the test samples the content area of the identified construct
*Item relevance: does each individual test item reflect essential content in the specified domain? (no questions that fail to measure the construct)
*Content coverage: the degree to which the items collectively cover the specified domain

Z-scores

Mean 0, SD 1
Z = (X - M) / SD

IQ scores

Mean 100 SD 15

T-scores

Mean 50, SD 10
T = 50 + 10(X - M)/SD
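
The z-score, T-score, and IQ-metric formulas above are all linear rescalings of the same deviation; a sketch with a hypothetical raw score:

```python
# Hypothetical raw score from a scale with M = 80, SD = 12
X, M, SD = 98, 80, 12

z = (X - M) / SD           # z-score: mean 0, SD 1
T = 50 + 10 * z            # T-score: mean 50, SD 10
iq_metric = 100 + 15 * z   # IQ metric: mean 100, SD 15

print(z, T, iq_metric)     # 1.5, 65.0, 122.5
```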

Standard scores based on nonlinear transformations:

Normalized standard scores: standard scores based on underlying distributions that were not originally normal but were transformed into normal distributions (nonlinear transformations). Examples:
1. Stanine scores
2. Wechsler scaled scores
3. Normal curve equivalent (NCE)
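
For reference (standard values, not from the card above): stanines run 1-9 with a mean of 5 and an SD of about 2; Wechsler scaled scores have a mean of 10 and an SD of 3; NCEs have a mean of 50 and an SD of about 21.06.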

standard error of measurement

Particularly useful in the interpretation of individual scores. Allows us to easily compute a confidence interval within which scores should fall.
-68% CI: observed score +/- 1 SEM
-95% CI: observed score +/- 2 SEM
-99% CI: observed score +/- 3 SEM
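
The SEM itself is computed as SEM = SD·sqrt(1 - Rxx); a sketch with hypothetical values:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test on the IQ metric (SD = 15) with reliability .89:
s = sem(15, 0.89)
print(round(s, 2))           # ~4.97
X = 110                      # observed score
print(X - 2 * s, X + 2 * s)  # ~95% CI: roughly 100 to 120
```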

Skewness

Positively skewed: tail to the right (mode < median < mean)
Negatively skewed: tail to the left (mean < median < mode)

Split-half reliability and major source of error

Provides a measure of internal consistency
-Only provides a reliability estimate for half of the test
-The longer the test, the more reliable it will be
-Error: content sampling

Reliability

Refers to the CONSISTENCY of scores obtained by the same person on equivalent tests, on different occasions, under other variable examining conditions, or on equivalent sets of items

The relationship between reliability and validity

Reliability is a necessary but insufficient precursor to validity

Limitations of norms

Norms are limited by the standardization sample: its representativeness, specificity, and size
-Results don't always generalize beyond that sample

Percentile

Score at which a specified percentage of scores in a distribution fall below it

Classical theory of measurement

Test scores result from both:
1) Factors that contribute to consistency: the true score
2) Factors that contribute to inconsistency: error (things we pick up on that we aren't trying to measure)
X = T + E, where X = obtained score, T = true score, E = error
In theory the true score is constant, but the error and obtained score vary.
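
Because true scores and errors are assumed uncorrelated, the variances add: σ²X = σ²T + σ²E, which is where the reliability coefficient above comes from. A quick simulation sketch, assuming NumPy (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
true = rng.normal(100, 15, 10_000)   # true scores, constant per person
error = rng.normal(0, 5, 10_000)     # random error with mean ~0
observed = true + error              # X = T + E

# Uncorrelated T and E means the variances add:
print(observed.var(), true.var() + error.var())  # approximately equal
```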

Specificity

The ability of an instrument to detect the absence of a disorder when it is not present (i.e., detect normality). High specificity = a high rate of true negatives and few false positives.
D/(B + D) = true negatives / number of cases with a negative outcome

Sensitivity

The ability of the instrument to detect a disorder when it is present. High sensitivity = a high rate of true positives and few false negatives.
A/(A + C) = true positives / number of cases with a positive outcome

Negative Predictive Value (NPV)

The proportion of "normal" cases that will be detected accurately. D/(C + D) True negatives / number of predicted negative cases

Positive Predictive Value (PPV)

The proportion of positive predictions that are accurate. Related to, but distinct from, sensitivity (which divides by actual positives rather than predicted positives).
A/(A + B) = true positives / number of predicted positive cases
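
All four indices come from the same 2x2 table of cells A-D (defined in the Cell A-D entries); a sketch with hypothetical cell counts:

```python
# Hypothetical 2x2 classification table:
# A = true positives, B = false positives,
# C = false negatives, D = true negatives
A, B, C, D = 40, 10, 5, 45

sensitivity = A / (A + C)   # detecting the disorder when present
specificity = D / (B + D)   # detecting normality when absent
ppv = A / (A + B)           # accuracy of positive predictions
npv = D / (C + D)           # accuracy of negative predictions

print(sensitivity, specificity, ppv, npv)
# ~0.889, ~0.818, 0.8, 0.9
```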

objective

Theoretically, most aspects of a test should be based on objective criteria:
-Test construction
-Scoring
-Interpretation

Cell D

True negatives (correct prediction - hit)

standardization

Uniformity of procedure in administration, scoring, and interpretation of tests.
-Reduces between-subjects variability

Domain sampling theory

Where do reliability coefficients come from?
-Domain: the population or universe of all possible items measuring a single trait or construct (theoretically infinite)
-Test: a sample of items from that domain (the universe of items)
-Reliability is the proportion of variance in the universe explained by test variance

WWI & WWII Contributions to testing

Yerkes developed group tests of human abilities for classification and assignment: the Army Alpha and Beta, used to screen recruits. This led to the development of tests that could be administered to massive numbers of people at the same time.

Criterion

A measure of some attribute or outcome that is of primary interest
-Ex. academic performance (as reflected by GPA)
-Ex. job performance (as measured by supervisor ratings)

Measures of internal consistency and error

Coefficient alpha and KR-20: use a single administration of a single form to estimate the consistency of responses to all items in the test
-Error: content sampling and item heterogeneity
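
Coefficient alpha is α = (k/(k - 1))·(1 - sum of item variances / variance of total scores); a small sketch, assuming NumPy, with a made-up item-response matrix:

```python
import numpy as np

# Hypothetical responses: 5 examinees x 4 items
X = np.array([[3, 4, 3, 4],
              [2, 2, 3, 2],
              [4, 5, 4, 5],
              [1, 2, 1, 2],
              [3, 3, 4, 3]])

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1).sum()   # sum of item variances
total_var = X.sum(axis=1).var(ddof=1)     # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(alpha)  # ~0.95 for these made-up data
```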

Measures of central tendency

Mean: most preferred when there is a normal distribution
Mode: most often used with nominal data
Median: most preferred when there is a skewed distribution

Pearson Product-Moment Correlation

The most common coefficient because it accounts for:
1. A person's position in the group
2. The individual's deviation above or below the group mean
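
Those two ingredients are exactly the z-scores; Pearson r is the mean cross-product of paired z-scores. A small sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Each person's position in the group, as deviation from the mean in SD units:
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

print((zx * zy).mean())         # 0.8
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy
```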

B + D

number of cases with a negative outcome

A + C

number of cases with a positive outcome

C + D

number of predicted negative cases

A + B

number of predicted positive cases

What are the main threats to content-related evidence?

o Construct underrepresentation: failure to capture important components of a construct. Ex. a test that is supposed to measure comprehensive math skills but only contains division problems (it measures less of the construct than it is supposed to)
o Construct-irrelevant variance: occurs when the test measures something unrelated to the target construct. Ex. if a math test has complex written instructions, it may be measuring reading comprehension as well as math skills

What forms of evidence are used to evaluate validity?

o Content-related o Criterion-related o Construct-related

How is content-related evidence evaluated?

o Define the construct
o Identify domains of interest (determine from the literature what the components of the construct are)
o Develop an item pool to fit the domains, with adequate sampling of each (the number of questions from different domains should be balanced)
o Expert item analysis and expert review of the entire instrument
o Pilot test with feedback

Norms

Performance by a defined group (i.e., the standardization sample) on a particular test (the majority of psychological test scores are interpreted relative to norms)

Cell A

True positives (correct prediction - hit)

How is criterion-related evidence evaluated statistically?

• Validity coefficient: the correlation between test and criterion
• Standard error of the estimate: the margin of error to be expected in the predicted criterion score
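
The standard error of the estimate is SDy·sqrt(1 - r²), where SDy is the criterion's standard deviation and r is the validity coefficient; a sketch with hypothetical numbers:

```python
import math

def se_estimate(sd_criterion, validity_r):
    """Margin of error for predicted criterion scores."""
    return sd_criterion * math.sqrt(1 - validity_r ** 2)

# Hypothetical criterion with SD = 0.5 GPA points and validity r = .40:
print(se_estimate(0.5, 0.40))  # ~0.46 -- still wide at modest validity
```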

