Chapter 10 - Validity & Chapter 4 - Reliability


Internal Consistency

1 test, multiple items; split-half or internal-consistency reliability; Cronbach's alpha

Standard Error of Measurement answers

"How close is this test result to the true result"? (How accurate is it?)

Observer Differences

1 test, w/ 2+ observers; inter-observer reliability; Kappa

Confidence Interval answers

"how likely is a true score to fall within a range"? (e.g., SAT test scores)

Cohen's Kappa ranges:

Range: -1 to 1; "poor" < .40; "good" = .40 to .75; "excellent" > .75

Split-half method

The results of one half of the test are compared w/ the results of the other half (e.g., using the subscore for the odd-numbered items vs. the subscore for the even-numbered items)
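
A minimal sketch in Python of how this could be computed, assuming made-up 0/1 item scores; the Spearman-Brown step-up used to adjust the half-length correlation is an addition not named on this card:

```python
# Hypothetical data: rows = examinees, columns = items (scored 0/1).
import numpy as np

scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 0, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)    # subscore on odd-numbered items
even_half = scores[:, 1::2].sum(axis=1)   # subscore on even-numbered items

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # correlation of the two halves
r_full = 2 * r_half / (1 + r_half)               # Spearman-Brown correction to full length (assumed step)
print(round(r_half, 2), round(r_full, 2))
```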

Time Sampling

1 test given twice; test-retest reliability; correlation between scores

4 approaches to Construct Validation Studies

1) Correlation between a measure of the construct and a designated criterion (e.g., the correlation between an aptitude test and a measure of job performance, the construct) 2) Differentiation between groups 3) Factor analysis: take n measurements on the same examinee and run an n x n factor analysis 4) Multitrait-multimethod matrix

Three main types of validation studies (Validity Taxonomy)

1. Content Validation 2. Criterion-Related Validation 3. Construct Validation

Steps to Content Validation

1. Define the performance domain of interest 2. Select a panel of qualified experts in the content domain 3. Provide a structured framework for the process of matching items to the performance domain 4. Collect and summarize the data from the matching process

Item Sampling

2 different tests given once; parallel (alternate) forms reliability; correlation between scores

20% of participants selected and 90% success rate from those selected

1.0 minus the reliability ratio

= the % of random error (e.g., r = .40, so 40% of the variation in the score is due to variation in the "true" score; subtract this from 1.0, and the remaining 60% of variation is due to random error)

Focus Test methods:

Ad-hoc (informal): judge the face validity of items & remove items that don't fit. Statistical: factor analysis & discriminability analysis

Reliability can be improved through ________, _______ and _______, and _______.

Ad-hoc methods; factor analysis and discriminant analysis; statistically

Test-retest reliability

Administer the same test across some time period, compute the correlation between the 2 administrations (only used to measure stable traits - ex. NOT the Rorschach inkblot test)
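
A minimal sketch, assuming two made-up sets of scores from the same examinees; test-retest reliability is simply their Pearson correlation:

```python
# Hypothetical scores from two administrations of the same test (made-up data).
import numpy as np

time1 = np.array([12, 18, 25, 9, 21, 15, 30, 22])   # first administration
time2 = np.array([14, 17, 27, 11, 20, 16, 28, 24])  # second administration, weeks later

r_test_retest = np.corrcoef(time1, time2)[0, 1]      # correlation between the two administrations
print(round(r_test_retest, 2))                        # closer to 1.0 = more stable over time
```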

Estimates of Alpha:

Alpha of .70-.80 = borderline; alpha of .80 = OK; alpha of .90 & higher = good

Charles Spearman (1904)

Article: "The Proof and Measurement of Association between Two Things"; introduced Spearman's rho (a correlation for ordinal variables)

Pearson, Spearman, Thorndike (1900-1907)

Basic reliability theory

Evidence based on relations to other variables

Can the test usefully predict a certain type of outcome - the Criterion (can a GRE score predict GPA as a grad student)

Test-retest problem:

Carryover effect - when 1st testing session influences scores from 2nd session and changes the True score (e.g., participant remembers their answers from the 1st test)

Multi-trait, multi-method evidence

Combination of convergent and discriminant evidence

Reliability of difference of scores

It is common to take the difference between 2 scores. Problem: taking the difference dramatically reduces the reliability (e.g., for 2 tests w/ reliabilities .90 and .70 that correlate .70 with each other, the difference score has a reliability of .33)
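
A small sketch of the standard difference-score reliability formula (assumed here; the card only states the result), reproducing the .90/.70/.70 -> .33 example:

```python
# Hypothetical sketch: reliability of a difference score under classical test theory.
def difference_score_reliability(r_xx, r_yy, r_xy):
    # r_xx, r_yy: reliabilities of the two tests; r_xy: correlation between the tests
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

print(round(difference_score_reliability(0.90, 0.70, 0.70), 2))  # 0.33
```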

Confidence interval formula

Confidence Interval = Z * SEM

Inter-rater reliability

Consistency among different judges who are evaluating the same behavior Observer (rater) errors occur (e.g., was that a "hit" or a "punch")

Reliability

Consistency of measurement

Reporting Criterion-related studies

Continuous variables: Pearson product-moment correlation (PPMC) coefficient, called a validity coefficient. Categorical variables: phi coefficient

Most reliability measures are...

Correlation Coefficients

Split-halves are supplanted by...

Cronbach's Alpha

Test Content Evidence

Define the performance domain; item review by an expert panel; reliability/validity study (absolute vs. relative comparison)

Measurement Error (E)

Difference between the observed score and the true score: E = X - T. Equivalently, Observed score = True score + Error, i.e., X = T + E

Evidence based on internal structure

Discovery, definition, and discrimination of the psychological construct theorized to underlie the observed variation among a sample of observed test scores. (linear and non-linear factor analysis)

Errors of measurement

Discrepancy between true ability and measurement of ability

Classical test score theory

Each person has a True score that would be obtained if there were no errors in measurement

Assumption in classical test theory

Errors are random; true scores have no variability; and the distribution of errors is normal with a mean of zero

Cronbach's Alpha

Estimates a lower bound for reliability; the actual reliability may still be higher (a high alpha can confirm that a test is reliable, but a low alpha can't confirm that a test is unreliable)
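
A minimal sketch of how alpha could be computed from an examinee-by-item score matrix, using the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the data are made up:

```python
# Hypothetical sketch: Cronbach's alpha from an examinee-by-item score matrix.
import numpy as np

def cronbach_alpha(scores):
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([                              # rows = examinees, columns = items (made-up Likert data)
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [4, 3, 4, 4],
])
print(round(cronbach_alpha(scores), 2))
```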

Specific forms of validity evidence

Evidence based on test content or the consequences of testing; evidence regarding response processes, internal structure, or relationships with other variables

Construct

Exists but can't be directly measured (e.g., IQ)

How much to increase N?

Formulas are used to determine how much to increase N (the number of items) to reach a certain level of reliability: Nd = rd(1 - ro) / [ro(1 - rd)], where Nd = the factor by which to increase N, rd = desired level of reliability, ro = observed level of reliability
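
A small sketch of this formula; the example reliabilities (.70 observed, .90 desired) are made-up illustration values:

```python
# Hypothetical sketch: Nd = rd(1 - ro) / [ro(1 - rd)], the factor by which test
# length must be increased to move from the observed to the desired reliability.
def lengthening_factor(r_observed, r_desired):
    return r_desired * (1 - r_observed) / (r_observed * (1 - r_desired))

# e.g., a test with observed reliability .70 would need roughly 3.9 times
# as many items to reach a reliability of .90.
print(round(lengthening_factor(0.70, 0.90), 2))
```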

Internal Consistency Reliability

Give single test, calculate internal consistency of various subsets of items

Steps to designing a criterion-related validation study

1. Identify a suitable criterion behavior and a method for measuring it 2. Identify an appropriate sample 3. Administer the test and record scores 4. When criterion data are available, obtain a measure of performance for each examinee 5. Determine the strength of the relationship between test scores and criterion performance

Ways to increase reliability

Increase N (# of items, tests); focus on a single characteristic; use factor analysis to determine the sub-characteristics of a single test; use item analysis to find items that measure a single characteristic; statistically correct for attenuation

Drasgow et al (late 1990s)

Item Response Theory (IRT)

Bartholomew & Knott (1990s)

Latent variable theory

Sources of Error (True score - Observed score):

Measurement errors; changes in the true score (e.g., situational factors: the room is too hot, the test taker was depressed, etc.)

Kappa definition

Method for assessing the level of agreement among several observers. Values range from -1 (less agreement than expected by chance alone) to 1 (perfect agreement)
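
A minimal sketch of Cohen's kappa for two observers, using the usual formula kappa = (observed agreement - chance agreement) / (1 - chance agreement); the ratings are made up:

```python
# Hypothetical sketch: Cohen's kappa for two observers rating the same behaviors.
from collections import Counter

rater_a = ["hit", "hit", "punch", "hit", "punch", "hit", "punch", "hit"]  # made-up labels
rater_b = ["hit", "punch", "punch", "hit", "punch", "hit", "hit", "hit"]

n = len(rater_a)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement

count_a, count_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_chance = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)  # agreement expected by chance

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))   # about .43, "good" by the ranges listed above
```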

Evidence based on response processes

More than one type of response (Art, showing work in math) or developmental/theoretical change (age differentiation, intervention based)

Problem of Domain Sampling:

There is no way to measure the true score because there is no way to administer every possible item

Result of carryover effect

Overestimates the reliability; only a concern when the changes are random, not systematic (e.g., if everyone's score improved by 5 points, the correlation, and thus the reliability estimate, would be unchanged)

Samuel George Morton

Polygenism (the view that humans comprise different species); craniometry; biological determinism; "scientific racism"; worked ~50 yrs. before Spearman

Convergent Evidence

Positive correlations with similar constructs (variables)

Karl Pearson (1896)

Product-moment correlation (for continuous variables)

r =

Reliability Coefficient of test

Kuder & Richardson (1937), Cronbach (1989)

Reliability coefficients

Focus Test

Reliability increases the more a test focuses on a single concept or characteristic; capturing multiple concepts in a single test reduces reliability

When reliability is known, we get _____, and from _____ we get ________.

SEM; SEM; confidence intervals

Standard Error of Measurement formula:

SEM = S * sqrt(1 - r), where S = the standard deviation of the test scores and r = the reliability of the test
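
A small sketch combining this formula with the earlier Confidence Interval = Z * SEM card; the standard deviation, reliability, and observed score below are made-up illustration values:

```python
# Hypothetical sketch: SEM = S * sqrt(1 - r), then CI = observed score +/- Z * SEM.
import math

S = 100          # standard deviation of test scores (assumed)
r = 0.91         # reliability of the test (assumed)
observed = 620   # an examinee's observed score (assumed)
z = 1.96         # Z for a 95% confidence interval

sem = S * math.sqrt(1 - r)
lower, upper = observed - z * sem, observed + z * sem
print(round(sem, 1), (round(lower, 1), round(upper, 1)))   # 30.0 (561.2, 678.8)
```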

Solution of Domain Sampling:

Sample a limited subset of items by creating one or more tests; scores on these will be normally distributed around the true score. The correlation between 2 such sample tests will be lower than the correlation between 1 test & the true score

Concurrent Validity

The relationship between test scores and criterion measurements made at the time the test was given (pilot written test and performance test)

Observed score (X)

Score as measured by a test (different from a person's true ability)

Two pre-requisites for ethical test use

Scores need to be reliable and inferences valid

Observational data differs from...

Self-report data

Standard error of measurement for Classical test theory

The standard deviation of the distribution of errors

3 ways to estimate test reliability

Test-retest method, Parallel Forms, & Internal Consistency

True score (T)

The actual score that exists

Predictive Validity

The degree to which test scores predict criterion measurements that will be made at some point in the future (GRE test and Grad School GPA)

Reliability (Correlation) Coefficient definition

The ratio of the variance of the True scores (T) to the variance of the Observed scores (X) (given as a percentage)

Interpreting validity coefficients

The square of the correlation coefficient is called the coefficient of determination (e.g., r = .60, so CoD = .36: 36% of the variance in job performance is related to variance in performance on the predictor test)

How to conceptualize error?

To understand complex traits, psychologists use reliable tests.

Practice effect

Type of carryover effect; some skills improve w/ practice

Domain Sampling: Correlation of any 2 random sample tests

Unbiased estimate of "true" reliability

Problem with Split-half method

Underestimates reliability b/c each subtest is only 1/2 as long as the full test; a test with fewer items is less reliable than one with more items

discriminant evidence

Zero (or near zero) correlations with dissimilar constructs (variables)

SEM

used to form a confidence interval around our observed score within which the true score is estimated to fall

Standard deviation of the distribution of error tells us...

about the magnitude of measurement error

Parallel forms reliability

administer 2 versions of the test to same subjects (often on the same day), compute correlation between 2 administrations

As the sample gets larger...

becomes a more accurate and reliable representation of the domain

SEE

given the reliability of our test, how well does a score predict performance on another test?

The Standard Error of Measurement tells us (on average)...

how much a score varies from the True score

Dispersions around the true score tell us...

how much error there is in the measure

Test validation is a process that includes

integration into assessment design and gathering other evidence for convergent and discriminant validity

Z score associated with...

the desired percentage of the range (e.g., Z = 1.96 for 95%)

Pros & Cons of Parallel Forms

Pro: the most rigorous method of determining reliability. Con: difficult to do, so it is not often done

Reliability r =

r = .70 or .80 is "good enough" for most research; r > .90 = high reliability

There is more than one type of _________, to measure it we need to specify _____________ comes from

reliability; where the measurement error

Domain Sampling Model

taking a sample from a domain to determine your "true" score

Use of test scores

The test user must be able to justify the inferences drawn by having a cogent rationale for using the test score for that purpose as opposed to other available testing procedures

Criterion-related validation

test user desires to draw an inference from examinee's test score to performance on some real behavioral variable of practical importance

Content Validation

test user desires to draw an inference from the examinee's test score to a larger domain of items similar to those on the test

Construct validation

test user desires to draw an inference from the test score to performance that can be grouped under the label of a particular psychological construct: "No criterion or universe of content is accepted as truly adequate to define the quality to be measured" (Cronbach and Meehl, 1955)

Reliability is measured by correlating...

the Observed score with the True score

Error

the component of the observed test score that does not have to do with the test taker's ability

Brook's definition of Validity

the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests

Cronbach's definition of validation

the process by which a test developer or user collects evidence to support the types of inferences that are drawn from the test scores

If a test is ______, it is irrelevant whether or not it is ______. _______ is a foundation

unreliable; valid; reliability

If we know the Reliability (r) of the test...

we can estimate the likely range of true values

