PSYCH 440- Exam 2


criterion referenced tests

designed to measure mastery of a set of concepts, not individual differences; items are organized hierarchically, so scores have limited/unequal variances and are not normally distributed

step 3: test tryout

determine which items are the best to use in a test; rule of thumb: initial tryout studies should use 5+ participants for each item in the test

standard error of difference (SED)

used to make 3 comparisons: the same individual on two different tests, two different individuals on the same test, and two different individuals on two different tests

assessing criterion validity

validity coefficient- typically a correlation coefficient between scores on the test and some criterion measure (rxy)

interpreting content validity ratio

values range from -1 to +1; negative value: fewer than half of the panel rate the item "essential"; 0: exactly half rate it "essential" (no agreement either way); positive value: more than half rate it "essential"
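
A minimal Python sketch of the arithmetic behind Lawshe's CVR; the panel size and "essential" counts below are hypothetical, not course data.

def content_validity_ratio(n_essential, n_panelists):
    # CVR = (n_e - N/2) / (N/2), where n_e = panelists rating the item "essential"
    half = n_panelists / 2
    return (n_essential - half) / half

print(content_validity_ratio(8, 10))   # 0.6  -> more than half said "essential"
print(content_validity_ratio(5, 10))   # 0.0  -> exactly half, no agreement
print(content_validity_ratio(3, 10))   # -0.4 -> fewer than half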

true variance

variability between scores due to actual differences in standings on the construct that we are trying to measure

what is the fundamental idea of classical test theory?

we are getting observed scores, not an exact measurement, so we are estimating

reliability

stability or consistency of measurement

Thurston scaling method

process designed for developing a 'true' interval scale for psychological constructs

5 stages of test development

1. conceptualization 2. construction 3. tryout 4. analysis 5. revision (as needed, with tryouts)

steps to content validity

1. content validity requires a precise definition of the construct being measured 2. domain sampling is used to determine behaviors that might represent that construct 3. determine the adequacy of domain sampling using Lawshe's Content Validity Ratio

test construction: scoring items

3 scoring options: cumulative, class/categorical, and ipsative (responses are compared to the person's own other responses)

item analysis: item indices

4 indices: item difficulty, item reliability, item validity, and item discrimination

expectancy data

Additional information that can be used to help establish the criterion-related validity of a test; usually displayed in an expectancy table that shows the likelihood that a test-taker will score within some interval of scores on a criterion measure

true score confidence intervals

CI = observed score +/- (z-score)(SEM) example: SEM = 15, student earns a score of 55; what is the CI at the 95% level? --> 55 - (1.96)(15) = 25.6 --> 55 + (1.96)(15) = 84.4
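
A minimal Python sketch of the same calculation; the observed score of 55 and SEM of 15 come from the example above, and z = 1.96 corresponds to a 95% interval.

def true_score_ci(observed, sem, z=1.96):
    # CI = observed score +/- z * SEM
    return observed - z * sem, observed + z * sem

low, high = true_score_ci(55, 15)
print(round(low, 1), round(high, 1))  # 25.6 84.4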

interpreting validity coefficients

Can be severely impacted by restriction of range effects, which underestimate the validity of the measure; you need the whole range of performance, because the measure will not be effective in making predictions without it

Most test-takers score high on a particular test. What effect is most likely occurring?

Ceiling effect

homogeneity evidence

Do subscales correlate with the total score? Do individual items correlate with subscale or total scale scores? Do all of the items load onto a single factor using a factor analysis?

incremental validity

Does this test predict any additional variance beyond what has been previously predicted by some other measure?

change evidence

If a construct is hypothesized to change over time (e.g., reading rate) or not (e.g., personality), scores should show either lack of stability or stability accordingly (depending on your theory). Should the construct change after an intervention (e.g., psychotherapy, training)?

types of scale methods

Likert, rating, and Guttman

examples of concurrent validity

scores on a new depression inventory correlating with Beck Depression Inventory scores; SAT scores compared with high school GPA

example of incremental validity

a new predictor (IV) explains variance that does not overlap with what a previous predictor (IV) already explains

reliability coefficients

Numerical values obtained by statistical methods that describe reliability; they range from 0 to 1, and negative values are unlikely

reliability coefficient formula

r = σ²_true / σ²_total, where σ²_total = σ²_true + σ²_error (total variance = true variance + error variance)
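
A minimal numeric sketch of that identity; the variance values are made up for illustration.

true_variance = 80.0
error_variance = 20.0
total_variance = true_variance + error_variance   # total = true + error
reliability = true_variance / total_variance      # r = true variance / total variance
print(reliability)  # 0.8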

examples of predictive validity

SAT scores predicting college GPA; interests and values predicting job satisfaction; news reporters catch a car accident live on TV while talking about a dangerous intersection

SEM formula

SEM = σ · sqrt(1 − r_xx), where σ = standard deviation of test scores and r_xx = reliability
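
A minimal Python sketch of the formula; the SD of 15 and reliability of .91 are hypothetical values chosen so the numbers come out evenly.

import math

def sem(sd, reliability):
    # SEM = SD * sqrt(1 - r_xx)
    return sd * math.sqrt(1 - reliability)

print(round(sem(15, 0.91), 2))  # 4.5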

inter-rater reliability

degree of agreement or consistency between 2+ scorers/judges/raters with regard to a particular measure

evidence for construct validity

The test is homogeneous, measuring a single construct; test scores increase or decrease as theoretically predicted; test scores vary by group as predicted by theory; test scores correlate with other measures as predicted by theory

classical test theory

X = T + E Observed score = True score + Error

criterion-related validity

a good criterion is generally: relevant (it reflects what we care about) and uncontaminated (the criterion is not itself based on or influenced by the predictor measure)

power test

a test in which items vary in difficulty; most test-takers will complete the test but will not get all the answers correct; all regular reliability coefficients can be used

speed test

a test in which the items are equal in difficulty and there is a time limit; most people will not finish the test but will get the items they do complete correct; reliability must be based on multiple administrations: test-retest, alternate forms, or split-half (from separately timed halves)

what are the ways to adjust group scores based on a biased test?

addition of points, differential scoring of items, elimination of items based on differential item functioning, differential cutoffs, separate lists, within-group norming, banding, and preference policies

what is reliability not concerned with?

all things validity: whether we are measuring what we intended to measure, the appropriateness of how we use the information, and test bias

generalizability theory

alternative view of classical test theory that suggests a test's reliability is a function of circumstances in which the test is developed, administered, and interpreted

administration error

anything that occurs during the administration of the test that could affect performance; different types: environmental, test-taker, and examiner factors

if 5% of the population has depression, then 5% would be the _____ rate of depression for that population.

base rate

Why might ability test scores among test takers most typically vary?

because of the true ability of the test-taker and irrelevant, unwanted influences

What makes a biased test different than an unfair test?

a biased test has something to do with test construction; fairness has more to do with personal and cultural values

cronbach's alpha

calculated using the mean of all possible split-half correlations; ranges from 0 to 1, where values closer to 1 indicate greater reliability
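
A minimal Python sketch using the common variance form of alpha, k/(k−1) × (1 − Σ item variances / variance of total scores); the 3-item, 5-person response matrix is hypothetical.

from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(items):
    # items: one list of scores per item, each with one entry per person
    k = len(items)
    totals = [sum(person) for person in zip(*items)]        # each person's total score
    item_var_sum = sum(variance(item) for item in items)    # sum of the item variances
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

responses = [[4, 3, 5, 2, 4],   # item 1
             [5, 3, 4, 2, 4],   # item 2
             [4, 2, 5, 1, 3]]   # item 3
print(round(cronbach_alpha(responses), 2))  # ~0.94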

coefficient of equivalence

component of parallel or alternate forms reliability, calculated by correlating two forms of a test; sources of error that impact it: motivation and fatigue, events that happen between the two administrations, item selection, and practice effects when the construct is highly influenced by practice

coefficient of stability

component of test-retest reliability, calculated by correlating the two sets of test results; sources of error that impact it: stability of the construct, time between administrations, practice effects, fatigue effects

construct validity

process of determining the appropriateness of inferences drawn from test scores measuring a construct; considered the umbrella concept to which other forms of validity are connected

item difficulty index

applies to tests with "correct" and "incorrect" answers; "correct" responses should reflect differential standings on the attribute (those who are high on it get the item right); item difficulty (p) = proportion of the total number of test-takers who got the item right (p = .80 means 80% got the item correct); item-total correlation = simple correlation between the score on an item and total test scores; advantage: you can test its statistical significance and interpret the % of variability the item accounts for
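
A minimal Python sketch of the difficulty index p for a single item; the 0/1 response vector is hypothetical.

item_responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 1 = correct, 0 = incorrect
p = sum(item_responses) / len(item_responses)     # proportion of test-takers who got the item right
print(p)  # 0.8 -> 80% answered correctly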

Kuder-Richardson

two types were developed: KR-20 and KR-21; the statistic of choice with dichotomous items; KR-20 yields a lower estimate of r when items are heterogeneous
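
A minimal Python sketch of KR-20 in the form k/(k−1) × (1 − Σ p_i·q_i / variance of total scores); the small 0/1 response matrix is hypothetical.

from statistics import variance  # sample variance

def kr20(items):
    # items: one list of 0/1 scores per item, each with one entry per person
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    pq = sum((sum(i) / len(i)) * (1 - sum(i) / len(i)) for i in items)  # sum of p_i * q_i
    return k / (k - 1) * (1 - pq / variance(totals))

responses = [[1, 1, 0, 1, 0],   # item 1
             [1, 0, 0, 1, 0],   # item 2
             [1, 1, 1, 1, 0]]   # item 3
print(round(kr20(responses), 2))  # ~0.94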

item validity index

does item measure what it is purported to measure? evaluated using latent trait theory and confirmatory factor analysis

convergent validity

does our measure correlate highly with other tests designed for, or assumed to measure, the same construct? it doesn't have to be the exact same construct; similar constructs are okay. example: measures that are supposed to be correlated are, in fact, correlated

face validity

does the test look like it measures what it's supposed to? has to do with judgments of the TEST TAKER, not the test user; should consider how people will respond if they know the content of the test, since high face validity can allow deception; the biggest issue with low face validity is credibility

Messick's validity

evidential basis: face, content, criterion, construct, relevance, and utility; consequential basis: appropriateness of use, determined by the consequences of use

base rate

extent to which a particular trait, behavior, characteristic, or attribute exists in the population; predictability is related to base rate: the lower the base rate, the harder the outcome is to predict (ex: when the next meteor will hit Earth)

test tryout: faking

faking is an issue with attitude measures; faking good = positive self-presentation; faking bad = trying to create a less favorable impression; corrections: lie scales, social desirability scales, fake good/bad scales, infrequent-response items, total score corrections based on scores obtained from measures of faking, and using measures with low face validity

generalizability vs true score theory

generalizability theory attempts to describe the facets across which scores generalize; true score theory does not identify the effects of different characteristics (facets) on the observed score

step 4: item analysis

good test items are reliable and valid and help discriminate between test-takers on the attribute; item analysis is used to differentiate good items from bad ones

test tryout: guessing

guessing- only an issue for tests where a "correct answer" exists methods: -verbal or written instructions that discourage guessing -penalties for incorrect answers -not counting omitted answers as incorrect -ignoring the issue

SEM and reliability

has an inverse relationship with reliability; we can assume a normal distribution of observed scores around the true score; low SEM = more precision (and vice versa); a reliable test has a low SEM

validity

how well a test measures what it is supposed to measure

homogeneous test construction

if a test measures only one construct

impact of facets on test scores

if all facets are the same across administrations, we would expect the same score each time if facets vary, we would expect scores to vary

adding items to a test will do what to the reliability of the test?

increase the reliability

item reliability index

indication of the internal consistency of the scale and uses factor analysis so items can be discarded or revised

test construction error

item or content sampling: differences in wording or how content is selected; can be produced by variation in items within a test or between different tests; may have to do with how and what behaviors are sampled

Guttman scale

items range from weaker to stronger expressions of the variable being measured; when one item is endorsed, the less extreme positions are also endorsed; produces an ordinal scale

content validity

judgment of how adequately a test samples behavior representative of the behavior it was designed to sample

examples of environmental factors

lighting, noise, temperature, how comfortable the chair is

For a heterogeneous test, measures of internal-consistency reliability will tend to be ________ compared with other methods of estimating reliability.

lower, because scores on heterogeneous items will not correlate as strongly with one another, yielding lower internal-consistency estimates

how can you improve a test's reliability to get the "truest scores"?

maximize true variance and minimize error variance

internal consistency

measure of consistency within a test (how well do the items "hang together," or correlate with each other?); 3 ways to measure: split-half, Kuder-Richardson, and Cronbach's alpha

psychological traits

relatively enduring ways in which one individual varies from another (in contrast to psychological states, which are context dependent and can change quickly from one moment to another)

discriminant validity

measures shouldn't correlate with measures of dissimilar constructs; it is a problem when a measure accidentally correlates highly with other measures/variables it shouldn't; ex: measures that shouldn't correlate and, in fact, don't

false positive

a miss in which the tool of assessment indicates that the test-taker possesses or exhibits a particular trait, ability, behavior, or attribute when in fact the test-taker does not; ex: a COVID or pregnancy test reads positive when the person does not actually have the condition

step 5: test revision

modifying test stimuli or administration on the basis of quantitative and/or qualitative item analysis; the goal is to balance strengths and weaknesses for the intended use and population of the test; cross-validation: re-establishing the reliability and validity of the test with other samples, conducted by a developer with an interest in that area and using a different sample than the one used in development; item fairness: an item is unfair if it favors one particular group of examinees relative to another

examples of test-taker variables

mood, alertness, errors in entering answers

what are the 3 types of selected-response format items?

multiple-choice, matching, and binary choice (T/F)

item discrimination explanations

no discrimination: experts and novices perform the same (discrimination value of 0); discriminates in the expected direction: experts do better than novices (positive discrimination value); discriminates in the expected direction when the intention is for novices to do better than experts (negative discrimination value)

examples of facets

number of test items, training of test scorers, purpose of the test administration

examples of examiner-related variables

physical appearance, demeanor, nonverbal cues

true variance and reliability

positive, linear relationship; 0 reliability means you are measuring only error; perfect reliability = 1.0; as reliability increases, you measure more of what you intend to measure

true score estimates

the observed score is the best estimate of the true score; e.g., if the observed score is 100, the estimate of the true score is also 100

SED formulas

σ_diff = sqrt(SEM₁² + SEM₂²), or equivalently σ_diff = σ · sqrt(2 − r₁ − r₂); SED > SEM because it accounts for two sources of error
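
A minimal Python sketch showing that the two forms give the same value; the SD of 15 and the two reliabilities are hypothetical.

import math

def sed_from_sems(sem1, sem2):
    # SED = sqrt(SEM1^2 + SEM2^2)
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def sed_from_reliabilities(sd, r1, r2):
    # SED = SD * sqrt(2 - r1 - r2), assuming both tests share the same SD
    return sd * math.sqrt(2 - r1 - r2)

sd, r1, r2 = 15, 0.90, 0.84
sem1 = sd * math.sqrt(1 - r1)
sem2 = sd * math.sqrt(1 - r2)
print(round(sed_from_sems(sem1, sem2), 2))           # 7.65
print(round(sed_from_reliabilities(sd, r1, r2), 2))  # 7.65 (same value)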

what type of reliability estimate is the most expensive to obtain?

parallel forms, because you would need to develop and validate two full versions of the test

error variance

part of the observed score that is due to irrelevant, random sources

error

part of the score that deviates from the true score

hit rate

proportion of people accurately identified as possessing or exhibiting some characteristic; you want a high hit rate to show correct outcomes

miss rate

proportion of people the test fails to identify correctly as having or not having a particular characteristic; what you are most concerned about; you want a low miss rate; 2 types: false negatives and false positives

selection ratio

proportion of the population that is selected; as validity increases, the proportion of correct selections improves over the base rate; with a small selection ratio, even a small increase in validity will help
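
A minimal Python sketch tying base rate, selection ratio, hit rate, and miss rate to a hypothetical 2 x 2 decision table; all counts are invented for illustration.

true_pos, false_pos = 40, 10    # selected by the test: actually succeed / actually fail
false_neg, true_neg = 15, 35    # rejected by the test: would have succeeded / would have failed
n = true_pos + false_pos + false_neg + true_neg

base_rate = (true_pos + false_neg) / n           # 0.55: proportion who succeed regardless of the test
selection_ratio = (true_pos + false_pos) / n     # 0.50: proportion the test selects
hit_rate = (true_pos + true_neg) / n             # 0.75: proportion of correct decisions
miss_rate = (false_pos + false_neg) / n          # 0.25: false positives + false negatives
print(base_rate, selection_ratio, hit_rate, miss_rate)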

intentional admin error

responding to a question without listening to it (example: giving obscure answers on Family Feud)

test construction: writing items

rule of thumb: write 2x as many items for each construct as will be intended for the final test version; the item pool is a reservoir of potential items that may or may not be used on the test; good content validity = items that represent the item pool

step 2: test construction

scaling is the process of selecting rules for assigning numbers to measurements of varying amounts of some trait, attribute, or characteristic; includes rating scales

test construction: choosing item type

selected-response items: less time to answer, assess breadth of knowledge; constructed-response items: time consuming to answer, assess depth of knowledge; scoring of selected-response items is more objective and reliable than scoring of constructed-response items

error and reliability

sources of error help us determine which reliability estimate to choose

step 1: test conceptualization

stimulated by societal trends, personal experience, a gap in the literature, or the need for a new tool to assess a construct; for a norm-referenced test, a good item is one that people who scored high on the total test tended to get right and people who scored low tended to get wrong

scoring and interpretation error

subjectivity of scoring is a source of error variance; more likely a problem with non-objective tests: essay tests, behavioral observations, and computer scoring errors

method of paired comparisons

test-takers are presented with 2 stimuli and asked to make some sort of comparison between them

Likert-type rating scale

test-takers are presented with 5 alternative responses on some continuum, with the extremes assigned scores of 1 and 5; Likert-type scales are generally reliable, result in ordinal-level data, and are summative

sorting tasks

test-takers are asked to order stimuli on the basis of some rule; categorical: stimuli are placed in categories; comparative: stimuli are placed in an order

construct validation

takes place when an investigator believes that their instrument reflects a particular construct, to which certain meanings are attached; the proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim

sources of error

test construction, test administration, and test scoring and interpretation (error can come from all 3)

split-half reliability

test items are split in half and the scores on the two halves are correlated; the Spearman-Brown formula is then used to estimate the reliability of the full-length test
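
A minimal Python sketch; the half-test score vectors are hypothetical, and statistics.correlation requires Python 3.10+.

from statistics import correlation  # Pearson correlation; Python 3.10+

odd_half  = [10, 14, 9, 16, 12, 11]   # scores on odd-numbered items
even_half = [11, 13, 10, 15, 11, 12]  # scores on even-numbered items

r_half = correlation(odd_half, even_half)   # reliability of a half-length test
r_full = 2 * r_half / (1 + r_half)          # Spearman-Brown correction to full length
print(round(r_half, 2), round(r_full, 2))   # ~0.94 ~0.97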

false negative

the test predicted that the test-taker did not possess the particular characteristic or attribute being measured when the test-taker actually did; ex: a COVID or pregnancy test is negative, but symptoms persist

types of reliability

test-retest; parallel or alternate forms; inter-rater; split-half and other internal consistency measures

concurrent validity

the degree to which a test score is related to a criterion measure obtained at the same time; compare your measure to an already established measure

predictive validity

the degree to which a test score predicts scores on some future criterion strongest form of validity

test-retest reliability

the same test is given twice to the same group at different times

Spearman-Brown

used to estimate the reliability of a test that is shortened or lengthened; reliability increases as the length of a test increases; r_SB = n·r_xy / [1 + (n − 1)·r_xy]
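
A minimal Python sketch of the prophecy formula, where n is the factor by which the test is lengthened; the starting reliability of .70 is hypothetical.

def spearman_brown(r, n):
    # r_new = n * r / (1 + (n - 1) * r)
    return n * r / (1 + (n - 1) * r)

print(round(spearman_brown(0.70, 2), 2))    # 0.82: doubling a test with r = .70
print(round(spearman_brown(0.70, 0.5), 2))  # 0.54: halving the same test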

true score

the true standing on some construct

multitrait-multimethod matrix

tool used to evaluate both convergent and discriminant evidence (along with factor analysis); multitrait: must include two or more traits in the analysis; multimethod: must include two or more methods of measuring each construct

criterion vs norm referenced

traditional reliability methods are used with norm-referenced tests; criterion-referenced tests tend to reflect material that is mastered hierarchically, which reduces the variability of scores and therefore the reliability estimates

dynamic traits

traits that change slowly over time (examples: developmental changes, effects of experience)

static traits

traits that do not change over time

parallel or alternate forms

two different versions of a test that measure the same construct; parallel forms differ from alternate forms in that parallel forms have equal means and variances

item discrimination index

used to compare performance on a particular item between the upper and lower regions of the distribution of total test scores; symbolized by d, which compares the proportion of high scorers getting the item "correct" with the proportion of low scorers getting it "correct"; method 2: choose items on the basis of discrimination between groups (contrasted groups, empirically keyed scales, and actuarial designs); indexes run from most difficult to least difficult (lowest to highest number)
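
A minimal Python sketch of d = p(upper) − p(lower); the group sizes and counts are hypothetical.

upper_correct, upper_n = 24, 27   # high scorers (e.g., top ~27% on total score) who got the item right
lower_correct, lower_n = 10, 27   # low scorers (bottom ~27%) who got the item right
d = upper_correct / upper_n - lower_correct / lower_n
print(round(d, 2))  # 0.52 -> discriminates in the expected direction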

standard error of measurement (SEM)

used to estimate the extent to which an observed score deviates from the true score

heterogeneous test construction

when a test measures multiple constructs

restriction of range

when there is not a full range of scores, reliability coefficients may not reflect the true population coefficient; range restriction is influenced by sampling procedures and test difficulty

