Psych 440 Exam 2

Ch. 5 - Reliability

What is measurement reliability?

- refers to the stability or consistency of measurement - reliability is a matter of degree (not an all-or-none proposition)

Item Fairness

- An item is unfair if it favors one particular group of examinees in relation to another - Results in systematic differences between groups that are not due to the construct being tested - Persons showing the same ability as measured by the whole test should have the same probability of passing any given item that measures that ability (Jensen, 1980)

Validity and selection: The Taylor-Russell Table

- Base Rate = proportion of the population who will meet the criterion - Selection Ratio = proportion of the population selected • As validity increases, the proportion of selected people who succeed improves over the base rate • With a small selection ratio, even a small increase in validity will help

Item Types: Advantages and Disadvantages

- Be familiar with Table 7-1: "Some Advantages and Disadvantages of Various Item Formats" - Examples: • Can sample a great deal of content in a relatively short time (MC advantage) • Does not allow for expression of original or creative thought (MC disadvantage)

Interpreting CVR

- Calculated for each item - Values range from -1.0 to +1.0 • Negative: fewer than half of the experts rated the item "essential" • Zero (0): exactly half rated it "essential" • Positive: more than half rated it "essential" - Items typically kept if the amount of agreement exceeds chance agreement
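
As a quick numerical sketch of the card above, Lawshe's formula is CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item "essential" and N is the total number of panelists. The function name and panel sizes below are hypothetical.

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts
print(content_validity_ratio(7, 10))   # 0.4  (positive: more than half said "essential")
print(content_validity_ratio(5, 10))   # 0.0  (exactly half)
print(content_validity_ratio(3, 10))   # -0.4 (fewer than half)
```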

Item-Discrimination (method 2)

- Choose items on the basis of discrimination between two groups • Only works when you have clearly defined comparison groups - Different terms used to describe this method • Contrasted Groups • Empirically Keyed Scales • Actuarial designs

Step 1: Test Conceptualization

- Could be stimulated by anything: • societal trends • personal experience • a gap in the literature • need for a new tool to assess a construct of interest - Example: The International Personality Item Pool

Impact of facets on test scores

- Facets include: number of items in the test, training of test scorers, purpose of the test administration - If all facets in the environment are the same across administrations, we would expect the same score each time - If the facets vary, then we would expect the scores to vary

Guessing and faking

- Guessing is only an issue for tests where a "correct answer" exists • Not an issue when measuring attitudes - Faking can be an issue with attitudes • Faking Good: Positive self-presentation • Faking Bad: Malingering or trying to create a less favorable impression

Ideal average

- Ideal average pi is halfway between chance guessing and 1.0 - 4 option multiple choice • Chance is equal to .25 (1 out of 4) • Thus, optimal average difficulty = (.25 + 1.0)/2 = .625 - 5 option multiple choice • (.20 + 1.0)/2 = .60 - True-False Test • (.50 + 1.0)/2 = .75
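
A minimal Python sketch of the arithmetic above; the function name is hypothetical, and chance is simply 1 divided by the number of response options.

```python
def optimal_difficulty(n_options: int) -> float:
    """Midpoint between chance guessing (1/k) and a perfect 1.0."""
    chance = 1 / n_options
    return (chance + 1.0) / 2

print(optimal_difficulty(4))  # 0.625 (4-option multiple choice)
print(optimal_difficulty(5))  # 0.60  (5-option multiple choice)
print(optimal_difficulty(2))  # 0.75  (true-false)
```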

Change evidence

- If a construct is hypothesized to change over time (e.g., reading rate) or to remain stable (e.g., personality), scores should show change or stability accordingly (depending on your theory) - Should the construct change after an intervention (e.g., psychotherapy, training)?

Types of rating scales

- Likert (and Likert-type): test taker is presented with 5 alternative responses on some continuum (may also use a 7-point scale) - Extremes assigned scores of 1 and 5 - are generally reliable - result in ordinal-level data - summative scale

Thurstone Scaling Method

- Process designed for developing a 'true' interval scale for psychological constructs - Start with a large item pool • Get ratings of the items from experts • Items are selected using a statistical evaluation of the judges' ratings - Test takers choose items to match their beliefs • Individual's score is based on the judges' ratings

Item Difficulty Index

- The proportion of the total number of testtakers who got the item right (pi) - Examples: • p1 = .90; 90% got the item correct • p2 = .80; 80% got the item correct • p3 = .75; 75% got the item correct • p4 = .25; 25% got the item correct
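
Since p_i is just the proportion of test takers answering correctly, it can be computed as a column mean of a scored (0/1) response matrix. The data below are hypothetical.

```python
import numpy as np

# Hypothetical scored responses: rows = test takers, columns = items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
])

p = responses.mean(axis=0)  # proportion of test takers who got each item right
print(p)                    # [1.   0.75 0.75 0.25]
```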

Generalizability vs. True score

- True score theory does not identify the effects of different characteristics on the observed score - true score theory does not differentiate the finite sample of behaviors being measured from the universe of all possible behaviors - with generalizability theory, we attempt to describe the conditions (facets) over which one can generalize scores

Assessing criterion validity

- Validity Coefficient - typically a correlation coefficient between scores on the test and some criterion measure (rxy) - Pearson's r is the usual measure, but may need to use other types of correlation coefficients depending on the data scale

Content Validity

- a judgement of how adequately a test samples behavior representative of the behavior that it was designed to sample

Item-Total Correlation

- a simple correlation between the score on an item and the total test score Advantages - can test the statistical significance of the correlation - can interpret the % of variability the item accounts for (r_it^2)
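
A minimal sketch of the item-total correlation using SciPy's pearsonr; the score matrix is hypothetical, and here each item is correlated with the uncorrected total (some texts remove the item from the total first).

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical Likert-scored responses: rows = test takers, columns = items
scores = np.array([
    [5, 4, 5, 2],
    [4, 4, 3, 3],
    [2, 1, 2, 4],
    [5, 5, 4, 1],
    [3, 2, 3, 3],
])

total = scores.sum(axis=1)
for i in range(scores.shape[1]):
    r_it, p_val = pearsonr(scores[:, i], total)   # item-total correlation and its significance
    print(f"item {i}: r_it = {r_it:.2f}, r_it^2 = {r_it**2:.2f}, p = {p_val:.3f}")
```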

Standard error of the difference cont.

- a statistical measure to determine how large a difference should be before it is considered statistically significant

Expectancy Data

- additional information that can be used to help establish the criterion-related validity of a test - usually displayed using an expectancy table that shows the likelihood that a test taker who scores within a given interval on the test will score within some interval of scores on a criterion measure

Step 3: Test Tryout

- after we have designed a test and developed a pool of items, we need to determine which are the best - should use participants and conditions that match the test's intended use - rule of thumb is that initial studies should use five or more participants for each item in the test

Applications of generalizability theory

- the expected score across all possible combinations of environmental facets is called the universe score - this provides more practical information to be used in making decisions: - in what situations will the test be reliable? - what are the facets that most impact test reliability?

Concurrent validity (part of criterion validity)

- an index of the degree to which a test score is related to a criterion measure obtained at the same time Examples - new depression inventory correlated .89 with Beck Depression Inventory scores - correlated .85 with psychiatric diagnosis of major depressive disorder

Predictive validity (part of criterion validity)

- an index of the degree to which a test score predicts scores on some future criterion - very expensive, so researchers often start with concurrent validity and then move to predictive Examples - SAT or ACT scores predicting college GPA - personality and ability measures predicting job performance - interests and values predicting job satisfaction

Classical test theory

- any measurement score yielded from some test will be the product of two components: 1. true score- the true standing on some construct 2. error- the part of the score that deviates from the true standing on the construct

Administration error

- anything that occurs during the administration of the test that could affect performance - environmental factors: temperature, lighting, noise, how comfortable the chair is, etc. - test-taker factors: mood, alertness, errors in entering answers, etc. - examiner factors: physical appearance, demeanor, nonverbal cues, etc.

Reliability is not...

- are we measuring what we intend to measure - the appropriateness of how we use information - test bias *these are validity issues*

Interpreting validity coefficients

- can be severely impacted by restriction of range effects

Criterion-related validity

- criterion: the standard against which a test or a test score is evaluated - no strict rules exist about what can be used, so it could be just about anything A good criterion is generally: - relevant - uncontaminated - something that can be measured reliably

Generalizability theory cont.

- developed by Lee Cronbach - in this theory there is no "true" score - instead, a person's score on a given test will vary across administrations depending upon environmental conditions - these environmental conditions are called facets

The reliability requirement

- do we always require a high degree of reliability? NO - in what situations might you allow for a lower reliability coefficient? - depends on what you're trying to measure (graduate school entry vs. screening for an increase in depression with a client) - what are the implications of a lower coefficient? - you have to use a wider confidence interval

Convergent validity

- does our measure highly correlate with other tests designed for or assumed to measure the same construct? - doesn't have to be the exact same construct; similar constructs are OK

Item-Validity Index

- does the item measure what it is purported to measure? - often evaluated using latent trait theory - evaluated using confirmatory factor analysis

Face Validity

- does the test look like it measures what it is supposed to measure? - has to do more with the judgements of the test TAKER, not the test user - psychometric soundness does not require face validity (and vice-versa)

Incremental validity

- does the test predict additional variance beyond what has been previously predicted by some other measure?

Sources of error

- errors in test construction - errors in test administration - errors in test scoring and interpretation Possible to have all 3 - EX: Rorschach test

Interpretation depends on rates: Base rate

- extent to which a particular trait, behavior, characteristic or attribute exists in the population

Coefficient Alpha

- first developed by Lee J. Cronbach - can be interpreted as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula - ranges from 0 to 1 with values closer to 1 indicating greater reliability - most popular reliability coefficient in psychological research

Norm vs. Criterion

- for an NRT, a good item is one where people who scored high on the test tended to get it right, and people who scored low tended to get it wrong - for a CRT, the items need to assess mastery of the concepts - pilot comparison between people with and without mastery to see if items work properly

Calculating Coefficient Alpha

- formula is an extension of KR-20 - appropriate for use on non-dichotomous items because it considers the variance of each individual item in the equation
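
A minimal sketch of coefficient alpha as described above: k/(k-1) times one minus the ratio of summed item variances to total-score variance. The data and function name are hypothetical, and sample variances (ddof=1) are used throughout.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each individual item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 test takers x 4 non-dichotomous (Likert-type) items
scores = np.array([
    [5, 4, 5, 4],
    [4, 4, 3, 3],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
])
print(round(cronbach_alpha(scores), 2))
```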

Empirically Keyed Scales

- goal is to choose items that produce differences between the groups that are better than chance - resulting scales often have heterogeneous content and a limited range of interpretation - used in clinical settings, especially for diagnosis of mental disorders (also used in career counseling)

Ch. 5 part 2: Reliability and Standard Error

Reliability considerations: Homogeneous vs. heterogeneous test construction

- if a test measures only one construct, then the content of the test is homogeneous - if multiple constructs are measured, the test content is heterogeneous - with a heterogeneous test, a global measure of internal consistency will underestimate reliability (compared to test-retest) - but internal consistency measures can be calculated separately for each construct

SEM in practice

- in practice we don't have a large number of trials for each person to capture their true score - typically we only have one score from one trial - that one score represents only one point on that theoretical distribution of potential test scores

Standard error of the difference

- in practice, the SEM is most frequently used in the interpretation of an individual's test scores - another statistic, the standard error of the difference, is better when making comparisons between scores - scores between two people, or two scores from the same person over time
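
One common form of this statistic, assuming both scores are reported on the same scale with standard deviation SD, is SED = SD*sqrt(2 - r1 - r2), which is equivalent to the square root of the two squared SEMs added together. The numbers below are hypothetical.

```python
import math

def standard_error_of_difference(sd: float, r1: float, r2: float) -> float:
    """SED = SD * sqrt(2 - r1 - r2) = sqrt(SEM1^2 + SEM2^2) when both scores share one SD."""
    return sd * math.sqrt(2 - r1 - r2)

# Hypothetical: two tests with SD = 15 and reliabilities .90 and .85
print(standard_error_of_difference(15, 0.90, 0.85))  # 7.5
```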

Umbrella validity

- in the traditional model of validity, construct validity is an umbrella concept: Construct → content; Construct → criterion-related

Ch. 6 - Validity

Validity

- is a general term referring to a judgement regarding how well a test measures what it claims to measure - similar to reliability, it is not an all-or-none characteristic of a test

Generalizability theory

- is an alternate view (based on domain sampling theory) - suggests that a test's reliability is a function of the circumstances under which the test is developed, administered, and interpreted

Test construction error

- item or content sampling - differences in item wording or how content is selected may produce error - this error is produced by variation in items within a test or between different tests May have to do with - how behaviors are sampled - what behaviors are sampled

Guttman Scale

- items range from weaker to stronger expressions of the variable being measured - arranged so that agreement with a stronger statement implies agreement with the milder statements as well - produces ordinal data

Faking corrections

- lie scales - social desirability scales - fake good/bad scales - infrequent response items - total score corrections based on scores obtained from measures of faking - using measures with low face validity

Discriminant validity

- measure should not correlate with measures of dissimilar constructs - it's a problem if our measure accidentally correlates highly with other measures/variables that it shouldn't

Step 5: Test Revision

- modifying the test stimuli, administration, etc., on the basis of either quantitative or qualitative item analysis Each item will have strengths and weaknesses - goal is to balance these strengths and weaknesses for the intended use and population of the test

True variance and reliability

- more reliable tests have more true score variance - most psych tests are between .7 and .9 - easiest way to make a test more reliable is to make it longer

Test construction: Scoring Items

- now that you have created a large number of items, how are you going to score them? - decisions about scoring of items are related to the scaling methods used when designing the test Three options: 1. cumulative 2. class/categorical 3. ipsative

Reliability coefficients

- numerical values obtained by statistical methods that describe reliability - reliability coefficients have similar properties to correlation coefficients - generally range from 0 to 1 (negative values possible but not likely) - affected by the number of items

Interpretation depends on rates: Hit rate

- proportion of people accurately identified as possessing or exhibiting some characteristic

Interpretation depends on rates: Miss rate

- proportion of people the test fails to identify as having or not having a particular characteristic

Standard error of measurement

- provides an estimate of the amount of error inherent in an observed score or measurement - based on true score theory - inverse relation with reliability - used to estimate the extent to which an observed score deviates from a true score

Patrick's key points

- reliability coefficients are influenced by the same issues as correlation coefficients - the type of test (norm, criterion, power, speed, etc.) determines which types of reliability measure you can use - SEM and SED allow us to interpret test scores while taking into account reliability - confidence intervals are the preferred way to present test information

Item-Reliability

- remember, internal consistency is a measure of how well all items on a test are measuring the same construct - another way to determine this is to do a factor analysis - items that don't load on the main factor or that load weakly, can either be discarded or revised

Inter-rater or inter-scorer

- represents the degree of agreement (consistency) between multiple scorers - calculated with Pearson's r or Spearman's rho depending on the scale - proper training procedures and standardized scoring criteria are needed to produce consistent results

Test construction: Writing Items

- rule of thumb is to write twice as many items for each construct as will be needed for the final version of the test - an ITEM POOL is a reservoir of potential items that may or may not be used on a test - if the pool has good content coverage, the test must represent that pool in order to have good content validity

Reliability considerations: Restriction of Range

- sampling procedures may result in a restricted range of scores - test difficulty may also result in a restricted range of scores - if the range is restricted, the reliability coefficients may not reflect the true population coefficient - just like what happens with a correlation coefficient

Step 2: Test construction

- scaling is the process of selecting rules for assigning numbers to measurements of varying amounts of some trait, attribute, or characteristic - no single best way to assign numbers exists for all types of traits, attributes, or characteristics - but there may be an optimal method for the construct you want to measure

Reliability and error

- sources of error help us determine which reliability estimate to choose - each coefficient is affected by different sources of error - goal is to use the reliability measure that best addresses the sources of error associated with a test

Construct validation cont.

- start with hypothesis about how that construct should relate to observables - also need to hypothesize how your construct is related to other constructs - prediction of what these inter-relations should be like based on a theory - evidence for a construct is obtained by the accumulation of many findings

Scoring and interpretation error

- subjectivity of scoring is a source of error variance More likely to be a problem with: - non-objective personality tests (Rorschach) - essay tests - behavioral observations - computer scoring errors? (when it makes a mistake it's much harder to catch)

Construct validation

- takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings - the proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim

Patrick's key points

- tests can be invalid for different reasons, including design issues, confounding variables, and inappropriate use - different measures of validity are used to address concerns about threats to validity - choice of validity measure also depends on the type of test and its purpose

False negative

- the test predicts the outcome will not occur, but it does

False positive

- the test predicts the outcome will occur, but it does not

Types of reliability

- test-retest reliability - parallel- or alternate-forms reliability - inter-rater or inter-scorer reliability - split-half and other internal consistency measures Choice of reliability measure depends on the test's design and other logistic considerations - Ex: practice effects on the GRE

Item-Discrimination Index cont.

- used to compare performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores - selecting upper and lower regions that include 25 to 33% of the sample will typically yield the best results - contrast the number who got that item "correct" in the upper and lower ranges of the total test distribution
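
A minimal sketch of this contrast (a discrimination index d = proportion correct in the upper group minus proportion correct in the lower group); the item responses, total scores, and function name are hypothetical.

```python
import numpy as np

def discrimination_index(item: np.ndarray, total: np.ndarray, frac: float = 0.27) -> float:
    """d = proportion correct in the upper group minus proportion correct in the lower group."""
    n = int(round(len(total) * frac))   # size of each extreme group (e.g., 25-33% of the sample)
    order = np.argsort(total)           # indices sorted from lowest to highest total score
    lower, upper = order[:n], order[-n:]
    return item[upper].mean() - item[lower].mean()

# Hypothetical: 10 test takers, their total test scores, and whether each got this item correct
item  = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
total = np.array([48, 45, 42, 40, 38, 30, 28, 26, 24, 20])
print(round(discrimination_index(item, total, frac=0.3), 2))  # 0.67 (upper 3 vs. lower 3)
```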

Test interpretation

- the individual vs. the norm - test designers use normative data to provide an interpretive framework - test users are not interested in group scores - in applied settings, tests are administered to and interpreted for individuals - enter the SEM

Construct validity

- the process of determining the appropriateness of inferences drawn from test scores measuring a construct

Test-retest reliability

- the same test is administered twice to the same group with a time interval between administrations - coefficient of stability is calculated by correlating the two sets of test results

Evidence for construct validity

- the test is homogeneous, measuring a single construct - test scores increase or decrease as theoretically predicted - test scores vary by group as predicted by theory - test scores correlate with other measures as predicted by theory

Reliability considerations: Criterion vs. Norm

- traditional reliability methods are used most with norm-referenced tests - criterion-referenced tests tend to reflect material that is mastered hierarchically - this reduces variability in scores, which will also reduce reliability estimates

Parallel forms

- two different versions of a test that measure the same construct - each form has the same mean and variance Parallel forms are better than alternate forms, but more expensive to develop

Alternate forms

- two different versions of a test that measure the same construct - tests do not meet the equal means and variances criterion Coefficient of equivalence is calculated by correlating the two forms of the test

Kuder-Richardson formulas

- two types: KR-20 & KR-21 - statistic of choice with dichotomous items (e.g., "yes" or "no") - when items are more heterogeneous, the KR-20 yields a lower estimate of r rKR20 = [k / (k-1)]*[1 - (∑pq/σ^2)] *don't need to know formula*
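
A minimal sketch of the KR-20 formula shown above, for a 0/1-scored response matrix; the data are hypothetical, and the sample variance (ddof=1) is used for σ^2 (textbooks differ on this convention).

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 = [k/(k-1)] * [1 - sum(p*q) / variance of total scores], for dichotomous items."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                     # proportion answering each item "correctly"/"yes"
    q = 1 - p
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical 6 test takers x 4 dichotomous items
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(kr20(responses), 2))
```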

Guessing Correction Methods

- verbal or written instructions that discourage guessing - penalties for incorrect answers (i.e., the test taker gets no points for a blank answer, but loses points for an incorrect answer) - not counting omitted answers as incorrect - ignoring the issue

Other types of rating scales

1. Method of Paired Comparisons - test taker is presented with two stimuli and asked to make some sort of comparison (you have to compare every possible pair) 2. Sorting tasks - takers asked to order stimuli on the basis of some rule - Categorical: placed in categories - Comparative: placed in an order

Patrick's key points

1. Reliability is used to measure how consistent the results of a test will be - but this does not tell us anything about the appropriateness of the test 2. Choice of reliability measure depends on the type of test you are using - different reliability measures have different sources of error (also related to test type) 3. Adding items will typically increase reliability, but this is not always a practical solution

Test construction: choosing your item type

1. Selected-response items generally take less time to answer and are often used when breadth of knowledge is being assessed (ex: multiple-choice questions) 2. Constructed-response items are more time consuming to answer and are often used to assess depth of knowledge (ex: essay questions) - Selected-response item scoring is more objective (and therefore more reliable) than the scoring of constructed-response items

Ch. 8 - Test Development

Five stages of test development

1. Test conceptualization 2. Test construction 3. Test Tryout 4. Analysis 5. Revision

Content validity steps

1. content validity requires a precise definition of the construct being measured 2. domain sampling is used to determine behaviors that might represent that construct 3. determine the adequacy of domain sampling - Lawshe's content validity ratio (CVR; 1975) method can be used to assess expert agreement on domain sampling

3 questions to ask

1. what makes a test invalid 2. what are the different ways of assessing validity 3. what type of validity evidence is best for different types of tests

Step 4: Item Analysis

A good test is made up of good items - good items are reliable (consistent) - good items are valid (measure what they are supposed to measure) - just like a good test Good items also help discriminate between test takers on the basis of some attribute Item analysis is used to differentiate good items from bad items

Internal consistency

A measure of consistency within the test - how well do all of the items "hang together" or correlate with each other? The degree to which all items measure the same construct 3 ways to measure - split-half (with Spearman-Brown) - Kuder-Richardson (KR-20 & KR-21) - Cronbach's alpha

Multitrait-multimethod matrix

Both convergent and discriminant validity can be demonstrated using the Multitrait-multimethod matrix - Multitrait: must include two or more traits in the analysis - Multi-method: must include two or more methods of measuring each construct

Interpreting differences and changes

Changes in scores can occur across multiple test administrations for many reasons - growth - deterioration - learning - or just good old-fashioned error Are these differences due to real changes, or are they due to error?

Reliability considerations: Power vs. Speed

Power test - a test that has items that vary in their level of difficulty - most test takers will complete the test but will not get all items correct Speed test - a test where all items are approximately equal in difficulty - most test takers will get the answers right, but not finish the test - may have a varying number of questions

Messick's Validity

Divides validity into 2 categories 1. Evidential - face, content, criterion (concurrent, predictive), construct (convergent, discriminant), relevance and utility 2. Consequential - appropriateness of use determined by the consequences of use - EX: the SAT looked good on paper, but in California it was producing biased college selection procedures

Item Indices

Four indices are used to analyze and select items: 1. indices of item difficulty 2. indices of item reliability 3. indices of item validity 4. indices of item discrimination

Guttman Scale Example

How do you know when you have too much money? a. I own a car b. I own an Audi c. I own an Audi S8 d. I own a silver Audi S8 e. I own an Audi S8 made of Silver

Item difficulty indices cont.

Ideally, if we develop a test that has "correct" and "incorrect" answers - we would like those takers who are highest on the attribute to get more items correct than those who are not high on that attribute And ideally, whether or not someone gets an item "correct" should be based upon differential standings on that attribute

Item-Discrimination Index

If discrimination between those with high and low on some construct is the goal - we would want items with higher proportions of higher scorers getting the item "correct" - and lower proportions of lower scorers getting the item "correct"

Reliability and SEM

If we have high reliability then we would expect highly consistent results - because those results are consistent, the SD of possible scores would be small - and the SEM would be small Reliability and SEM are inversely related

Calculating the "n" for Spearman-Brown

If you have 300 items, but would like to have a test of only 100 items. • n=100/300=.33 If you have 10 items, but would like to add 30 more items • n=30/10=3

Patrick's key points

Items can be designed to measure breadth or depth of knowledge - it is difficult to measure both breadth and depth at the same time Item difficulty and item discrimination are both important considerations for selecting effective items for a test - optimal item difficulty (from a psychometric standpoint) may be impractical sometimes

Power vs. Speed cont.

Power test - can use all of the regular reliability coefficients Speed test - reliability must be based on multiple administrations 1. test-retest reliability 2. alternate-forms reliability 3. split half (special formula used)

Cross-Validation

Re-establishing the reliability and validity of the test on other samples - may be conducted by the developer or any other researcher with an interest in that area - must use a different sample than the one used in development, norming, or standardization

Calculating SEM

SEM for an individual's test score can be calculated from other statistics that provide information about true scores - SD of the distribution of test scores - reliability coefficient of the test

Split-half reliability

Simplest way to calculate internal consistency The following steps are used: - test items are split in half - the scores on each half of the test are correlated - the correlation coefficient is corrected with the Spearman-Brown formula
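
A minimal sketch of those three steps, assuming an odd-even split of a scored response matrix; the function name and data are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(scores: np.ndarray) -> float:
    """Odd-even split, correlate the two half-test scores, then Spearman-Brown correct (n = 2)."""
    odd = scores[:, 0::2].sum(axis=1)    # score on the odd-numbered items
    even = scores[:, 1::2].sum(axis=1)   # score on the even-numbered items
    r_half, _ = pearsonr(odd, even)
    return 2 * r_half / (1 + r_half)     # Spearman-Brown correction to full test length

# Hypothetical 6 test takers x 8 dichotomous items
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 0],
])
print(round(split_half_reliability(scores), 2))
```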

Reliability considerations: Dynamic vs. Static

Static traits do not change much Dynamic traits or states are those constructs that can change over time - may be fairly stable (not as easily changeable or more durable over time) - but may also have quick changes from one moment to another

Spearman-Brown example 1

Test A with a total of 50 items is split in half using the even-odd method The correlation between the two halves is .80 - what is the reliability of the test? rsb = n*rxy / [1 + (n - 1)*rxy] n = 50/25 = 2 rsb = 2*.800 / [1 + (2 - 1)*.800] = 1.6 / 1.8 = .89

Spearman-Brown example 2

Test B contains 100 items, but the publisher will only publish the test if it has 50 items. The reliability is currently .80 - what will the new reliability be? rsb = n*rxy / [1 + (n - 1)*rxy] n = 50/100 = .5 rsb = .5*.800 / [1 + (.5 - 1)*.800] = .4 / .6 = .67

Spearman-Brown example 3

Test C currently has 50 items, but the reliability coefficient is only .70 The publishers would like a coefficient of at least .90 - how many items do you need? rsb = n*rxy / [1 + (n - 1)*rxy] .900 = n*.700 / [1 + (n - 1)*.700] n ≈ 3.86, so 3.86 × 50 ≈ 193 items
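
As a check on this example, the Spearman-Brown formula can be rearranged to solve for the lengthening factor n; the helper name below is hypothetical.

```python
def length_factor(r_current: float, r_desired: float) -> float:
    """Rearranged Spearman-Brown: n = r_d * (1 - r_c) / (r_c * (1 - r_d))."""
    return r_desired * (1 - r_current) / (r_current * (1 - r_desired))

n = length_factor(0.70, 0.90)   # about 3.86 times the current length
print(round(n * 50))            # about 193 items needed in total
```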

Test-retest sources of error

There are multiple sources of error that impact the coefficient of stability: - stability of the construct - time - practice effects - fatigue effects

Parallel-alternate error

There are multiple sources of error that can impact the coefficient of equivalence: - motivation and fatigue - events that happen between the two administrations - item selection will also produce error Used most frequently when the construct is highly influenced by practice effects

Spearman-Brown

This can be used to estimate the reliability of a test that has been shortened or lengthened - Generally, reliability increases as the length of the test increases, assuming the additional items are of good quality rsb = n*rxy / [1 + (n - 1)*rxy] Note: n is equal to the number of items in the version you want to know the reliability for, divided by the number of items in the version you have the reliability for
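
A minimal sketch of the formula, using the n values from the worked examples above; the function name is hypothetical.

```python
def spearman_brown(r_xy: float, n: float) -> float:
    """r_sb = n * r_xy / [1 + (n - 1) * r_xy]; n = desired length / current length."""
    return n * r_xy / (1 + (n - 1) * r_xy)

print(round(spearman_brown(0.80, 2), 2))    # 0.89: doubling a half-test with r = .80 (Example 1)
print(round(spearman_brown(0.80, 0.5), 2))  # 0.67: halving a test with r = .80 (Example 2)
```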

The reliability coefficient

Usually interpreted as the proportion of true variance in the test scores, i.e., true variance divided by the total variance observed in the scores R = σ^2True / σ^2Total Where σ^2Total = σ^2True + σ^2ε Total Variance = True Variance + Error Variance

Classical test theory formula

X = T + ε X = observed score on some test T = true score on some construct ε = error affecting the observed score

Item difficulty indices

You are trying to develop an achievement test or an intelligence test - what if 95% of the tryout sample answered an item correctly? - what if only 2% of the tryout sample answered an item correctly? Would these be good questions for your test?

SEM in relation to classical test theory

observed score = true score + error The SEM (an SD) is a method of estimating the amount of error present in a test score It is a function of the reliability of the test (rxx) and the variability of test scores (SDx)

SEM formula

σ_meas = σ√(1 - rxx)
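
A minimal sketch of this formula, plus the kind of confidence interval mentioned in the key points above; the scale values are hypothetical (an IQ-style score with SD = 15 and reliability .91).

```python
import math

def sem(sd: float, r_xx: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r_xx)

s = sem(15, 0.91)                                 # 4.5
observed = 108
print(observed - 1.96 * s, observed + 1.96 * s)   # approximate 95% confidence interval around the observed score
```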

