PSYC3020 Lecture 3 - Reliability


What is true score theory?

- Same as Classical Test Theory. It is the idea that every measurement we take (the measured/observed test score) can be decomposed into two parts: 1. The true score (the underlying thing that our test is measuring) 2. Measurement error (everything captured by our actual test score that ISN'T the underlying thing our test is measuring)

What is reliability in terms of the relationship between true and total test score variability?

- In Classical Test Theory, reliability is the ratio of the TRUE VARIABILITY (the hypothetical distribution of test scores in a sample if there were no measurement error) to the TOTAL VARIABILITY (the actual distribution of test scores, which includes error). - Reliability is the proportion of the egg that is the yolk. LOWER MEASUREMENT ERROR = HIGHER RELIABILITY
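
To make this ratio concrete, here is a minimal Python sketch (not from the lecture; the numbers are made up) that simulates true scores plus measurement error and computes reliability as true variability divided by total variability:

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 10, size=5000)      # hypothetical true scores (T)
error = rng.normal(0, 5, size=5000)               # random measurement error (E)
observed = true_scores + error                    # X = T + E
reliability = true_scores.var() / observed.var()  # true variability / total variability
print(round(reliability, 2))                      # roughly 100 / (100 + 25) = 0.80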

List the various sources of measurement error

1. Test construction, e.g. domain sampling/item sampling/content sampling. 2. Test administration, e.g. distractions during the test, fatigue, Red Bull poisoning, invigilator demeanour. 3. Test scoring, e.g. biased examiners, ambiguous scoring guidelines, technical errors. 4. Other influences, such as self-efficacy, motivational factors, etc.

Law of Large Numbers

A large and diverse sample of evidence for someone's past behaviour is likely to be a good predictor of their future behaviour. A single example of someone's past behaviour will probably be a poor predictor of their future behaviour.

What is item/content sampling?

Also referred to as content sampling: the variety of the subject matter contained in the items; frequently referred to in the context of the variation between individual test items within a test, or between test items in two or more tests

Internal consistency

Average correlation between the items on your scale. If all items on scale supposed to be measuring the same thing, do individuals give consistent responses across items? Can also be called INTER-ITEM consistency or INTERNAL COHERENCE.

Imagine you create an ability test which involves an examiner making a number of ratings of an individual's thumb-rolling skill. You test the inter-rater reliability using two examiners and obtain a correlation of 0.87. What does this mean?

A correlation of 0.87 between the two examiners' ratings is high; it means they are similar (consistent) in their ratings.

What is classical test theory?

Classical Test Theory is the traditional conceptual basis of psychometrics - it's also known as "True Score Theory" - it is the idea that every measurement we take (the measured/observed test score) can be decomposed into two parts: 1. The true score (the underlying thing that our test is measuring) 2. Measurement error (everything captured by our actual test score that ISN'T the underlying thing our test is measuring)

True or False According to Classical Test Theory, if a test has very high reliability then the measurement error must be very high

False

True or False 0.50 is usually considered the typical minimum threshold for reliability

False

True or False 0.90 is usually considered the typical minimum threshold for reliability

False

True or False If we double the number of items in a test, then the SB formula predicts that we should double its reliability.

False

True or False Classical Test Theory can be described by the formula X=T/E (where X is the observed score, T is the true score, and E is the measurement error)

False (X=T+E not X=T/E)

True or False True Score Theory involves conceptualising test score variability as comprising true test score variation and total test score variation

False (comprises true test score variation and measurement error)

True or False Reliability as conceptualised in Classical Test Theory is total test score variation minus measurement error

False (it is the ratio of true variability to total (observed) variability)

True or False As part of the process of calculating Cronbach's alpha, you have to adjust for the homogeneity of the items by applying the Spearman-Brown formula.

False - it's not for homogeneity - we are adjusting for chopping the test in half

True or False Imagine you create an ability test which involves an examiner making a number of ratings of an individual's acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of .88. This means one examiner's rating is independent of the other's rating.

False - that would mean there is no correlation

True or False As part of the process of calculating Cronbach's alpha, you have to multiply the correlations derived from all possible ways of splitting the test into two

False - work out correlations for each half possibility, then average all possibilities.

True or False The fact that we cannot ask students everything about the course in the quizzes will decrease the proportion of the observed score that can be attributed to the true score (assuming the quiz marks are supposed to reflect students' overall PSYC3020 knowledge)

True - limited item/content sampling is a source of measurement error, so it reduces the proportion of the observed score that can be attributed to the true score (scores will vary depending on which questions happen to be asked).

Types of reliability evaluation: Internal consistency

How much the item scores in a test correlate with one another on average. Source of error: consistency of items within the same test (do they all measure the same thing?)

Types of reliability evaluation: Inter-rater reliability

If a test involves an examiner making a rating - get two of them to do the rating independently and see how much their ratings correlate (also compare means and SDs of ratings if appropriate). Source of error: Different observers / examiners / raters recording outcomes.

1. Homogeneity/heterogeneity of the test

HOMOGENEOUS test - all the test items measure the same thing (unidimensional). HETEROGENEOUS test - more than one independent thing is being measured (i.e. there are subscales that don't intercorrelate highly). If a measure is HETEROGENEOUS then internal consistency might be an inappropriate estimate of reliability (though you could instead look at the internal consistency of each subscale separately).

Types of reliability evaluation: Parallel-forms reliability

If people do two different versions of the same test, how much do their scores on the two versions correlate? Source of error: Item sampling (e.g. different items used to assess the same attribute)

Types of reliability evaluation: Test-retest reliability

If people sit the same test twice, how much do their scores correlate between the two sittings? Source of error: Time sampling (does something that should be the same over time actually vary because of error?)

3. Restriction of range/variance

If scores in a sample are inappropriately restricted in the amount they can vary then this will affect the correlation (and ALL our reliability estimates are based on correlations). This means we have to be careful in interpreting ANY of the reliability estimates (i.e. try to avoid having a restriction in the range of scores).
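
A minimal sketch (illustrative data, not from the lecture) of why restriction of range matters: simulate two correlated scores, then recompute the correlation using only the top scorers.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=10000)                 # e.g. test scores
y = x + rng.normal(0, 10, size=10000)              # a second variable correlated with x
full_r = np.corrcoef(x, y)[0, 1]                   # correlation in the full sample
keep = x > 60                                      # restrict the range: keep only high scorers
restricted_r = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(full_r, 2), round(restricted_r, 2))    # the restricted correlation is noticeably smaller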

Inter-rater reliability

Inter-rater reliability can be measured by looking at the correlation between scores on the same test by the same people provided by 2 different examiners.

Statistic we can use to evaluate test-retest reliability

The correlation coefficient. In Jamovi - click on "regression" then "correlation matrix". Click on the variables to transfer them into the right-hand box. (Don't need Cronbach's alpha.)
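
Outside Jamovi, the same test-retest correlation could be computed with a few lines of Python (hypothetical scores):

import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11])         # scores at the first sitting (made-up data)
time2 = np.array([13, 14, 10, 19, 18, 12])        # the same people at the second sitting
print(round(np.corrcoef(time1, time2)[0, 1], 2))  # test-retest reliability estimate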

5. Criterion-referenced tests

There may be very little variation in people's responses, e.g. in some pass/fail tests virtually everyone might pass (e.g. driving certification). This is an example of RESTRICTION OF RANGE. If there's no variation then it's a problem to use any of the reliability estimates, as they are all derived from assessing score variability. No variability - can't do correlations.

Cronbach's alpha

A measure of internal consistency

True or False If a test was unreliable then the true test score variability would only be a small proportion of the actual test score variability

True - reliability is the ratio of true variability to total variability, so if a test is unreliable the true test score variability is only a small proportion of the actual test score variability.

Parallel Forms reliability

Parallel Forms (or Alternate Forms) reliability is the correlation between scores on 2 versions of the same test by the same people done at the same time. (don't need Cronbach's alpha)

Definition of reliability

Reliability 'refers to the accuracy, dependability, consistency, or repeatability of test results'

Reliability and the number of items in a test

Reliability increases with more items and decreases with fewer items. This is the effect of domain sampling (= item sampling = content sampling = the idea that in any test, you're only testing a sample of what could possibly be tested). With more items, you have more samples of the domain of interest, which means that the test score becomes a better representation of the "total domain" score. More items also make the test more robust to the effects of occasional atypical responses (e.g. accidentally clicking the wrong thing).

Why can we only estimate the reliability of a test and not measure it directly?

Reliability refers to the degree to which test scores are free from errors of measurement. Because true variance is hypothetical/theoretical, we can't actually measure it (or reliability) directly. Instead we estimate reliability via test-retest, parallel-forms, internal consistency, and/or inter-rater reliability.

Test-retest reliability

Test-retest reliability is the correlation between scores on the same test by the same people done at two different times.

4. Speed tests vs power tests

SPEED test - speed of response. POWER test - level of difficulty of the responses (e.g. intelligence). For speed tests, internal consistency is not appropriate (because people tend to get all the questions they attempt correct but just don't have time to attempt all the questions), which gives a spurious correlation. For speed tests, use parallel-forms or test-retest reliability instead.

2. Static vs dynamic characteristics

STATIC - the test measures something that is meant to remain the same over time (e.g. intelligence). DYNAMIC - something expected to change over time (e.g. fatigue, state anxiety). If the characteristic is dynamic, TEST-RETEST reliability is a problem because it assumes the thing being measured stays the same.

How to calculate Cronbach's alpha in Jamovi?

Select FACTOR then RELIABILITY ANALYSIS. Select all items in your scale and move them to the "Items" box.
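
Outside Jamovi, alpha can also be computed directly from the item scores using the standard variance formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). A minimal Python sketch with made-up data:

import numpy as np

# rows = people, columns = items (hypothetical responses on a 5-item scale)
items = np.array([
    [4, 5, 4, 3, 4],
    [2, 2, 3, 2, 1],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [4, 4, 5, 4, 5],
])
k = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of the individual item variances
total_var = items.sum(axis=1).var(ddof=1)        # variance of each person's total score
alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 2))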

Cronbach's alpha - hand calculations

Step 1: Split the questionnaire in half. Step 2: Calculate the total score for each half. Step 3: Work out the correlation between the total scores for the two halves. Step 4: Repeat steps 1-3 for all possible two-way splits of the questionnaire. Step 5: Work out the average of all possible split-half correlations. Step 6: Adjust the correlation to account for the fact that you've shortened (halved) the test by applying a correction called the SPEARMAN-BROWN formula.
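
A minimal Python sketch of the hand-calculation route described above (split, correlate, average, then Spearman-Brown correct; the item data are made up):

import numpy as np
from itertools import combinations

# rows = people, columns = items (hypothetical data, even number of items)
items = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
])
k = items.shape[1]
split_half_rs = []
for half in combinations(range(k), k // 2):        # Steps 1-4: every possible two-way split
    other = [i for i in range(k) if i not in half] # (each split appears twice; the average is unaffected)
    a = items[:, list(half)].sum(axis=1)           # total score for one half
    b = items[:, other].sum(axis=1)                # total score for the other half
    split_half_rs.append(np.corrcoef(a, b)[0, 1])  # correlation between the halves
mean_r = np.mean(split_half_rs)                    # Step 5: average over all splits
alpha_estimate = 2 * mean_r / (1 + mean_r)         # Step 6: Spearman-Brown correction
print(round(alpha_estimate, 2))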

What estimate of reliability to use when? Which measure of reliability to use depends on the circumstances (ideally, calculate as many estimates of reliability as possible).

The following situations can affect which reliability estimates you can use: 1. Homogeneity/heterogeneity of the test 2. Static vs dynamic characteristics 3. Restriction of range/variance. 4. Speed tests vs power tests 5. Criterion-referenced tests

True or False 0.70 is usually considered the typical minimum threshold for reliability.

True

True or False According to Classical Test Theory, if a test has very high reliability then the measurement error must be very low

True

True or False As a part of the process of calculating Cronbach's alpha, you have to average the correlations derived from all possible ways of splitting the test into two.

True

True or False As part of the process of calculating Cronbach's alpha, you have to adjust for the fact you've halved the test by applying the Spearman-Brown formula.

True

True or False Imagine you create an ability test which involves an examiner making a number of ratings of an individual's acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of -0.98. This means one examiner is giving the opposite rating to the other.

True

Test-retest reliability involves giving the same test twice. This might be a problem

Use parallel versions of the same test to get around this (different but equivalent stimuli). If doing this, it's a good idea to counterbalance the order of the two versions.

The Spearman-Brown prediction formula

We can estimate how the reliability would change if the test were shortened or lengthened using the Spearman-Brown formula: rsb = (n x rxx) / (1 + (n - 1) x rxx), where n is the factor by which the test length changes and rxx is the current reliability.
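
Worked example (illustrative numbers): doubling a test's length (n = 2) when its current reliability is rxx = 0.60 predicts rsb = (2 x 0.60) / (1 + (2 - 1) x 0.60) = 1.20 / 1.60 = 0.75, an improvement but not double, which is why the earlier True/False item about doubling reliability is False.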

For inter-rater reliability, when might we want to examine the means and SDs of two raters' ratings in addition to the correlation between them?

When it is a criterion-referenced test (i.e. absolute values matter, e.g. tutors marking assignments), then we might want to examine means and SDs too.

X = T + E

X - Observed Score, or total test score variation. T - True Score (the 'real' score), or true test score variation. E - Errors of Measurement (random variability in the test score data that is unrelated to the true score)

What is content sampling?

a small selection of all the possible questions

Higher reliability

If a person took the same test multiple times, we'd expect their scores to be less spread out due to measurement error

Lower reliability

If a person took the same test multiple times, we'd expect their scores to be more spread out due to measurement error

Hand-calculating Cronbach's alpha - Step 6: the SPEARMAN-BROWN formula

rsb = (2 x rxx) / (1 + rxx), where rxx is the average split-half correlation
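
For example (made-up number): if the average split-half correlation is 0.70, then rsb = (2 x 0.70) / (1 + 0.70) = 1.40 / 1.70 ≈ 0.82.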

What is domain sampling?

tests are constructed by randomly selecting a specified number of measures from a homogeneous, infinitely large pool. An item domain is a well-defined population of items from which one or more test forms may be constructed by selection of a sample of items from this population

