PSYC3020 Lecture 3 - Reliability
What is true score theory?
- Same as Classical Test Theory - it is the idea that every measurement we take (measured/observed test score) can be decomposed into two parts: 1. The true score (the underlying thing that our test is measuring) 2. Measurement error (everything captured by our actual test score that ISN'T the underlying thing our test is measuring).
What is reliability in terms of the relationship between true and total test score variability?
- In Classical Test Theory, reliability is the ratio between the TRUE VARIABILITY (the hypothetical distribution of test scores in a sample if there were no measurement error) and the TOTAL VARIABILITY (the actual distribution of test scores, which includes error). - Reliability is the proportion of the egg that is the yolk. LOWER MEASUREMENT ERROR = HIGHER RELIABILITY.
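A minimal sketch (my own, not from the lecture; the numbers and variable names are made-up assumptions) showing the ratio idea with simulated scores:

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=10_000)    # hypothetical error-free ("true") scores
errors = rng.normal(0, 5, size=10_000)            # random measurement error
observed = true_scores + errors                   # X = T + E

reliability = true_scores.var() / observed.var()  # true variability / total variability ("yolk / whole egg")
print(round(reliability, 2))                      # ~0.90 here: small error variance -> high reliability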
List the various sources of measurement error
1. Test construction e.g. domain sampling/item sampling/content sampling 2. Test administration e.g. distractions during test, fatigue, Red Bull poisoning, invigilator demeanour 3. Test scoring e.g. biased examiners, ambiguous scoring guidelines, technical errors 4. Other influences such as self-efficacy, motivational factors etc.
Law of Large Numbers
A large and diverse sample of evidence for someone's past behaviour is likely to be a good predictor of their future behaviour. A single example of someone's past behaviour will probably be a poor predictor of their future behaviour.
What is item/content sampling?
Also referred to as content sampling: the variety of the subject matter contained in the items; frequently referred to in the context of the variation between individual test items within a test, or between the test items of two or more tests.
Internal consistency
The average correlation between the items on your scale. If all the items on a scale are supposed to be measuring the same thing, do individuals give consistent responses across items? Can also be called INTER-ITEM consistency or INTERNAL COHERENCE.
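A small illustration (my own; the 4-item data are invented) of the average inter-item correlation:

import numpy as np

items = np.array([[4, 5, 4, 5],
                  [2, 1, 2, 2],
                  [3, 3, 4, 3],
                  [5, 4, 5, 4],
                  [1, 2, 1, 2]])           # rows = respondents, columns = scale items

r = np.corrcoef(items, rowvar=False)       # 4 x 4 inter-item correlation matrix
pairs = r[np.triu_indices_from(r, k=1)]    # each unique item pair once
print(pairs.mean())                        # average inter-item correlation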
Imagine you create an ability test which involves an examiner making a number of ratings of an individual's thumb-rolling skill. You test the inter-rater reliability using two examiners and obtain a correlation of 0.87. What does this mean?
A correlation of 0.87 between the two examiners' ratings means their ratings are highly similar: individuals rated highly by one examiner tend to be rated highly by the other.
What is classical test theory?
Classical Test Theory is the traditional conceptual basis of psychometrics - it's also known as "True Score Theory" - it is the idea that every measurement we take (measured/observed test score) can be decomposed into two parts: 1. The true score (the underlying thing that our test is measuring) 2. Measurement error (everything captured by our actual test score that ISN'T the underlying thing our test is measuring).
True or False According to Classical Test Theory, if a test has very high reliability then the measurement error must be very high
False
True or False 0.50 is usually considered the typical minimum threshold for reliability
False
True or False 0.90 is usually considered the typical minimum threshold for reliability
False
True or False If we double the number of items in a test, then the SB formula predicts that we should double its reliability.
False
True or False Classical Test Theory can be described by the formula X=T/E (where X is the observed score, T is the true score, and E is the measurement error)
False (X=T+E not X=T/E)
True or False True Score Theory involves conceptualising test score variability as comprising true test score variation and total test score variation
False (comprises true test score variation and measurement error)
True or False Reliability, as conceptualised in Classical Test Theory, is total test score variation minus measurement error
False (it is the ratio of true variability to total (observed) variability)
True or False As part of the process of calculating Cronbach's alpha, you have to adjust for the homogeneity of the items by applying the Spearman-Brown formula.
False - the adjustment is not for homogeneity; we are adjusting for having chopped the test in half.
True or False Imagine you create an ability test which involves an examiner making a number of ratings of an individual's acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of .88. This means one examiner's rating is independent of the other's rating.
False - independent ratings would mean a correlation near zero, not .88.
True or False As part of the process of calculating Cronbach's alpha, you have to multiply the correlations derived from all possible ways of splitting the test into two
False - you work out the correlation for each possible split into halves, then average across all the possibilities (you don't multiply them).
True or False The fact that we cannot ask students everything about the course in the quizzes will decrease the proportion of the observed score that can be attributed to the true score (assuming the quiz marks are supposed to reflect students' overall PSYC3020 knowledge)
True - limited item/content sampling is a source of measurement error: scores will vary depending on which questions happen to be asked, so a smaller proportion of the observed score reflects the true score.
Types of reliability evaluation: Internal consistency
How much the item scores in a test correlate with one another on average. Source of error: consistency of items within the same test (do they all measure the same thing?)
Types of reliability evaluation: Inter-rater reliability
If a test involves an examiner making a rating - get two of them to do the rating independently and see how much their ratings correlate (also compare means and SDs of ratings if appropriate). Source of error: Different observers / examiners / raters recording outcomes.
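A hedged sketch (my own; the two raters' scores are invented) of checking inter-rater reliability with a correlation plus means and SDs:

import numpy as np

rater_a = np.array([7, 5, 8, 6, 9, 4])      # examiner A's ratings of six people
rater_b = np.array([6, 5, 9, 6, 8, 3])      # examiner B's independent ratings of the same people

print(np.corrcoef(rater_a, rater_b)[0, 1])  # high r -> the raters rank people similarly
print(rater_a.mean(), rater_a.std(ddof=1))  # compare means/SDs too if absolute agreement matters
print(rater_b.mean(), rater_b.std(ddof=1))  #   (e.g. criterion-referenced marking)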
1. Homogeneity/heterogeneity of the test
If the measure is HETEROGENEOUS then internal consistency might be an inappropriate estimate of reliability (though you could instead look at the internal consistency of each subscale separately). HOMOGENEOUS test - the test items all measure the same thing (unidimensional). HETEROGENEOUS test - more than one independent thing is being measured (i.e. there are subscales that don't intercorrelate highly).
Types of reliability evaluation: Parallel-forms reliability
If people do two different versions of the same test, how much do their scores on the two versions correlate? Source of error: Item sampling (e.g. different items used to assess the same attribute)
Types of reliability evaluation: Test-retest reliability
If people sit the same test twice, how much do their scores correlate between the two sittings? Source of error: Time sampling (does something that should be the same over time actually vary because of error?)
3. Restriction of range/variance
If the scores in a sample are inappropriately restricted in the amount they can vary, then this will affect the correlation (and ALL our reliability estimates are based on correlations). This means we have to be careful in interpreting ANY of the reliability estimates (i.e. try to avoid having a restriction in the range of scores).
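A small simulation (my own assumptions, not lecture material) showing how restricting the range shrinks the correlation that a reliability estimate would be based on:

import numpy as np

rng = np.random.default_rng(1)
time1 = rng.normal(50, 10, size=5_000)           # scores at the first sitting
time2 = time1 + rng.normal(0, 5, size=5_000)     # retest scores with some added error

full_r = np.corrcoef(time1, time2)[0, 1]         # test-retest r for the full sample

high_only = time1 > 60                           # keep only high scorers: restricted range
restricted_r = np.corrcoef(time1[high_only], time2[high_only])[0, 1]

print(round(full_r, 2), round(restricted_r, 2))  # the restricted-range r is noticeably lower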
Inter-rater reliability
Inter-rater reliability can be measured by looking at the correlation between scores on the same test by the same people provided by 2 different examiners.
Statistic we can use to evaluate test-retest reliability
The correlation (e.g. Pearson's r) between scores at the two sittings. In Jamovi: click on "Regression" then "Correlation Matrix", and click on the variables to transfer them into the right-hand box. (Don't need Cronbach's alpha.)
5. Criterion-referenced tests
There may be very little variation in people's scores, e.g. on some pass/fail tests virtually everyone passes (e.g. driving certification). This is an example of RESTRICTION OF RANGE. If there's no variation then it is a problem to use any of the reliability estimates, as they are all derived from assessing score variability: no variability means no correlations can be computed.
Cronbach's alpha
A measure of internal consistency.
True or False If a test was unreliable then the true test score variability would only be a small proportion of the actual (total) test score variability
True - reliability is the proportion of the total (observed) score variability that is true score variability, so for an unreliable test that proportion is small.
Parallel Forms reliability
Parallel Forms (or Alternate Forms) reliability is the correlation between scores on 2 versions of the same test by the same people done at the same time. (don't need Cronbach's alpha)
Definition of reliability
Reliability 'refers to the accuracy, dependability, consistency, or repeatability of test results'
Reliability and the number of items in a test
Reliability increases with more items and decreases with fewer items. This is the effect of domain sampling (= item sampling = content sampling = the idea that in any test, you're only testing a sample of what could possibly be tested). With more items, you have more samples of the domain of interest, which means that the test score becomes a better representation of the "total domain" score. More items also makes the test more robust to the effects of occasional atypical responses (e.g. accidentally clicking the wrong thing).
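Worked example (numbers assumed for illustration): a test with reliability rxx = 0.50 that is doubled in length (n = 2) is predicted by the Spearman-Brown formula (see the card below) to have reliability (2 x 0.50)/(1 + (2 - 1) x 0.50) = 0.67, i.e. higher, but not doubled.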
Why can we only estimate the reliability of a test and not measure it directly?
Reliability refers to the degree to which test scores are free from errors of measurement. Because true variance is hypothetical/theoretical, we can't actually measure it (or reliability) directly. Instead we estimate reliability via test-retest, parallel-forms, internal consistency, and/or inter-rater reliability.
Test-retest reliability
Test-retest reliability is the correlation between scores on the same test by the same people done at two different times.
4. Speed tests vs power tests
SPEED test - speed of response. POWER test - level of difficulty of response (e.g. intelligence). For speed tests, internal consistency is not appropriate (because people tend to get all the questions they attempt correct, they just don't have time to attempt all the questions), which gives spurious correlations. For speed tests, use parallel-forms or test-retest reliability instead.
2. Static vs dynamic characteristics
STATIC - the test measures something that is meant to remain the same over time (e.g. intelligence). DYNAMIC - something we expect to change over time (e.g. fatigue, state anxiety). If the characteristic is dynamic, TEST-RETEST reliability is a problem because it assumes the thing being measured stays the same.
How to calculate Cronbach's alpha in Jamovi?
Select FACTOR then RELIABILITY ANALYSIS. Select all items in your scale and move them to the "Items" box.
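For a rough cross-check on the Jamovi output, here is a sketch (my own; the function name and data are assumptions) of Cronbach's alpha computed with the standard variance formula:

import numpy as np

def cronbach_alpha(items):
    # items: 2-D array, rows = respondents, columns = scale items
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total scores
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

scores = [[4, 5, 4, 5], [2, 1, 2, 2], [3, 3, 4, 3], [5, 4, 5, 4], [1, 2, 1, 2]]
print(cronbach_alpha(scores))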
Cronbach's alpha - hand calculations
Step 1: Split the questionnaire in half. Step 2: Calculate the total score for each half. Step 3: Work out the correlation between the total scores for the two halves. Step 4: Repeat steps 1-3 for all possible two-way splits of the questionnaire. Step 5: Work out the average of all possible split-half correlations. Step 6: Adjust the correlation to account for the fact that you've shortened (halved) the test by applying a correction called the SPEARMAN-BROWN formula.
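A rough sketch (my own; the item data are invented) that mirrors the steps above in code; note that with combinations() each split appears twice, which does not change the average:

from itertools import combinations
import numpy as np

items = np.array([[4, 5, 4, 5], [2, 1, 2, 2], [3, 3, 4, 3],
                  [5, 4, 5, 4], [1, 2, 1, 2]])     # rows = people, columns = items
k = items.shape[1]

split_half_rs = []
for half_a in combinations(range(k), k // 2):      # Steps 1 & 4: every two-way split
    half_b = [i for i in range(k) if i not in half_a]
    a = items[:, list(half_a)].sum(axis=1)         # Step 2: total score for each half
    b = items[:, half_b].sum(axis=1)
    split_half_rs.append(np.corrcoef(a, b)[0, 1])  # Step 3: correlate the two halves

r_bar = np.mean(split_half_rs)                     # Step 5: average all split-half correlations
print(2 * r_bar / (1 + r_bar))                     # Step 6: Spearman-Brown correction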
What estimate of reliability to use when? Which measure of reliability to use depends on the circumstances (ideally, calculate as many estimates of reliability as possible).
The following situations can affect which reliability estimates you can use: 1. Homogeneity/heterogeneity of the test 2. Static vs dynamic characteristics 3. Restriction of range/variance. 4. Speed tests vs power tests 5. Criterion-referenced tests
True or False 0.70 is usually considered the typical minimum threshold for reliability.
True
True or False According to Classical Test Theory, if a test has very high reliability then the measurement error must be very low
True
True or False As a part of the process of calculating Cronbach's alpha, you have to average the correlations derived from all possible ways of splitting the test into two.
True
True or False As part of the process of calculating Cronbach's alpha, you have to adjust for the fact you've halved the test by applying the Spearman-Brown formula.
True
True or False Imagine you create an ability test which involves an examiner making a number of ratings of an individual's acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of -0.98. This means one examiner is giving the opposite rating to the other.
True
Test-retest reliability involves giving the same test twice. This might be a problem (e.g. people may remember the items or improve with practice).
Use parallel versions of the same test to get around this (different but equivalent stimuli). If doing this, it is a good idea to counterbalance the order of the two versions.
The Spearman-Brown prediction formula
We can estimate how the reliability would change if the test were shortened or lengthened using the Spearman-Brown formula: rsb = (n x rxx) / (1 + (n - 1) x rxx), where rxx is the current reliability and n is the factor by which the number of items changes.
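A minimal sketch of the formula above (the function name and the example reliabilities are my own assumptions):

def spearman_brown(rxx, n):
    # rxx = current reliability, n = factor by which the test length changes
    return (n * rxx) / (1 + (n - 1) * rxx)

print(spearman_brown(0.70, 2))    # doubling the test: ~0.82 (not 1.40)
print(spearman_brown(0.70, 0.5))  # halving the test:  ~0.54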
For inter-rater reliability, when might we want to examine the means and SDs of two raters' ratings in addition to the correlation between them?
When it is a criterion referenced test (i.e. absolute values matter, e.g. tutors marking assignments), then we might want to examine means and SDs too.
X = T + E
X - Observed Score, or total test score variation. T - True Score (the 'real' score), or true test score variation. E - Errors of Measurement (random variability in the test score data that is unrelated to the true score).
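Worked example (numbers assumed): if a student's true score is T = 70 and random error on the day is E = +3, the observed score is X = 70 + 3 = 73; at another sitting an error of E = -2 would give X = 68, even though T has not changed.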
What is content sampling?
a small selection of all the possible questions
Higher reliability
if a person took the same test multiple times, we'd expect their scores to be less spread out due to measurement error
Lower reliability
if a person took the same test multiple times, we'd expect their scores to be more spread out due to measurement error
Hand-calculating Cronbach's alpha, Step 6: the SPEARMAN-BROWN formula
rsb = 2rxx / (1 + rxx)
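Worked example (numbers assumed): if the average split-half correlation is rxx = 0.60, then rsb = (2 x 0.60)/(1 + 0.60) = 0.75.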
What is domain sampling?
Tests are constructed by randomly selecting a specified number of measures from a homogeneous, infinitely large pool. An item domain is a well-defined population of items from which one or more test forms may be constructed by selecting a sample of items from this population.