PSY3041 Week 2 Reliability


KR20 formula explanation

*Another method for evaluating reliability within a single test administration, without splitting it in half. *Simultaneously considers all possible ways of splitting a test's items. *Used for tests with yes/no, right/wrong answers. *To have non-zero reliability, the variance for the total test score must be greater than the sum of the variances for the individual items; this only happens when the items are measuring the same trait. *Disadvantages: not appropriate for evaluating internal consistency in some cases (i.e., tests with no 'right' or 'wrong' answers).
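
A minimal sketch of the KR20 calculation for dichotomous (0/1) items. The response data are made up purely for illustration, and the population variance formula is assumed for the total scores.

```python
def kr20(item_scores):
    """item_scores: list of test-takers, each a list of 0/1 item responses."""
    n_people = len(item_scores)
    k = len(item_scores[0])                          # number of items

    # Proportion passing (p) and failing (q = 1 - p) for each item.
    p = [sum(person[i] for person in item_scores) / n_people for i in range(k)]
    item_var_sum = sum(pi * (1 - pi) for pi in p)    # sum of p*q across items

    # Variance of the total test scores (population formula).
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n_people
    total_var = sum((t - mean_total) ** 2 for t in totals) / n_people

    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 people x 4 right/wrong items.
data = [[1, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1]]
print(round(kr20(data), 2))
```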

Example of classical test theory

*Anxiety score on test = anxiety + error. *Eg: if you're having a really bad day, you might have really high levels of anxiety (higher than usual). Or, the test questions are ambiguous, or the room was too hot, etc.

Application of reliability and validity

*Diagnosis (the initial test used may not be valid). *Treatment (recommended based on test results/ monitored using tests). *Drawing conclusions (important clinically and in research).

Importance of reliability

*If a measure is not reliable, we can't trust what it tells us.

Correction for attenuation

*Potential correlations are attenuated (diminished/reduced) by measurement error. *Some methods allow one to estimate what the correlation between two measures would have been if they had not been measured with error, and then "correct" for attenuation in correlations caused by measurement error.
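
A small sketch of the standard attenuation correction, which divides the observed correlation by the square root of the product of the two reliabilities. The numbers here are hypothetical.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """r_xy: observed correlation; r_xx, r_yy: reliabilities of the two measures."""
    return r_xy / sqrt(r_xx * r_yy)

# Observed correlation .30 between two measures with reliabilities .70 and .80:
print(round(correct_for_attenuation(0.30, 0.70, 0.80), 2))  # -> 0.40
```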

Reliability vs. validity

*Tests can be reliable without being valid. *Eg: you can keep getting the same result but it's not actually accurately measuring what you want to measure. *But, tests cannot be valid without being reliable.

Summary

*Tests that are relatively free of measurement error are considered reliable, and tests that contain relatively large measurement error are considered unreliable. *If we are concerned about errors that result from tests being given at different times, then we might consider the test-retest method in which test scores obtained at two different points in time are correlated. *If we are concerned about errors that arise because we have selected a small sample of items to represent a larger conceptual domain, we could use a method that assesses the internal consistency (split-half method, KR20, alpha coefficient). *The standard of reliability for a test depends on the situation in which the test will be used. *In some research settings, bringing a test up to an exceptionally high level of reliability may not be worth the extra time and money. *But, strict standards for reliability are required for a test used to make decisions that will affect people's lives. *When a test has unacceptably low reliability, the test constructor might wish to boost the reliability by: - Increasing the test length. - Using factor analysis to divide the test into homogeneous subgroups of items. *We can sometimes deal with the problem of low reliability by estimating what the correlation between tests would have been if there had been no measurement error; this is called correction for attenuation. *Evaluating the reliability of behavioural observations is also an important challenge. - The percentage of items on which observers agree is not the best index of reliability for these studies because it does not take into consideration how much agreement is to be expected by chance alone. - Correlation-like indexes such as kappa or phi are better suited to estimate reliability in behavioural studies. *Reliability is one of the basic foundations of behavioural research; if a test is not reliable, then one cannot demonstrate that it has any meaning.

Increasing number of items

*The larger the sample, the more likely that the test will represent the true characteristic. *In the domain sampling model, the reliability of a test increases as the number of items increases. *But, more items = a long and costly process (the test must be re-evaluated after new items are added). *Also can increase fatigue (another source of error). *So, test developers must ask whether an increase in reliability is worth the extra time, effort and expense required. *Can use the Spearman-Brown prophecy formula.

Charles Spearman

*This theorist was responsible for advanced development of reliability assessment in 1904. *Put together the concepts of sampling error and the product-moment correlation.

Classical test theory - reliability coefficients

*This theory conceptualises reliability as the ratio of true score variance to the variance of observed scores. *Reliability coefficient scores range from 0 to 1. *Scores closer to 1 represent better reliability (higher = better). *Calculating reliability coefficients relies heavily on the concepts of correlation and regression. *1.0 - reliability ratio = percentage of variation attributable to random error. *Eg: if you are given a test that will be used to select people for a job, and the reliability of the test is .40, this means that 40% of the variation/difference among applicants is explained by real differences and 60% is attributed to random or chance factors.

Domain sampling model explanation

*This theory states that a person's true score could be obtained by having them respond to all items in the universe of items; but this is too difficult. *We can only see responses to a sample of items. *Eg: anxiety is a big concept, has lots of cognitions, emotions, behaviours attached, and when we make up a questionnaire, we're only sampling a few items that we think are relevant or that we want to pick up (we're just sampling a small number of behaviours). *So there is a problem using only a sample of items to represent a construct. *Each item in a test is an independent sample of the trait/ability being measured. *As a sample gets larger, it represents the domain more and more accurately (more items = more reliability). *Eg: because we can't ask someone to spell the whole dictionary (that would be their T score), we choose a sample of items (i.e., words) rather than the entire domain (i.e., the dictionary). For the best reliability, one should draw several different lists of words randomly from the dictionary and consider each sample to be an unbiased test; then find the correlation between each test and each other test.
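
A toy simulation of the domain sampling idea, under assumed numbers: each "test" is a fresh random sample of items from a large domain, each person has a true probability of getting any item right, and longer samples correlate more strongly with each other (i.e., are more reliable).

```python
import random
random.seed(0)

N_PEOPLE = 200

# True ability = probability of getting any single item right (made-up values).
abilities = [random.uniform(0.2, 0.9) for _ in range(N_PEOPLE)]

def take_test(n_items):
    """Score every person on a fresh random sample of n_items items."""
    return [sum(random.random() < a for _ in range(n_items)) for a in abilities]

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

for n_items in (5, 20, 80):
    r = correlation(take_test(n_items), take_test(n_items))
    print(f"{n_items} items per test: r = {r:.2f}")   # r rises as the sample of items grows
```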

Item response theory explanation

*This theory turns away from classical test theory, which requires that exactly the same test items be administered to every individual. *Items are graded in difficulty: the first item is slightly easier, then it gets harder, and when it's too hard, you stop the test, or move back to the area where the person is getting some right/some wrong. This level of ability is then intensively sampled. *Eg: intelligence testing. *Eg: in measuring depression - first you ask if the individual is feeling sad, then you move on to hopelessness, then you move on to suicidality. *Advantage: a more reliable estimate of ability is obtained with a shorter test (fewer items). *Disadvantages: requires a bank of items that have been systematically evaluated for difficulty level, complex computer software, and more involved test development.

Split-half method explanation

*Way of measuring reliability: psychologists evaluate the internal consistency of a test by dividing it into subcomponents (splitting it into halves) and then correlating the two halves. *Best to halve the test randomly or with the odd-even system. *If scores on two half tests from one single administration are highly correlated, scores on two whole tests from separate administrations should also be highly correlated. *Useful in overcoming logistical difficulties of test-retest reliability. *But finding the correlation between the two halves to estimate reliability of the test isn't so accurate, because each half is less reliable than the whole test. *So we use the Spearman-Brown formula, or Cronbach's alpha if the two halves have unequal variances. *Estimates of reliability based on this method are smaller than actual reliability scores due to the use of a smaller number of items in the calculation (i.e., you're not measuring as many things). *Disadvantages: arbitrary decision about how to split the test; scoring each half separately creates additional work. *LOW EQUIVALENCE = LOW SPLIT-HALF RELIABILITY.
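
A sketch of the split-half procedure on simulated data: split the items odd/even, correlate the two half-scores, then step the half-test correlation up to full-test length with the Spearman-Brown formula. The data-generating numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=50)                          # one trait, 50 people
scores = ability[:, None] + rng.normal(0, 1, size=(50, 10))  # 10 items = trait + noise

odd_half = scores[:, 0::2].sum(axis=1)      # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)     # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)        # Spearman-Brown correction to full length

print(f"half-test r = {r_half:.2f}, corrected split-half reliability = {r_full:.2f}")
```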

Parallel forms method (item sampling) explanation

*Way of measuring test reliability - two forms of a test use different items, but the rules used to select the items of a particular difficulty level are the same. *Two forms are matched for CONTENT and DIFFICULTY. *A relatively stable construct is still required. *The two forms are given to the same group of people; still looking for strong correlations between the two to indicate high reliability. *Also influenced by changes between testing times (eg: learning, fatigue). *Additional source of error variance from item sampling; because there are two different tests, the items aren't identical. *Provides one of the most rigorous reliability assessments, but occurs less often than desirable in practice (because it is burdensome to develop two forms of the same test, psychologists often estimate reliability from one single group of items; one test form).

Test-retest method explanation

*Way of testing reliability - involves the same test being administered to the same group twice (at two different time points) and then finding the correlation between the scores from the two administrations. *You may not get identical scores due to practice effects, maturation or treatment effects. *Look for strong correlations between the two scores (= high reliability). *THIS METHOD ONLY WORKS ON STABLE, UNCHANGING CONSTRUCTS! *Eg: appropriate for extraversion, intelligence. *Eg: inappropriate for state anxiety, baby weight, Rorschach test.
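
A minimal sketch: test-retest reliability is simply the correlation between the scores from the two administrations. The scores below are hypothetical.

```python
import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])   # first administration
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])  # second administration

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability = {r_test_retest:.2f}")
```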

Sources of error

1. Item selection. 2. Test administration (eg: environmental conditions). 3. Test scoring (eg: with projective tests, essays). 4. Systematic measurement error. 5. Chance factors.

Maximising test-retest reliability

1. Measure stable construct. 2. Do not use any interventions in between testing. 3. Have shorter time between testing.

Three estimations of test reliability

1. Test-retest method. 2. Parallel forms method. 3. Internal consistency method.

Cronbach's coefficient alpha

A generalised reliability coefficient for scoring systems that are graded for each item (i.e., Likert: agree to disagree); tells us the INTERNAL CONSISTENCY of a test. *Calculates the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. *Ranges from 0 (no similarity) to 1 (perfectly identical). *.70 - .80 is acceptable as good reliability. *More than .90 may indicate redundancy in items (waste of time refining measures to get >.90). *But in clinical research >.95 is needed. *Can tell you if the test has substantial reliability, but can't tell you if the test is unreliable. *Disadvantage: skewness can affect the average correlation among items. *Still the most commonly used reliability index.
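
A sketch of coefficient alpha on simulated Likert-type data, using the common computational form alpha = (k / (k - 1)) × (1 - sum of item variances / variance of total scores). The data-generating numbers are assumptions for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = people, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
trait = rng.normal(0, 1, size=100)
data = trait[:, None] + rng.normal(0, 1, size=(100, 8))   # 8 items tapping one trait
print(round(cronbach_alpha(data), 2))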

Pearson correlation coefficient

A method used to estimate reliability.

Factor and item analysis

A popular method for dealing with the situation in which a test apparently measures several different characteristics. *Used to divide items into subgroups, each one internally consistent but not related to one another. *Very useful in test construction process. *Leave items out that do not measure the given construct, or that measure more than one characteristic. *Tests are most reliable if they are unidimensional.

Difference score

A score created by subtracting one test score from another. *Eg: the difference between performances at two points in time (before and after a special training program). *Comparisons must be made in Z units (standardised). *Reliability of a difference score is expected to be LOWER than the reliability of either score on which it is based. *If two tests measure exactly the same trait, the difference score should have a reliability of 0. *Low reliability of a difference score is concerning because it can't be depended on for interpreting patterns. *Eg: it's hard to conclude that a patient is more schizophrenic than depressed just because their schizophrenia score on the MMPI is higher than their depression score.
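
A sketch of one common formula (stated for standardised scores) for the reliability of a difference score, given each test's reliability and their intercorrelation. The numbers are hypothetical; the second line illustrates the "reliability of 0" point above.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """r_xx, r_yy: reliabilities of the two tests; r_xy: correlation between them."""
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)

# Two tests each with reliability .80 that correlate .60 with each other:
print(round(difference_score_reliability(0.80, 0.80, 0.60), 2))  # -> 0.5 (lower than .80)

# If the correlation between the tests is as high as their reliabilities
# (they measure exactly the same trait), the difference score has reliability 0:
print(round(difference_score_reliability(0.80, 0.80, 0.80), 2))  # -> 0.0
```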

Standard error of measurement

Allows estimation of the precision of a specific (individual) test score; the degree to which a test provides inaccurate readings. *The larger the SEM, the less certain we are that the test represents the true score. *Bigger SEM = less accuracy of attribute measurement. *Smaller SEM = an individual's observed score is probably close to their true score. *Used to create confidence intervals. *Eg: IQ = 100, CI (95%) = 90-110. *Larger SEM = larger confidence interval. *Most useful index of reliability for the interpretation of individual scores. SUMMARY: SEM creates an interval around an observed score. The WIDER the interval, the LOWER the reliability of the score. Using the SEM, we can say that we are 95% confident that a person's true score falls between these two values.
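
A sketch of the SEM and a 95% confidence interval around an observed score, using SEM = SD × √(1 - reliability). The IQ-style numbers (SD = 15, reliability = .89) are assumptions chosen to roughly reproduce the 90-110 interval in the example above.

```python
from math import sqrt

def sem(sd, reliability):
    return sd * sqrt(1 - reliability)

observed, sd, reliability = 100, 15, 0.89
error = sem(sd, reliability)                        # about 5 IQ points
low, high = observed - 1.96 * error, observed + 1.96 * error
print(f"SEM = {error:.1f}, 95% CI = {low:.0f} to {high:.0f}")
```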

Confidence intervals

Because we never know if an observed score is the "true" score, we can form intervals around the observed score and use statistical procedures to estimate the probability that the true score falls within a certain interval. *Common intervals: 68%, 95%, 99%. *Eg: even if we don't know the true score for a person who received 106, we can be 95% confident that their true score falls between 96.9 and 115.1. *Wider confidence intervals limit our ability to make precise statements.

Kappa statistic

Best method for assessing level of agreement among several observers. *Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement. *1 = perfect agreement. *-1 = less agreement than expected from chance alone. *> .75 = excellent agreement. *.40 - .75 = fair agreement. *< .40 = poor agreement.
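
A sketch of kappa for two observers coding the same events, using a hypothetical 2x2 agreement table: kappa = (observed agreement - chance agreement) / (1 - chance agreement).

```python
# Counts of yes/no codes from two observers (made-up data):
#                  observer B: yes   observer B: no
# observer A: yes        20                5
# observer A: no         10               15
table = [[20, 5],
         [10, 15]]

n = sum(sum(row) for row in table)
observed_agreement = (table[0][0] + table[1][1]) / n

# Chance agreement: product of the marginal proportions, summed over categories.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
chance_agreement = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(kappa, 2))   # -> 0.4, "fair agreement" by the cut-offs above
```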

Stability over time

Central problem is that the interpretation of individual scores changes when a test is administered on more than one occasion.

Standard error of measurement in classical theory

Classical test theory uses the standard deviation of errors as the basic measure of error. *This tells us on average how much a score varies from the true score. *Estimated from the standard deviation of the observed scores and the test's reliability: SEM = SD × √(1 - r).

Equivalent forms reliability

Comparing performance on one form of a test vs. another (same as parallel forms).

Parallel forms method (item sampling) definition

Comparing two equivalent forms of a test that measure the same attribute.

Domain sampling model definition

Considers the problems created by using a limited number of items (eg: a sample of items) to represent a larger and more complicated construct (central concept of classical test theory). *Thus, a domain is defined that represents a single trait or ability, and each item is an individual sample of this general trait/ability. —> If the items don't measure the same characteristic, the test is not internally consistent (i.e., reliable).

Test-retest method definition

Consistency of test results when administered on different occasions.

Split-half method definition

Dividing a test into halves that are scored separately, and results of one half are compared with results of the other. *Used to measure internal consistency.

Coefficient omega

Estimates the extent to which all items measure the same underlying trait (used to correct skewness of Cronbach's alpha).

Discriminability analysis

Examining the correlation between each item and the total score for the test. *When the correlation between the performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test. *Or, the item is so easy/ so hard that people don't differ in their response to it. *Thus, low correlation indicates that this item drags down the estimate of reliability of the whole test, and should be excluded.
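
A sketch of a simple item-total (discriminability) analysis on simulated data: correlate each item with the total score and flag items with low correlations as candidates for removal. The cut-off of .30 and the data-generating numbers are assumptions for illustration; this version uses the uncorrected item-total correlation (the item is left in the total).

```python
import numpy as np

rng = np.random.default_rng(2)
trait = rng.normal(0, 1, size=100)
items = trait[:, None] + rng.normal(0, 1, size=(100, 5))       # 5 items tapping the trait
items = np.column_stack([items, rng.normal(0, 1, size=100)])   # a 6th, unrelated item

totals = items.sum(axis=1)
for i in range(items.shape[1]):
    r = np.corrcoef(items[:, i], totals)[0, 1]
    flag = "  <- candidate for removal" if r < 0.30 else ""
    print(f"item {i + 1}: item-total r = {r:.2f}{flag}")
```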

True score

Factors that contribute to consistency (stable attributes under examination). *To find this, one must find the mean of the observations from repeated applications. *It's the middle vertical line in the distribution.

Errors of measurement

Factors that contribute to inconsistency (characteristics of the test taker, test or situation that have nothing to do with the attribute being tested but still affect the scores). *There will always be some inaccuracy in our measurements, but this does not imply that a mistake has been made. *These inaccuracies are random, and the distribution is bell-shaped. *The centre of the distribution should represent the true score, and the dispersion around it is the sampling error. *The more dispersion, the more error. *The standard deviation of the distribution tells us about the magnitude of the measurement error. *Distribution of random errors is assumed to be the same for all people.

Systematic measurement error

If a test consistently taps into something other than the attribute being tested.

Test score variance

Item variances + covariances between items.

Unidimensional

One factor should account for considerably more of the variance than any other factor.

Reliability in behavioural observation studies

Psychologists with behavioural orientations usually favour direct observation of behaviour over psychological tests. *Eg: recording the number of times a child kicks another child, to measure their aggression, and then tabulating it. *Problem: many sources of error (as behaviour can't be monitored continuously). *Behavioural observations are FREQUENTLY UNRELIABLE because of discrepancies between true scores and scores recorded by an observer. *Solution: estimate the reliability of the observers (inter-rater reliability), eg: how much they agree. *Percentage agreement is the most common method, but not the best, as a percentage does not consider the level of agreement expected by chance alone. *Percentages should not be mathematically manipulated; rather use Z scores.

Reliability definition

The degree to which a test tool provides consistent results (when measured again and again). *Tests that are relatively free of measurement error. *Depends on the extent to which all items measure one common characteristic. *Estimated from the correlation between an observed score (X) and the true score (T).

Internal consistency

The extent to which a psychological test is homogeneous or heterogeneous. *Looks at whether items in a test measure the same thing or two separate things. *Where Cronbach's alpha comes from.

KR20 formula definition

The formula for calculating the reliability of a test in which the items are DICHOTOMOUS (scored 0 or 1 for right or wrong).

Coefficient alpha

The most general method of finding estimates of reliability through internal consistency. *Most general reliability coefficient, because it describes the variance of items regardless of whether or not they're in the right-wrong format. *Different to KR20 because it expresses variance of items differently.

Difficulty

The percentage of test-takers who pass the item.

Item selection problem

The sample of items chosen on a test may not be equally reflective of every individual's score. *Eg: if a test emphasises obscure material, well-prepared students may do very badly but poorly prepared students may do very well. *For each student there would be a large amount of error.

Spearman-Brown formula

This allows estimation of what the correlation between two halves of a test would have been if each half had been the length of the whole test; increases the estimate of reliability. *Allows estimation of internal consistency if the test was longer or shorter.
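
A sketch of the general Spearman-Brown formula: the estimated reliability if the test were made n times as long (n = 2 recovers the usual split-half correction; n < 1 estimates the effect of shortening). The example values are hypothetical.

```python
def spearman_brown(r, n):
    """r: observed reliability/correlation; n: factor by which test length changes."""
    return (n * r) / (1 + (n - 1) * r)

print(round(spearman_brown(0.60, 2), 2))    # half-test r of .60 -> .75 for the full test
print(round(spearman_brown(0.75, 0.5), 2))  # full-test r of .75 -> .60 for a half-length test
```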

Spearman-Brown prophecy formula

This formula allows one to estimate how many items will have to be added in order to bring a test to an acceptable level of reliability.
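
A sketch of the prophecy use of the formula: solve for how many times longer the test must be to reach a desired reliability, then multiply by the current number of items. The numbers are hypothetical.

```python
from math import ceil

def lengthening_factor(r_current, r_desired):
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

current_items, r_current, r_desired = 20, 0.70, 0.90
n = lengthening_factor(r_current, r_desired)
print(f"factor = {n:.2f}, items needed = {ceil(current_items * n)}")   # ~3.86x, 78 items
```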

KR21

This formula approximates KR20 using the mean test score rather than item-level data. *Assumes all items are of equal difficulty, or that the average difficulty level is 50%.
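
A minimal sketch of the usual KR21 computation from the number of items (k), the mean total score (M) and the variance of total scores; the specific numbers are hypothetical.

```python
def kr21(k, mean_total, var_total):
    return (k / (k - 1)) * (1 - (mean_total * (k - mean_total)) / (k * var_total))

# Hypothetical 20-item test with mean score 14 and total-score variance 16:
print(round(kr21(20, 14, 16), 2))
```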

Classical test theory/ test score theory (Charles Spearman)

This theory says that each person has a true score that would be obtained if there were no errors in measurement, but this never occurs because measurement instruments are imperfect. Thus, the results on a test are a combination of the true score + the errors of measurement. *There is always a difference between a person's observed score and their true ability/characteristic. *Formula: X = T + e. *X = obtained score. *T = true score. *e = errors of measurement. **Smaller random error = higher reliability.
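
A toy simulation of X = T + e under assumed numbers: observed scores are true scores plus random error, and reliability works out to the share of observed-score variance that is true-score variance.

```python
import numpy as np

rng = np.random.default_rng(3)
true_scores = rng.normal(100, 10, size=10_000)      # T
errors = rng.normal(0, 5, size=10_000)              # e (random, mean 0)
observed = true_scores + errors                     # X = T + e

reliability = true_scores.var() / observed.var()    # ratio of true to observed variance
print(round(reliability, 2))                        # about .80 = 100 / (100 + 25)
```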

Item response theory definition

This theory shifts the focus to the individual items themselves, and focuses on the range of item difficulty that helps assess individual ability level.

Covariance

When items are correlated with each other (i.e., measure the same general trait).

Practice effects

When skills improve with practice; when a test-taker scores better the second time because they have sharpened their skills. *Important type of carryover effect.

Carryover effect

When the first testing session influences scores from the second session. *In this case, test-retest correlation usually overestimates true reliability. *But if these effects involve systematic changes, they don't harm reliability; only harmful when they're RANDOM. *Also, if something affects all test-takers equally, no net error occurs. *The closer together the tests, the higher the likelihood of carryover effects; so must select intervals between testing sessions carefully.

Maximising reliability

▪ Clear administration and scoring instructions for test user. ▪ Clear instructions for the test taker. ▪ Unambiguous test items. ▪ Standardised testing environment and procedure. ▪ Reduced time between testing sessions. ▪ Increase assessment length/items. ▪ Test try-out and modification. ▪ Discarding items that decrease reliability (item analysis). ▪ Maximise VALIDITY.

