What is Test Reliability/Precision Part 1

If a test is reliable, is it also valid?

Just because a test has been shown to produce reliable scores does not mean the test is also valid - evidence of reliability/precision does not mean that the inferences a test user makes from scores on the test are correct or that the test is being used properly

Recall that according to classical test theory, observed-score (X) variance is composed of two parts:

Part of the variance in the observed scores will be attributable to variance in the true scores (T), and part will be attributable to variance added by measurement error (E). - therefore, if observed-score variance were equal to true-score variance, there would be no measurement error, and, using the formula, the reliability coefficient would be 1.00 - any time observed-score variance is greater than true-score variance (which is always the case because of the presence of measurement error), the reliability coefficient will be less than 1.00

Are the concepts of internal consistency and homogeneity the same?

They are not the same. Coefficient alpha describes the extent to which questions on a test or subscale are interrelated - homogeneity refers to whether the questions measure the same trait or dimension - it is possible for a test to contain questions that are highly interrelated even though they measure two different dimensions - therefore, a high coefficient alpha is not proof that a test measures only one trait, skill, or dimension

Parallel forms

Two forms of a test that are alike in every way except the questions; used to overcome problems such as practice effects; also referred to as alternate forms. - a term we sometimes use to describe different forms of the same test - strictly, the term parallel forms refers to two tests that have certain identical (and hard to achieve) statistical properties

Reliability coefficient

When referring to the results of the statistical evaluation of reliability, the term reliability coefficient is preferred

Classical test theory

a person's test score (called the observed score) is made up of two independent parts - the first part is a measure of the amount of the attribute that the test is designed to measure. This is known as the person's true score (T) - the second part of an observed test score consists of random errors that occur anytime a person takes a test (E). - it is this random error that causes a person's test score to change from one administration of a test to the next (assuming that his or her true score hasn't changed)

Correlation

a statistical procedure that provides an index of the strength of the relationship between two sets of scores; a statistic that describes the relationship between two distributions of scores - this method of estimating reliability allows us to examine the stability of test scores over time and provides an estimate of the test's reliability/precision

Test-retest method

a test developer gives the same test to the same group of test takers on two different occasions. - the scores from the first and second administrations are then compared using correlation
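A minimal sketch of this computation in Python, using invented scores for eight hypothetical test takers:

```python
# Test-retest sketch: correlate the same group's scores from two
# administrations of the same test. All scores are invented.
from scipy.stats import pearsonr

first_administration = [85, 92, 78, 88, 95, 70, 82, 90]
second_administration = [83, 94, 80, 85, 96, 72, 79, 91]

r, _ = pearsonr(first_administration, second_administration)
print(f"test-retest reliability estimate: r = {r:.2f}")
```

The closer r is to 1.0, the more stable the scores were across the two administrations.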

Is there such thing as perfectly reliable?

although we can measure some things with great precision, no measurement instrument is perfectly reliable or consistent - for example, clocks run slow or fast, even if we measure their errors in microseconds

What about errors made by the person who scores the test?

an individual can make mistakes in scoring, which adds error to test scores, particularly when the scorer must make judgements about whether an answer is right or wrong - when scoring requires making judgements, two or more persons should score the test - the methods we have already discussed pertain to whether the test itself produces consistent scores, but scorer reliability and agreement pertain to how consistent the scorers' judgements are

A true score on a test:

an individual's true score (T) on a test is a value that can never really be known or determined - it represents the score that would be obtained if that individual took the test an infinite number of times and the average score across all the testings were computed - random errors that may occur on any one testing occasion will cancel themselves out over an infinite number of testing occasions - therefore, if we could average all the scores together, the result would represent a score that no longer contained any random error - this average is the true score on the test and represents the amount of the attribute the person actually possesses, without any random measurement error

Example of Alternate forms in testing

can be seen in the development of the Test of Nonverbal Intelligence, fourth edition - an intelligence test designed to assess cognitive ability in populations that have language difficulties due to learning disabilities, speech problems, or other verbal problems that might result from a neurological deficit or developmental disability - after the forms were developed, the test developers assessed alternate-forms reliability by giving the two forms to the same group of subjects in the same testing session - the results demonstrated that the correlation between the test forms (which is the reliability coefficient) across all ages was .81, and the mean score difference between the two forms was one half of a score point

Order effects

changes in test scores resulting from the order in which the tests were taken - to guard against these, half of the test takers may receive Form A first and the other half may receive Form B first

Reliability/Precision

describes the consistency of test scores. - all test scores, just like any other measurement, contain some error. It is this error that affects the reliability, or consistency, of test scores. - when referring to the consistency of test scores in general, the term reliability/precision is preferred - reliability/precision is one of the most important standards for determining how trustworthy data derived from a psychological test are

What does each method produce?

each method produces a numerical reliability coefficient, which enables us to estimate and evaluate the reliability/precision of the test

Classical test theory formula

expresses these ideas by saying that any observed test score (X) is made up of the sum of two elements: a true score (T) and random error (E): X = T + E
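A tiny simulation can make the formula concrete; the means and standard deviations below are arbitrary assumptions, not values from any real test:

```python
# Illustrative simulation of X = T + E: each observed score is a true
# score plus a random error drawn around zero. All parameters invented.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(loc=50, scale=10, size=5)   # T
errors = rng.normal(loc=0, scale=3, size=5)          # E, mean zero
observed = true_scores + errors                      # X = T + E

for t, e, x in zip(true_scores, errors, observed):
    print(f"T = {t:5.1f}   E = {e:+5.1f}   X = {x:5.1f}")
```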

Differences between random error and systematic error proposed by Nunnally

if a chemist uses a thermometer that always reads 2 degrees warmer than the actual temperature, the error that results is systematic, and the chemist can predict the error and take it into account - if the chemist is nearsighted and reads the thermometer with a different amount and direction of inaccuracy each time, the readings will be wrong and the inconsistencies will be unpredictable, or random

Internal Consistency Method

is a measure of how related the items (or groups of items) on the test are to one another - another way to think about this is whether knowing how a person answered one item on the test would help you correctly predict how he or she answered another item on the test - if you can (statistically) do that across the entire test, then the items must have something in common with one another - that commonality is usually related to the fact that they measure a similar attribute, and therefore we say that the test is internally consistent

A yardstick

is a reliable measuring instrument over time because each time it measures an object (e.g., a room), it gives approximately the same answer. - variations in the measurements of the room, perhaps a fraction of an inch from time to time, can be referred to as measurement error - such errors are probably due to random mistakes or inconsistencies of the person using the yardstick, or because the smallest increment on a yardstick is often a quarter of an inch, making finer distinctions difficult - a yardstick also has internal consistency: the first foot on the yardstick is the same length as the second and third feet, and the length of every inch is uniform

How do you define Random Error

is defined as the difference between a person's actual score on a test (the observed score) and the person's true score (T). - because this source of error is random in nature, sometimes a person's observed score will be higher than his or her true score and sometimes the observed score will be lower than his or her true score. - in any single test administration, we can never know whether random error has led to an observed score that is higher or lower than the true score

Reliable test

is one we can trust to measure each person in approximately the same way every time it is used - a test also must be reliable if it is used to measure attributes and compare people, much as a yardstick is used to measure and compare rooms.

A good example of test-retest reliability

is seen in the initial reliability testing of the Personality Assessment Inventory (PAI) - the PAI, developed by Leslie Morey, is used for clinical diagnoses, treatment planning, and screening for clinical psychopathology in adults - researchers administered the PAI twice to 75 normal adults, the second administration following the first by an average of 24 days, and to 80 college students - in each case, the researchers correlated the set of scores from the first administration with the set of scores from the second administration. The two studies yielded similar results, showing acceptable estimates of test-retest reliability for the PAI

Additional distinction between random error and systematic error

is that random error lowers the reliability of a test. Systematic error does not; the test is reliably inaccurate by the same amount each time. This concept will become apparent when we begin calculating reliability/precision using correlation
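One way to see this is that a Pearson correlation is unchanged when a constant is added to every score; a small simulated sketch (invented data):

```python
# Adding a constant (systematic) bias to every score leaves the
# correlation-based reliability estimate unchanged. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
form_a = rng.normal(100, 15, size=200)
form_b = form_a + rng.normal(0, 5, size=200)     # random error only
form_b_biased = form_b + 3                       # plus a constant bias

print(np.corrcoef(form_a, form_b)[0, 1])         # some value below 1
print(np.corrcoef(form_a, form_b_biased)[0, 1])  # exactly the same value
```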

Limitation in using test-retest

is that the test takers may score differently (usually higher) on the test because of practice effects

Danger when using alternate forms?

is that the two forms will not be truly equivalent. - alternate forms are much easier to develop for well-defined characteristics, such as mathematical ability, than for personality traits, such as extroversion - although we check the reliability of alternate forms by administering them at the same time, their practical advantage is that they can also be used as pre- and posttests if desired

What is an even better way to measure internal consistency?

is to compare individuals' scores on all possible ways of splitting the test into halves. - this method compensates for any error introduced by an unintentional lack of equivalence that splitting the test into two halves might create - we can also use KR-20 (for items scored right or wrong) and coefficient alpha (for items with more than two response options)
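For illustration, coefficient alpha can be computed directly from an item-response matrix; the 0/1 responses below are invented, and because the items are scored right/wrong, alpha here coincides with KR-20:

```python
# Hand-rolled coefficient alpha: k/(k-1) * (1 - sum of item variances /
# variance of total scores). Rows are test takers, columns are items.
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
])

k = responses.shape[1]                          # number of items
item_vars = responses.var(axis=0, ddof=1)       # variance of each item
total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"coefficient alpha = {alpha:.2f}")
```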

How do we know which reliability/precision method to use?

it depends on the test itself and the conditions under which the test user plans to administer the test - the test developer should report the reliability method, as well as the number and characteristics of the test takers in the reliability study, along with the associated reliability coefficients - for some tests, such as the PAI, the WCST, and the Bayley Scales, more than one method may be appropriate - each method provides evidence that the test is consistent under certain circumstances

What does scorer reliability help us do?

it ensures that the instructions for scoring are clear and unambiguous so that multiple scorers arrive at the same result

Is it appropriate to calculate an overall estimate of internal consistency?

it is not appropriate to calculate an overall estimate of internal consistency when a test is heterogeneous. - instead, the test developer should calculate and report an estimate of internal consistency for each homogeneous subtest or factor

What are other alternatives to administering an infinite number of tests?

it turns out that making a test longer also reduces the influence of random error on the test score for the same reason - the random error component will be more likely to cancel itself out
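A simulation sketch of this point, with arbitrary parameters: as more items are averaged together, observed scores track true scores more closely:

```python
# Longer tests give random error more chances to cancel out, so the
# correlation between true and observed scores rises with item count.
import numpy as np

rng = np.random.default_rng(2)
n_people = 500
true_scores = rng.normal(0, 1, n_people)

for n_items in (5, 20, 80):
    # each item = true score + independent random error
    items = true_scores[:, None] + rng.normal(0, 1, (n_people, n_items))
    observed = items.mean(axis=1)
    r = np.corrcoef(true_scores, observed)[0, 1]
    print(f"{n_items:3d} items: corr(true, observed) = {r:.2f}")
```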

Heterogeneous test

measuring more than one trait or characteristic - estimates of internal consistency are likely to be lower - for example, a test for people who are applying for the job of accountant may measure knowledge of accounting principles, calculation skills, and ability to use a computer spreadsheet - such a test is heterogeneous because it measures distinct factors of performance for an accountant

Homogeneous test

measuring only one trait or characteristic - estimating reliability using methods of internal consistency is appropriate only for a homogeneous test

Practice effects

occur when test takers benefit from taking the test the first time (practice), which enables them to solve problems more quickly and correctly the second time - therefore, the test-retest method is appropriate only when test takers are not likely to learn something from the first administration that can affect their scores on the second, or when the interval between the two administrations is long enough to prevent practice effects

Alternate-forms methods

psychologists often give two forms of the same test (designed to be as much alike as possible) to the same people - this strategy requires the test developer to create two different forms of the test, referred to as "alternate forms" - again, the sets of scores from the two tests are compared using correlation - the two forms (Forms A and B) are administered as close in time as possible, usually on the same day

In the real world

random measurement error affects each individual's score in an unpredictable and different fashion every time he or she answers a question on a test - you can never predict the impact that the error will have on an individual's observed test score, and it will be different for each person as well; this is the nature of random error - the presence of random error will always cause the variance of a set of scores to increase over what it would be if there were no measurement error

Why does random error reduce the reliability of a test?

reliability is about estimating the proportion of variability in a set of observed test scores that is attributable only to true scores - in classical test theory, reliability is defined as true-score variance divided by total observed-score variance: rxx = σ²T / σ²X, where rxx is the reliability coefficient, σ²T is the true-score variance, and σ²X is the observed-score variance
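A numerical illustration of the formula, with assumed (invented) standard deviations of 10 for true scores and 5 for error, which should yield rxx near 100/125 = .80:

```python
# rxx = true-score variance / observed-score variance, estimated from
# simulated data. With SD(T)=10 and SD(E)=5, rxx ≈ 100/125 = 0.80.
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(50, 10, size=10_000)   # true scores
E = rng.normal(0, 5, size=10_000)     # random error, independent of T
X = T + E                             # observed scores

rxx = T.var() / X.var()
print(f"estimated reliability rxx = {rxx:.2f}")
```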

When using a test that requires making judgements, what is provided with it?

scoring schemes: test manuals provide the explicit instructions necessary for making these scoring judgements - deviation from the scoring instructions, or variation in the interpretation of the instructions, introduces error into the final score

Split-half method

statisticians have developed several methods for measuring the internal consistency of a test - this method divides the test into halves and then compares the set of individual test scores on the first half with the set of individual test scores on the second half - the two halves must be equivalent in length and content for this method to yield an accurate estimate of reliability

An example using scorer reliability

the Wisconsin Card Sorting Test (WCST) is used to assess executive functioning in children and adults - in one reliability study, the test data were scored independently according to instructions given in an early edition of the test manual - the scorers' agreement was measured using a statistical procedure called intraclass correlation, which is appropriate for comparing the responses of more than two raters or more than two sets of scores
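For illustration only, here is a from-scratch sketch of one common two-way intraclass correlation, ICC(2,1); the rating matrix (rows = scored protocols, columns = raters) is invented and is not the WCST data:

```python
# ICC(2,1): two-way random-effects intraclass correlation for single
# ratings, built from an ANOVA-style decomposition of a rating matrix.
import numpy as np

ratings = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
])
n, k = ratings.shape
grand = ratings.mean()

ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # targets
ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # raters
ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols

ms_r = ss_rows / (n - 1)
ms_c = ss_cols / (k - 1)
ms_e = ss_err / ((n - 1) * (k - 1))

icc21 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
print(f"ICC(2,1) = {icc21:.2f}")
```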

Scorer reliability or interscorer agreement

the amount of consistency among scorers' judgements - becomes an important consideration for tests that require decisions by the administrator or scorer

What is the best way to split the test into halves?

the best way to divide the test is to use random assignment to place each question in one half or the other - random assignment is likely to balance errors in the scores that can result from order effects, difficulty, and content - when we use the split-half method to calculate a reliability coefficient, we are in effect correlating the scores on two shorter versions of the test - however, as mentioned earlier, shortening a test decreases its reliability because there will be less opportunity for random measurement error to cancel itself out - therefore, when using the split-half method, we must mathematically adjust the reliability coefficient to compensate for the impact of splitting the test into halves - one way to make this adjustment is the Spearman-Brown formula
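A sketch of the whole procedure on simulated item responses (all data invented); for a half-length split, the Spearman-Brown adjustment is r_full = 2 × r_half / (1 + r_half):

```python
# Split-half reliability with the Spearman-Brown correction: randomly
# assign items to halves, correlate half scores, then adjust upward
# because each half is only half the length of the full test.
import numpy as np

rng = np.random.default_rng(4)
n_people, n_items = 200, 20
ability = rng.normal(0, 1, n_people)
# simulated 0/1 item responses that depend on a common ability
items = (ability[:, None] + rng.normal(0, 1, (n_people, n_items)) > 0).astype(int)

order = rng.permutation(n_items)                 # random assignment
half_a = items[:, order[:n_items // 2]].sum(axis=1)
half_b = items[:, order[n_items // 2:]].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]
r_full = 2 * r_half / (1 + r_half)               # Spearman-Brown
print(f"half-test r = {r_half:.2f}, adjusted full-test r = {r_full:.2f}")
```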

How much time is given between the two test administrations?

the interval between the two administrations of the test may vary from a few hours to several years. - as the interval lengthens, test-retest reliability will decline because the number of opportunities for the test takers or the testing situation to change increases over time - when test developers or researchers report test-retest reliability, they must also state the length of time that elapsed between the two test administrations

How do they cancel out?

the mean, or average, of all the error scores over an infinite number of testings will be zero. This is why random error actually cancels itself out over repeated testings. - two other important characteristics of measurement error are that it is normally distributed and that it is uncorrelated with (or independent of) true scores
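A quick simulated check of these two properties, assuming normally distributed error with mean zero:

```python
# Over many simulated testings, the mean of the random errors is close
# to zero and the errors are uncorrelated with the true scores.
import numpy as np

rng = np.random.default_rng(5)
T = rng.normal(50, 10, size=100_000)   # true scores
E = rng.normal(0, 5, size=100_000)     # random, normally distributed error

print(f"mean of errors    : {E.mean():+.3f}")                  # near 0
print(f"corr(true, error) : {np.corrcoef(T, E)[0, 1]:+.3f}")   # near 0
```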

True score

the score that would be obtained if an individual took a test an infinite number of times and the average score across all the testings was computed

Random error

the unexplained difference between a test taker's true score and the obtained score; error that is nonsystematic and unpredictable, resulting from unknown causes - because this type of error is a random event, sometimes it causes an individual's test score to go up on the second administration, and sometimes it causes it to go down

Systematic error is often difficult to identify

two problems we discuss later in this chapter - practice effects and order effects - can add systematic error as well as random error to test scores. For instance, if test takers learn the answer to a question in the first test administration (practice effect) or can derive the answer from a previous question (order effect), more people will get the question right

What are the assumptions when using test-retest reliability?

when using test-retest reliability, the assumption is that the test takers have not changed between the first administration and the second in terms of the skill or quality measured by the test - the test takers' mood or fatigue might change between the two testings, and the circumstances under which the test is administered, such as the test instructions, lighting, or distractions, must be alike. Any differences in administration or in the individuals themselves will introduce error and reduce reliability/precision

Measurement error

variation or inconsistencies in the measurements yielded by a test or survey - error in measurement can be defined as the difference between a person's observed score and his or her true score

We can never actually know what the true scores on a test are

we can only estimate them using observed scores and that is why we always refer to calculated reliability coefficients as estimates of a test's reliability

When a measurement comes back looking odd, what is something we can do?

we can remeasure to check psychological measurements - such strategies establish evidence of the reliability/precision of test scores

Systematic Error

when a single source of error always increases or decreases the true score by the same amount. For instance, if you know that the scale in your bathroom regularly adds 3 pounds to anyone's weight, you can simply subtract 3 pounds from whatever the scale says to get your true weight. The error your scale makes is predictable and systematic

Intrascorer reliability

whether each clinician was consistent in the way he or she assigned scores from test to test

In a world of perfect scores:

with no measurement error, we would expect everyone's observed scores on the two parallel tests to be exactly the same. In effect, both sets of test scores would simply be a measure of each individual's true score on the construct the test was measuring - if this were the case, the correlation between the two sets of test scores, which we call the reliability coefficient, would be a perfect 1.0, and we would say that the test is perfectly reliable

The formal relationship between reliability precision and random measurement error:

From classical test theory: any score that a person makes on a test (his or her observed score, X) is composed of two components, his or her true score, T, and random measurement error, E: X = T + E - if we gave two parallel forms of the test to the same group of people, we would still not expect everyone to score exactly the same on the second administration as they did on the first, because there will always be some measurement error that influences everyone's scores in a random, unpredictable fashion - if the tests were really measuring the same concepts in the same way, we would expect people's scores to be very similar across the two testing sessions. And the more similar the scores are, the better the reliability/precision of the test
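A simulation sketch of this idea (assumed SDs: 10 for true scores, 5 for error): two parallel forms share each person's true score but receive independent random errors, so their correlation estimates the reliability:

```python
# Two parallel forms: X1 = T + E1 and X2 = T + E2. Their correlation
# approximates var(T)/var(X), here about 100/125 = 0.80.
import numpy as np

rng = np.random.default_rng(6)
T = rng.normal(50, 10, size=10_000)
x1 = T + rng.normal(0, 5, size=10_000)   # Form A
x2 = T + rng.normal(0, 5, size=10_000)   # Form B

r = np.corrcoef(x1, x2)[0, 1]
print(f"corr(Form A, Form B) = {r:.2f}")
```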

Summary of what we learned

Test-retest, alternate forms, and internal consistency are concerned with the test itself. - scorer reliability involves an examination of how consistently the person or persons scored the test - that is why test publishers may need to report multiple reliability coefficients for a test to give the test user a complete picture of the instrument

What are the three categories of reliability coefficients used to evaluate the reliability/ precision of test scores?

The Standards for Educational and Psychological Testing recognize these three: 1. the test-retest method 2. the alternate-forms method 3. internal consistency methods (split-half and coefficient alpha) - plus methods that evaluate scorer reliability or agreement

Psychological tests are measurement instruments

They are no different from yardsticks, speedometers, or thermometers. - a psychological test measures how much the test taker has of whatever skill or quality the test measures. - for instance, a driving test measures how well the test taker drives a car, and a self-esteem test measures whether the test taker's self-esteem is low, high, or average. - the most important attribute of a measurement instrument is its reliability/precision

