PSYCH ASSESS - CH. 5 (RELIABILITY)
Random error
Sometimes referred to as "noise," this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.
false - An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments
T OR F: An estimate of test-retest reliability may be most appropriate in gauging the validity of tests that employ outcome measures such as reaction time or perceptual judgments
TRUE
T OR F: If we are to come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-retest reliability estimate must extend to a consideration of possible intervening factors between test administrations.
false - In everyday conversation, reliability is a synonym for dependability or consistency.
T OR F: In everyday conversation, validity is a synonym for dependability or consistency.
false - External to the test environment in a global sense, the events of the day may also serve as a source of error.
T OR F: It is unlikely that the events of the day may serve as a source of error.
KR-21
The ____________ formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty.
test-retest measure
The _______________ is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait
decision study
The ________________ is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use
standard error of the difference
The _________________ between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores
reliability coefficient
The __________________ helps the test developer build an adequate measuring instrument, and it helps the test user select a suitable test
standard error of measurement
The __________________ is the tool used to estimate or infer the extent to which an observed score deviates from a true score
Spearman-Brown
The ____________________ formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test
split-half reliability
The computation of a coefficient of ____________ generally entails three steps: Step 1. Divide the test into equivalent halves. Step 2. Calculate a Pearson r between scores on the two halves of the test. Step 3. Adjust the half-test reliability using the Spearman-Brown formula (discussed shortly).
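These three steps can be sketched in plain Python. The half-test scores below are hypothetical, and the function names (`pearson_r`, `spearman_brown`) are illustrative rather than from the text:

```python
# Hedged sketch of the three split-half steps; all data are made up.
import statistics

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_brown(r_half, n=2):
    """Adjust a half-test correlation to estimate full-test reliability.
    n is the lengthening factor (2 when recombining two halves)."""
    return (n * r_half) / (1 + (n - 1) * r_half)

# Step 1: divide the test into equivalent halves (here, odd vs. even items)
odd_half  = [10, 12,  9, 14, 11, 13]   # six testtakers' odd-item totals
even_half = [11, 13,  9, 15, 10, 14]   # the same testtakers' even-item totals

# Step 2: calculate a Pearson r between scores on the two halves
r_hh = pearson_r(odd_half, even_half)

# Step 3: adjust the half-test reliability with the Spearman-Brown formula
r_full = spearman_brown(r_hh)
```

Note that the adjusted coefficient is always higher than the half-test correlation whenever that correlation is positive, which is why Step 3 is needed: the uncorrected r describes a test only half as long as the one actually administered.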
coefficient of equivalence.
The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the ______________________
coefficients of generalizability
The influence of particular facets on the test score is represented by ______________________. These coefficients are similar to reliability coefficients in the true score model
homogeneous
The more ______________ a test is, the more inter-item consistency it can be expected to have
reliability
The term _________________ refers to the proportion of the total variance attributed to true variance
parallel forms reliability
The term ___________________ refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
alternate forms reliability
The term ____________________ refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
dichotomous test items
(test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions)
polytomous test items
(test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct).
systematic error
(what type of error) a 12-inch ruler may be found to be, in actuality, a tenth of one inch longer than 12 inches. All of the 12-inch measurements previously taken with that ruler were systematically off by one-tenth of an inch; that is, anything measured to be exactly 12 inches with that ruler was, in reality, 12 and one-tenth inches
reliability coefficient
A ____________________ is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
Spearman-Brown formula
A __________ could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
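As a minimal sketch of this "solved for n" use of the formula — the 20-item length and the .70 and .90 reliabilities are made-up values:

```python
def length_factor_needed(r_now, r_target):
    """Spearman-Brown solved for n: the factor by which a test must be
    lengthened (with equivalent items) to reach a target reliability."""
    return (r_target * (1 - r_now)) / (r_now * (1 - r_target))

# A hypothetical 20-item test with r = .70 that we want to raise to r = .90:
n = length_factor_needed(0.70, 0.90)   # roughly 3.86 times longer
items_needed = round(20 * n)           # roughly 77 items in total
```

As the sentence above cautions, the estimate only holds if the added items match the originals in content and difficulty.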
domain
A _____________ of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test
criterion-referenced test
A ________________ is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective
heterogeneous
A __________________ (or nonhomogeneous) test is composed of items that measure more than one trait.
generalizability study
A __________________ examines how generalizable scores from a particular test are if the test is administered in different situations
dynamic characteristic
A __________________ is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
universe score
According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the __________________
decision study
After the generalizability study is done, Cronbach et al. (1972) recommended that test developers do a ______________, which involves the application of information from the generalizability study
homogeneity; homogeneous
An index of inter-item consistency, in turn, is useful in assessing the _____________ of the test. Tests are said to be _____________ if they contain items that measure a single trait
reliability
And whereas in everyday conversation ______________ always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent
odd-even reliability
Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as ______________________
criterion-referenced tests
Unlike norm-referenced tests, _______________ tend to contain material that has been mastered in hierarchical fashion
reliability
Broadly speaking, in the language of psychometrics ___________________ refers to consistency in measurement.
speed test
By contrast, a ______________ generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
Spearman-Brown formula
By determining the reliability of one half of a test, a test developer can use the _______________ formula to estimate the reliability of a whole test
reliability coefficient
By employing the ___________________ in the formula for the standard error of measurement, the test user now has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score
true variance; error variance.
Variance from true differences is ___________, and variance from irrelevant, random sources is ___________
inter-scorer reliability
Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, ________________ is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
standard error of measurement
We may define the _____________________ as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests
standard error of the difference
Comparisons between scores are made using the ________________________, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
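A sketch of the relationship between the two standard errors, assuming hypothetical values (SD = 15, reliability = .90 for both tests):

```python
# The SD and reliability values below are hypothetical, for illustration only.
def sem(sd, r_xx):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * (1 - r_xx) ** 0.5

def sed(sem1, sem2):
    """Standard error of the difference between two scores: the two
    measurement errors combine, so this exceeds either SEM alone."""
    return (sem1 ** 2 + sem2 ** 2) ** 0.5

s = sem(15, 0.90)      # SEM for a test with SD = 15 and r = .90
diff_se = sed(s, s)    # larger than s, as the card above states
# A difference of roughly 1.96 * diff_se or more between two scores is
# statistically significant at the .05 level.
```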
power test
When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a _______________
coefficient of stability
When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the ______________________
generalizability theory
Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), _____________ is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
Random error
Examples of random error that could conceivably affect test scores range from unanticipated events happening in the immediate vicinity of the test environment to unanticipated physical events happening within the testtaker
confidence interval
Further, the standard error of measurement is useful in establishing what is called a ________________: a range or band of test scores that is likely to contain the true score.
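A sketch of building such a band, assuming a hypothetical test with SD = 15, a reliability coefficient of .91, and an observed score of 100:

```python
# All numeric values here are hypothetical, for illustration only.
def sem(sd, r_xx):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * (1 - r_xx) ** 0.5

observed = 100
s = sem(15, 0.91)                                  # 4.5
lo, hi = observed - 1.96 * s, observed + 1.96 * s  # 95% confidence interval
# The band from about 91.18 to 108.82 is likely to contain the true score.
```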
homogeneity
___________ (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, ____________ is the extent to which items in a scale are unifactorial.
generalizability studies
____________ seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.
heterogeneity
_____________ describes the degree to which a test measures different factors
alternate-forms reliability
In _______________, a reliability estimate is based on the correlation between the two total scores on the two forms.
domain sampling theory
In _______________, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
split-half reliability
In ________________, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman-Brown formula to obtain a reliability estimate of the whole test
coefficient alpha
In contrast to KR-20, which is appropriately used only on tests with dichotomous items, ______________ is appropriate for use on tests containing nondichotomous items
measurement error
In general, the term _______________ refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured
parallel forms
In theory, the means of scores obtained on _______________ correlate equally with the true score. More practically, scores obtained on _____________ correlate equally with other measures.
static characteristic
In this instance, obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate
split-half reliability
It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).
systematic or random
Measurement error, much like error in general, can be categorized as being either _____________ or ____________
Pearson
A _____________ r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating "perfect dissimilarity."
mini-parallel-forms
a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "____________________," with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.
transient error,
a source of error attributable to variations in the testtaker's feelings, moods, or mental state over time.
domain sampling theory
and is better known today in one of its many modified forms as generalizability theory.
Examiner-related variables
are potential sources of error variance. The examiner's physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here.
Alternate forms
are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation "parallel," ________________ of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty
item sampling or content sampling
One source of variance during test construction is __________ or _____________, terms that refer to variation among items within a test as well as to variation among items between tests
testtaker variables
Other potential sources of error variance during test administration are ____________________. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance
assessment
Potential sources of nonsystematic error in such an ____________ situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting.
Average proportional distance
Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach's alpha), the APD is a measure that focuses on the degree of difference that exists between item scores
homogeneous
Recall that a test is said to be ___________ in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be __________ in items.
test-retest reliability
Recall that a _____________ estimate is based on the correlation between the total scores on two administrations of the same test
decision study,
developers examine the usefulness of test scores in helping the test user make decisions.
internal consistency estimate of reliability or as an estimate of inter-item consistency
evaluation of the internal consistency of the test items
sampling error
ex. the extent to which the population of voters in the study actually was representative of voters in the election
false - An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people
false - An estimate of the reliability of a test cannot be obtained without developing an alternate form of the test and without having to administer the test twice to the same people
alternate-forms or parallel-forms
if you have ever wondered whether the two forms of the test were really equivalent, you have wondered about the _____________ or _______________ reliability of the test.
generalizability theory
instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score.
Rasch model
is a reference to an IRT model with very specific assumptions about the underlying distribution.
Random error
is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
Test-retest reliability
is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
standard error of measurement
is an index of the extent to which one individual's scores vary over tests presumed to be parallel. In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained.
split-half reliability
is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Inter-scorer reliability
is often used when coding nonverbal behavior. For example, a researcher who wishes to quantify some aspect of nonverbal behavior, such as depressed mood, would start by composing a checklist of behaviors that constitute depressed mood
KR-20 (Kuder-Richardson formula 20)
is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
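A minimal sketch of the KR-20 computation on a made-up 0/1 response matrix; the formula combines the number of items k, each item's pass proportion p, and the variance of total scores:

```python
# Hedged KR-20 sketch; the response matrix is hypothetical.
from statistics import pvariance

def kr20(items):
    """KR-20 for dichotomous (0/1) items; items is a list of per-person
    response lists."""
    k = len(items[0])                      # number of items
    n = len(items)                         # number of testtakers
    totals = [sum(person) for person in items]
    var_total = pvariance(totals)          # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in items) / n   # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Made-up data: 5 testtakers, 4 right/wrong items
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
r_kr20 = kr20(data)
```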
item response theory
it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it
The standard error of measurement,
it provides an estimate of the amount of error inherent in an observed score or measurement.
Inter-rater consistency
may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy
coefficient alpha
may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula
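Although alpha can be conceptualized that way, in practice it is usually computed directly from item variances rather than by averaging every split-half correlation. A sketch with hypothetical Likert-style responses:

```python
# Hedged coefficient-alpha sketch; the response data are hypothetical.
from statistics import pvariance

def cronbach_alpha(items):
    """Coefficient alpha from item variances; items is a list of per-person
    response lists with any numeric scoring (not just 0/1)."""
    k = len(items[0])
    totals = [sum(person) for person in items]
    item_vars = [pvariance([person[j] for person in items]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Made-up 3-item responses for 4 testtakers
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
]
alpha = cronbach_alpha(responses)
```

On dichotomous data this formula reduces to KR-20, which is why alpha is described as the variant of KR-20 usable with nondichotomous items.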
Parallel forms
of a test exist when, for each form of the test, the means and the variances of observed test scores are equal.
Latent-trait theories
propose models that describe how the latent trait influences performance on each test item. Unlike test scores or true scores, latent traits theoretically can take on values from −∞ to +∞ [negative infinity to positive infinity]
The standard error of measurement,
provides a measure of the precision of an observed test score
systematic error
refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
error
refers to the component of the observed test score that does not have to do with the testtaker's ability
Inter-item consistency
refers to the degree of correlation among all the items on a scale. A measure of _________________ is calculated from a single administration of a single form of a test.
discrimination
signifies the degree to which an item differentiates among people with higher or lower levels of the trait
variance
A statistic useful in describing sources of test score variability is the _____________ (σ²)—the standard deviation squared
true
t or F: Sources of error variance that occur during test administration may influence the testtaker's attention or motivation.
false - Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory
t or f: Of the three types of estimates of reliability, measures of inter-scorer reliability are perhaps the most compatible with domain sampling theory.
true
t or f: A calculated APD of .25 is suggestive of problems with the internal consistency of the test.
false - A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring
t or f: A coefficient of inter-rater reliability, for example, provides information about accuracy as a result of test scoring
true
t or f: A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another
true
t or f: Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.
true
t or f: Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test.
true
t or f: Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit
true
t or f: Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.
false - Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened
t or f: Because the reliability of a test is affected by its length, a formula is burdensome for estimating the reliability of a test that has been shortened or lengthened
true
t or f: Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests
true
t or f: Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability
false - Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test
t or f: Coefficient alpha is widely used as a measure of reliability, in part because it requires multiple administration of the test
false - Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next
t or f: Error related to any of the number of possible variables operative in a testing situation cannot contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next
true
t or f: If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula.
true
t or f: If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.
true
t or f: If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method
false - If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability.
t or f: If the reliability of the original test is relatively low, then it may be practical to increase the number of items to reach an acceptable level of reliability.
false - If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
t or f: If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be higher.
false - In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
t or f: In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have increased the error variance caused by scorer differences
false - In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores
t or f: In practice, the standard error of measurement is most frequently used in the observation of individual test scores
false - In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses
t or f: In some tests of personality, examiners are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses
false - Internal consistency estimates of reliability, such as that obtained by use of the Spearman-Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests
t or f: Internal consistency estimates of reliability, such as that obtained by use of the Spearman-Brown formula, are most appropriate for measuring the reliability of heterogeneous tests and speed tests
true
t or f: It is even conceivable that significant changes in the testtaker's body weight could be a source of error variance.
true
t or f: Manuals for individual intelligence tests tend to be very explicit about scoring criteria, lest examinees' measured intelligence vary as a function of who is doing the testing and scoring.
false - Measures of reliability are estimates, and estimates are subject to error
t or f: Measures of reliability are absolute
false - Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations.
t or f: Reduction in test size for the purpose of reducing test administration time is an illegal practice in certain situations.
true
t or f: Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.
true
t or f: Subjectivity in scoring can even enter into behavioral assessment.
false - Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
t or f: Test heterogeneity is desirable because it allows relatively straightforward test-score interpretation.
false - Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested.
t or f: Testtakers with the same score on a heterogeneous test probably have similar abilities in the area tested.
false - The amount of error in a specific test score is embodied in the standard error of measurement. But scores can change from one testing to the next for reasons other than error.
t or f: The amount of error in a specific test score is embodied in the standard error of measurement. But scores remain as is from one testing to the next for reasons other than error.
false - The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations)
t or f: The element of objectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations)
true
t or f: The general "rule of thumb" for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value of .25 to .2 is in the acceptable range
false - The greater the proportion of the total variance attributed to true variance, the more reliable the test
t or f: The greater the proportion of the total variance attributed to true variance, the less reliable the test is
false - The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha.
t or f: The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient of beta
false - The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
t or f: The shorter the time that passes, the greater the likelihood that the reliability coefficient will be lower
true
t or f: The standard error of measurement can be used to set the confidence interval for a particular score or to determine whether a score is significantly different from a criterion
false - Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1
t or f: Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 2
false - When it comes to calculating split-half reliability coefficients, there's more than one way to split a test—but there are some ways you should never split a test
t or f: When it comes to calculating split-half reliability coefficients, there's more than one way to split a test— and all types of "splitting" method are valid
false - Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar
t or f: Where test items are highly heterogeneous, KR-20 and split-half reliability estimates will be similar
false - A challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.
t or f: a challenge in test development is to minimize the proportion of the total variance that is true variance and to maximize the proportion of the total variance that is error variance.
false - A test may be reliable in one context and unreliable in another.
t or f: a test is always reliable in all contexts
false
t or f: alternate forms and parallel forms are similar in nature
true
t or f: coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are
true
t or f: if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.
false -one way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time
t or f: one way of estimating the reliability of a measuring instrument is by using different instruments to measure the same thing at two points in time
false - Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items
t or f: reliability does not increase as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items
false - scorers and scoring systems are potential sources of error variance
t or f: scorers and scoring systems are unlikely sources of error variance
true
t or f: the reliability of the instrument might be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules
true
t or f: traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted.
classical test theory
that a score on an ability test is presumed to reflect not only the testtaker's true score on the ability being measured but also error
(1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency
there are basically three approaches to the estimation of reliability:
average proportional distance method
we define the _______________________ as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
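A sketch of one common description of the APD computation — average the absolute differences between scores on all item pairs, scaled by the number of response options minus 1. Both that scaling choice and the data are assumptions made for illustration:

```python
# Hedged APD sketch; the scaling convention and data are assumptions.
from itertools import combinations

def apd(items, n_options):
    """Average proportional distance: mean absolute difference between all
    item-score pairs, divided by (number of response options - 1)."""
    dists = []
    for person in items:
        for a, b in combinations(person, 2):
            dists.append(abs(a - b) / (n_options - 1))
    return sum(dists) / len(dists)

# Made-up 3-item test scored on a 7-point scale
scores = [
    [6, 5, 6],
    [4, 4, 5],
    [7, 6, 6],
]
value = apd(scores, 7)
```

Per the rule of thumb cited earlier in these cards, a value of .2 or lower suggests excellent internal consistency.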
true score
we will define _______________ as a value that according to classical test theory genuinely reflects an individual's ability (or trait) level as measured by a particular test.
facets,
which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration