Psychological Testing: Chapter 5
Test construction error variance
- item sampling or content sampling: terms that refer to variation among items within a test as well as to variation among items between tests. The extent to which a testtaker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance.
Criterion-referenced tests
A criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective. Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
split-half reliability
An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps, illustrated in the sketch below:
- Step 1. Divide the test into equivalent halves.
- Step 2. Calculate a Pearson r between scores on the two halves of the test.
- Step 3. Adjust the half-test reliability using the Spearman-Brown formula (discussed shortly).
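A minimal sketch of these three steps in Python (assuming numpy; the odd-even split and the item scores are hypothetical illustrations, not from the text):

```python
import numpy as np

def split_half_reliability(item_scores: np.ndarray) -> float:
    """item_scores: a respondents-by-items matrix of scored responses."""
    # Step 1: divide the test into equivalent halves (odd-even split here).
    odd_half = item_scores[:, 0::2].sum(axis=1)
    even_half = item_scores[:, 1::2].sum(axis=1)
    # Step 2: calculate a Pearson r between scores on the two halves.
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Step 3: adjust the half-test r with the Spearman-Brown formula.
    return (2 * r_half) / (1 + r_half)

# Usage: 5 testtakers, 6 dichotomously scored items.
scores = np.array([[1, 1, 0, 1, 0, 1],
                   [1, 0, 1, 1, 1, 1],
                   [0, 0, 1, 0, 0, 0],
                   [1, 1, 1, 1, 1, 0],
                   [0, 1, 0, 0, 1, 0]])
print(round(split_half_reliability(scores), 3))
```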
Item information curve
An item information curve can be a very useful tool for test developers. It is used, for example, to reduce the total number of test items in a "long form" of a test and so create a new and effective "short form." Shorter versions of tests are created through selection of the most informative set of items (questions) that are relevant for the population under study.
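A minimal sketch of an item information function, assuming a two-parameter logistic (2PL) model in which information equals a^2 * P * (1 - P); the parameter values are hypothetical:

```python
import numpy as np

def item_information(theta, a=1.2, b=0.0):
    """Information of a 2PL item with discrimination a and difficulty b."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # probability of a keyed response
    return a**2 * p * (1.0 - p)

# Information peaks near theta = b, which is why the most informative
# items can be selected for the theta range of the population under study.
for t in np.linspace(-3, 3, 7):
    print(f"theta={t:+.1f}  information={item_information(t):.3f}")
```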
Characteristics of IRT framework items
Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics.
restriction of range or restriction of variance
If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
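A minimal simulation of the restriction effect (all values hypothetical), showing the correlation shrink when only high scorers on one variable are sampled:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.6 * x + 0.8 * rng.normal(size=5000)   # population r of about .6
full_r = np.corrcoef(x, y)[0, 1]
mask = x > 1.0                              # restrict range: keep only high x
restricted_r = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"full-range r = {full_r:.2f}, restricted r = {restricted_r:.2f}")
```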
discrimination
In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
A domain of behavior
A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
Parallel forms of a test
Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal.
coefficient of inter-scorer reliability
Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.
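A minimal sketch of this simplest approach, a Pearson r between two scorers' ratings of the same protocols (the ratings are hypothetical):

```python
import numpy as np

scorer_1 = np.array([4, 3, 5, 2, 4, 1, 3, 5])  # scorer 1's ratings
scorer_2 = np.array([4, 2, 5, 3, 4, 2, 3, 4])  # scorer 2's ratings, same protocols
r = np.corrcoef(scorer_1, scorer_2)[0, 1]
print(f"coefficient of inter-scorer reliability = {r:.2f}")
```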
APD
Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach's alpha), the APD is a measure that focuses on the degree of difference that exists between item scores. Accordingly, we define the average proportional distance method as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
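One plausible reading of the APD computation, sketched below under the assumption that the distance between item scores is the mean absolute difference across all item pairs, scaled by the maximum possible distance (the number of response options minus one); the ratings are hypothetical:

```python
from itertools import combinations

import numpy as np

def average_proportional_distance(item_scores: np.ndarray, n_options: int) -> float:
    """item_scores: respondents-by-items matrix; n_options: response scale points."""
    distances = []
    for person in item_scores:
        for i, j in combinations(range(len(person)), 2):
            distances.append(abs(person[i] - person[j]))  # difference between item scores
    # Scale the average distance by the maximum possible distance.
    return float(np.mean(distances)) / (n_options - 1)

ratings = np.array([[5, 6, 5], [2, 3, 2], [7, 6, 7]])  # 7-point scale
print(round(average_proportional_distance(ratings, n_options=7), 3))
```

Lower values indicate item scores that sit close together, that is, greater internal consistency.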
Score error variance
Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.
Test administration error variance
Sources of error variance that occur during test administration may influence the testtaker's attention or motivation.
- The testtaker's reactions to those influences are the source of one kind of error variance.
- Examples include room temperature, level of lighting, and the amount of ventilation and noise.
Homogeneity
Tests are said to be homogeneous if they contain items that measure a single trait. As applied to test items, homogeneity (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.
local independence
The assumption of local independence means that (a) there is a systematic relationship between all of the test items and (b) that relationship has to do with the theta level of the testtaker. When the assumption of local independence is met, differences in responses to items are reflective of differences in the underlying trait or ability; that is, once theta is accounted for, responses to the items are statistically independent of one another.
monotonicity
The assumption of monotonicity means that the probability of endorsing or selecting an item response indicative of higher levels of theta should increase as the underlying level of theta increases.
coefficient of equivalence
The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
item response theory
The procedures of item response theory provide a way to model the probability that a person with X ability will be able to perform at a level of Y. A synonym for IRT in the academic literature is latent-trait theory.
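A minimal sketch of that modeling idea, assuming the common two-parameter logistic (2PL) form; the parameter values are hypothetical:

```python
import math

def p_correct(theta: float, a: float = 1.0, b: float = 0.5) -> float:
    """2PL model: probability that a person at ability theta responds
    correctly to an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The probability rises monotonically with ability (see monotonicity above).
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  P(correct)={p_correct(theta):.2f}")
```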
Reliability
The proportion of total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests. Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.
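In symbols, with total variance decomposed into true and error components, this proportion is:

```latex
r_{xx} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{total}}}
       = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{true}} + \sigma^2_{\text{error}}}
```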
standard error of measurement
The standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement. In general, the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM. The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the standard error of measurement as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
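Under classical test theory, the SEM is computed from the test's standard deviation and its reliability coefficient; a minimal sketch (the SD and reliability values are hypothetical):

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); note the inverse relationship
    between reliability and the SEM."""
    return sd * math.sqrt(1.0 - reliability)

# e.g., a scale with SD = 15 and a reliability coefficient of .91
print(round(standard_error_of_measurement(15, 0.91), 2))  # -> 4.5
```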
standard error of the difference
The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
1. How did this individual's performance on test 1 compare with his or her performance on test 2?
2. How did this individual's performance on test 1 compare with someone else's performance on test 1?
3. How did this individual's performance on test 1 compare with someone else's performance on test 2?
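A minimal sketch of the standard error of the difference, assuming both scores are reported on the same scale (same SD), in which case it reduces to SD * sqrt(2 - r1 - r2); the values are hypothetical:

```python
import math

def se_difference(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores with
    reliabilities r1 and r2, both on a scale with the given SD."""
    return sd * math.sqrt(2.0 - r1 - r2)

# Two tests, each with SD = 10, with reliabilities .90 and .84
print(round(se_difference(10, 0.90, 0.84), 2))  # -> 5.1
```

The obtained difference between two scores can then be compared with a multiple of this value (for example, 1.96 for the .05 level) to judge whether it is statistically meaningful.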
unidimensionality
The unidimensionality assumption posits that the set of items measures a single continuous latent construct. Stated another way, it is a person's theta level that gives rise to a response to the items in the scale.
Usefulness of standard error of measurement
Useful in answering questions about the precision of an observed score is an estimate of the amount of error in that score. The standard error of measurement provides such an estimate. Further, the standard error of measurement is useful in establishing what is called a confidence interval: a range or band of test scores that is likely to contain the true score.
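A minimal sketch of building such a confidence interval (the observed score and SEM are hypothetical):

```python
# About 68% of observed scores fall within 1 SEM of the true score,
# and about 95% within 1.96 SEM.
observed_score, sem = 50, 4.0
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem
print(f"95% confidence interval: {low:.1f} to {high:.1f}")
```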
three approaches to the estimation of reliability
We have seen that, with respect to the test itself, there are basically three approaches to the estimation of reliability: (1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency.
power test
When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a power test
coefficient of stability
When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
true score
a value that according to classical test theory genuinely reflects an individual's ability (or trait) level as measured by a particular test.
Examiner-related variables
are potential sources of error variance. The examiner's physical appearance and demeanor are among the factors for consideration here. Examiners may also:
- depart from the procedure prescribed for a particular test;
- unwittingly provide clues by emphasizing key words as they pose questions;
- convey information about a response through head nodding, eye movements, or other nonverbal gestures.
heterogeneity
describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait.
decision study
examines the usefulness of test scores in helping the test user make decisions. In practice, test scores are used to guide a variety of decisions, from placing a child in special education to hiring new employees to discharging mental patients from the hospital. The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use.
Alternate forms
different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation "parallel," alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
generalizability study
examines how generalizable scores from a particular test are if the test is administered in different situations; that is, it examines how much of an impact different facets of the universe have on the test score.
speed test
generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
Random error
is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
A dynamic characteristic
is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. If, for example, one were to take hourly measurements of the dynamic characteristic of anxiety as manifested by a stockbroker throughout a business day, one might find the measured level of this characteristic to change from hour to hour.
Test-retest reliability
is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.
reliability coefficient
is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
generalizability theory
is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets , which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration. According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.
inter-scorer reliability
is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training.
KR-20
is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.
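A minimal sketch of the KR-20 computation (the 0/1 item scores are hypothetical; the sample variance of total scores is used here):

```python
import numpy as np

def kr20(item_scores: np.ndarray) -> float:
    """item_scores: respondents-by-items matrix of dichotomous (0/1) scores."""
    k = item_scores.shape[1]                         # number of items
    p = item_scores.mean(axis=0)                     # proportion passing each item
    q = 1.0 - p                                      # proportion failing each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

scores = np.array([[1, 1, 1, 0, 1],
                   [1, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [1, 1, 1, 1, 1],
                   [1, 0, 1, 1, 0]])
print(round(kr20(scores), 3))
```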
coefficient alpha
may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items.
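A minimal sketch of coefficient alpha for nondichotomous (e.g., Likert-type) items; the ratings are hypothetical:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

ratings = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 3, 4]])
print(round(cronbach_alpha(ratings), 3))
```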
domain sampling theory
Proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.
systematic error
refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
parallel forms reliability
refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
alternate forms reliability
refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
reliability
refers to consistency in measurement. And whereas in everyday conversation reliability always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent.
Difficulty
refers to the attribute of not being easily accomplished, solved, or comprehended.
error
refers to the component of the observed test score that does not have to do with the testtaker's ability.
Inter-item consistency
refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
measurement error
refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured. Measurement error, much like error in general, can be categorized as being either systematic or random.
Assumptions in Using IRT
(1) unidimensionality, (2) local independence, and (3) monotonicity.
The Spearman-Brown formula
allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
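In its general form, where n is the factor by which the test length changes and r_xy is the observed correlation:

```latex
r_{SB} = \frac{n\, r_{xy}}{1 + (n - 1)\, r_{xy}}
```

For the split-half case, n = 2, so the adjusted estimate reduces to 2r / (1 + r), as used in Step 3 of the split-half procedure above.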
Sources of error variance
include test construction, administration, scoring, and/or interpretation.