Test Development: Chapter 8
DIF items
items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership
completion item
requires the examinee to provide a word or phrase that completes a sentence
item analysis
statistical procedures that assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded
categorical scaling
stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
scoring drift
a discrepancy between scoring in an anchor protocol and the scoring of another protocol
qualitative item analysis
a general term for various nonstatistical procedures designed to explore how individual test items work
item-characteristic curve
a graphic representation of item difficulty and discrimination
rating scale
a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
item-discrimination index
a measure of item discrimination, symbolized by a lowercase italic "d"
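A minimal sketch of how d is commonly computed, assuming the usual extreme-groups formula d = (U - L) / n, where U and L are the numbers of testtakers in the upper and lower scoring groups who answered the item correctly and n is the size of each group. The function name, the 27% default group size, and the sample data are illustrative assumptions, not part of the card above.

```python
def item_discrimination(total_scores, item_correct, fraction=0.27):
    """Return d = (U - L) / n for a single item.

    total_scores: total test score for each testtaker
    item_correct: 1 if that testtaker answered the item correctly, else 0
    fraction: share of testtakers placed in each extreme group (assumed 27%)
    """
    n = max(1, round(len(total_scores) * fraction))
    # Rank testtakers from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    u = sum(item_correct[i] for i in upper)  # correct answers in upper group
    l = sum(item_correct[i] for i in lower)  # correct answers in lower group
    return (u - l) / n


# Hypothetical data: ten testtakers, one item.
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
correct = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
print(item_discrimination(scores, correct))  # 1.0
```

A value near +1 means high scorers passed the item and low scorers failed it; a negative value flags an item that low scorers answered correctly more often than high scorers.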
binary-choice item
a multiple-choice item that contains only two possible responses. Example: true/false
test conceptualization
the early stage of test development in which an idea for a test is conceived
scalogram analysis
an item analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses
differential item functioning (DIF)
a phenomenon in which an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
biased test item
an item that favors one particular group of examinees in relation to another when differences in group ability are controlled
short-answer item
another name for a completion item
qualitative methods
techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
test development
an umbrella term for all that goes into the process of creating a test
ipsative scoring
comparing a testtaker's score on one scale within a test to another scale within that same test
multiple-choice format
has three elements: a stem, a correct answer, and several incorrect answers referred to as distractors or foils
co-validation
a test validation process conducted on two or more tests using the same sample of testtakers; referred to as co-norming when used in conjunction with the creation or revision of norms
test tryout
administration of a preliminary version of the test under conditions that simulate those under which the final version will be administered
comparative scaling
entails judgments of a stimulus in comparison with every other stimulus on the scale
unidimensional
only one dimension is presumed to underlie the ratings
selected-response format
requires testtakers to select a response from a set of alternative responses
constructed-response format
requires testtakers to supply or to create the correct answer, not merely to select it
item-difficulty index
the proportion of testtakers who answered an item correctly; the name used for this statistic in achievement testing
item-endorsement index
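the proportion of testtakers who endorsed an item (for example, agreed with it); the name used for this statistic in personality testing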
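The item-difficulty index and the item-endorsement index are the same statistic, usually written p: the proportion of testtakers who gave the keyed response to an item. A minimal sketch follows; the function name and the sample responses are hypothetical.

```python
def item_proportion(responses):
    """Return p, the proportion of testtakers giving the keyed response.

    responses: 1 for a correct (or endorsed) response, 0 otherwise.
    In achievement testing p is read as item difficulty; in personality
    testing the same number is read as item endorsement.
    """
    if not responses:
        raise ValueError("at least one response is required")
    return sum(responses) / len(responses)


# Hypothetical data: five testtakers, three of whom answered correctly.
print(item_proportion([1, 1, 0, 1, 0]))  # 0.6
```

Note that a larger p means an easier item, so the "difficulty" label runs opposite to the number.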
Likert scale
a summative scale, used extensively in psychology, usually to scale attitudes
item format
variables such as the form, plan, structure, arrangement, and layout of individual test items
item bank
a relatively large and easily accessible collection of test questions
Guttman scale
a scaling method that yields ordinal-level measures where items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
test construction
a stage in the process of test development that entails writing test items, as well as formatting items, setting scoring rules, and otherwise designing and building a test.
item-validity index
a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
factor analysis
a statistical tool useful in determining whether items on a test appear to be measuring the same thing
sensitivity review
a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations
essay item
a test item that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation
anchor protocol
a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
test revision
action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement
phantom factors
factors that actually are just artifacts of the small sample size
multidimensional
more than one dimension is thought to guide the testtaker's responses
item-reliability index
provides an indication of the internal consistency of a test
DIF analysis
test developers scrutinize group-by-group item response curves, looking for what are termed DIF items
class scoring
testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
method of paired comparisons
testtakers are presented with pairs of stimuli which they are asked to compare
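As a rough illustration of how the pairs are built, every stimulus is paired once with every other stimulus, so N stimuli yield N(N - 1) / 2 pairs. The sketch below assumes four hypothetical stimuli labeled A through D.

```python
from itertools import combinations


def make_pairs(stimuli):
    """Return every unordered pair of distinct stimuli, each pair once."""
    return list(combinations(stimuli, 2))


# Hypothetical stimuli; with N = 4 there are 4 * 3 / 2 = 6 pairs.
pairs = make_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 6
print(pairs[0])    # ('A', 'B')
```

Because the number of pairs grows roughly with the square of N, the method becomes lengthy for testtakers once the stimulus list is long.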
"think aloud" test administration
testtakers take the test and think aloud while they respond to each item
item branching
the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items
validity shrinkage
the decrease in item validities that inevitably occurs after cross-validation of findings
item fairness
the degree, if any, to which a test item is biased
test revision
the development process that the test undergoes as it is modified and revised
item analysis
the different types of statistical scrutiny that the test data can potentially undergo after the test tryout
floor effect
the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
ceiling effect
the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured
summative scale
the final test score is obtained by summing the ratings across all the items
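A minimal sketch of summative scoring, assuming each item is rated on the same numeric scale (here a hypothetical 1 to 5 agreement scale). The function name and the ratings are illustrative only.

```python
def summative_score(ratings):
    """Return the total score: the sum of the ratings across all items."""
    return sum(ratings)


# Hypothetical ratings for one testtaker on a five-item attitude scale.
print(summative_score([4, 5, 3, 2, 4]))  # 18
```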
pilot work
the preliminary research surrounding the creation of a prototype of the test
scaling
the process of setting rules for assigning numbers in measurement
item pool
the reservoir or well from which items will or will not be drawn for the final version of the test
cross-validation
the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion
matching item
the testtaker is presented with two columns: premises on the left and responses on the right. The testtaker must determine which response is best associated with which premise