Assessment - Chapter 8 (Test Development)
Item analysis tools
*Item-difficulty index *Item-reliability index *Item-validity index *Item-discrimination index
Item-validity index can be calculated once these factors are known:
*the item-score standard deviation. *the correlation between the item score and the criterion score.
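A minimal Python sketch of that calculation, using hypothetical item scores (0/1) and criterion scores; the index is simply the product of the two quantities listed above:

```python
from statistics import pstdev, mean

# Hypothetical data: one item scored 0/1 for ten testtakers,
# plus each testtaker's score on an external criterion.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
criterion = [52, 40, 61, 58, 45, 70, 38, 66, 59, 42]

def pearson_r(x, y):
    """Plain Pearson correlation, written out to stay dependency-free."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

s_item = pstdev(item)                          # item-score standard deviation
r_item_criterion = pearson_r(item, criterion)  # item-criterion correlation
item_validity_index = s_item * r_item_criterion
print(f"item-validity index = {item_validity_index:.3f}")
```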
5 stages of test development
1. test conceptualization 2. test construction 3. test tryout 4. item analysis 5. test revision
Test Revision in the Life Cycle of an Existing Test includes:
Cross-validation, co-validation, and quality assurance during test revision. The test revision process typically includes all of the steps of the initial test development.
Developing item banks
Each of the test items assembled as part of an item bank has undergone rigorous qualitative and quantitative evaluation. Many item banking efforts begin with the collection of appropriate items from existing instruments.
Test conceptualization
Idea for a test is conceived
Speed tests
Item analysis of tests taken under speed conditions yields misleading or uninterpretable results. Items toward the end of the test appear to have higher discrimination levels simply because only the fastest (and often most able) testtakers reach them.
Writing items
Considerations include item format, the item bank, item analysis, and writing items for computer administration.
Item response theory (IRT)
Item statistics are independent of the samples to which the test has been administered. Test items can be matched to testtakers' ability levels. IRT also facilitates advanced psychometric tools and methods.
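A minimal sketch of the two-parameter logistic (2PL) model commonly used in IRT; the discrimination (a) and difficulty (b) values are made up for illustration. Tabulating these probabilities across ability (theta) traces the item-characteristic curve:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability of a correct/keyed response at ability theta,
    given item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating (a=1.2), average difficulty (b=0.0).
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  P(correct)={p_correct(theta, a=1.2, b=0.0):.2f}")
```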
Test tryout
Once a preliminary form of the test has been developed, it is administered to a sample of testtakers under standardized conditions. Data on testtakers' performance on the test as a whole and on each item are then collected and analyzed to assist in making judgments about which items are good as they are, which items need to be revised, and which items need to be discarded.
Item fairness
Refers to the degree, if any, to which a test item is biased.
Selected-response format
Require testtakers to select a response from a set of alternative responses.
Constructed-response format
Require testtakers to supply or to create the correct answer, not merely to select it.
Classical test theory
Smaller sample sizes are required for testing. It utilizes relatively simple mathematical models. The assumptions underlying the theory are "weak," allowing it wide applicability. Most researchers are familiar with this basic approach, and many data analysis and statistics-related software packages are built from this perspective.
Some forms of content bias
Status, stereotype, familiarity, offensive choice of words, other.
Categorical scaling
Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
DIF analysis
Test developers scrutinize group-by-group item response curves, looking for what are termed DIF items.
Method of paired comparisons
Testtakers are presented with pairs of stimuli, which they are asked to compare, selecting one member of each pair according to some rule (e.g., the statement they agree with more). The textbook illustrates how an item on the Katz et al. scale would have looked had they used this method.
Item branching
The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
Validity shrinkage
The decrease in item validities that inevitably occurs after cross-validation of findings. It is expected and is viewed as integral to the test development process.
Summative scale
The final test score is obtained by summing the ratings across all the items.
Item-endorsement index
The statistic provides not a measure of the percentage of people passing the item but a measure of the percentage of people who said yes to, agreed with, or otherwise endorsed the item.
Test development
Umbrella term for all that goes into the process of creating a test.
Co-norming
When co-validation is used in conjunction with the creation of norms or the revision of existing norms.
Criterion-referenced instruments derives from:
a conceptualization of the knowledge or skills to be mastered.
Qualitative item analysis
a general term for various nonstatistical procedures designed to explore how individual test items work. Compares individual test items to each other and to the test as a whole.
Rating scale
a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
Item bank
a relatively large and easily accessible collection of test questions from which a test developer can draw items when constructing a test.
Test construction
a stage in the process of test development that entails writing test items (or re-writing or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building the final version of the test.
Sensitivity review
a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
Co-validation
a test validation process conducted on two or more tests using the same sample of testtakers.
Types of scales
a. age-based scale b. grade-based scale c. stanine scale
Test revision
action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement. It is usually based on item analysis, as well as related information derived from the test tryout. The revised version will then be tried out on a new sample of testtakers.
Computerized adaptive testing (CAT)
an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items.
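A minimal sketch of simple up/down item branching; the item pool and the raise/lower rule are hypothetical, and operational CAT systems typically select items using IRT-based criteria rather than this simple ladder:

```python
# Hypothetical item pool, keyed by difficulty level (1 = easiest, 5 = hardest).
ITEM_POOL = {
    1: "2 + 2 = ?", 2: "12 x 3 = ?", 3: "Solve 2x + 4 = 10",
    4: "Factor x^2 - 5x + 6", 5: "Differentiate x^3 - 2x",
}

def administer(answer_is_correct, n_items=4, start_level=3):
    """Simple up/down branching: a correct response raises the difficulty
    of the next item, an incorrect response lowers it."""
    level, administered = start_level, []
    for _ in range(n_items):
        administered.append((level, ITEM_POOL[level]))
        correct = answer_is_correct(ITEM_POOL[level])
        level = min(5, level + 1) if correct else max(1, level - 1)
    return administered

# Demo: a testtaker who answers every item correctly climbs to harder items.
print(administer(lambda item: True))
```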
Biased test item
an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Ipsative scoring
comparing a testtaker's score on one scale within a test with his or her score on another scale within that same test.
Expert panels
May provide qualitative analyses of test items. In the test development process, a group of people knowledgeable about the subject matter being tested and/or the population for whom the test was designed, who can provide input to improve the test's content, fairness, and other qualities. Expert panels are used in the process of test development to screen test items for possible bias.
Floor effect
refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured.
Guessing
Testtakers may guess at (predict or presume) the correct response rather than answering from knowledge. Guessing poses methodological problems for the test developer, who must plan whether and how to correct for it.
Matching item
The testtaker is presented with two columns, premises on the left and responses on the right, and must determine which response is best associated with which premise.
Class scoring (category scoring)
testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way.
Item pool
the reservoir or well from which items will or will not be drawn for the final version of the test.
Idea for a test may come from:
social need, review of the available literature, and common sense appeal.
3 criteria that any correction for guessing must meet, as well as other interacting issues that must be addressed:
1. A correction must recognize that a guess is not typically made on a totally random basis. 2. A correction for guessing must also deal with the problem of omitted items. 3. A correction must take into account that some testtakers are simply luckier than others when guessing ("lucky guessing"). The conventional correction formula is sketched below.
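A minimal sketch of the conventional correction-for-guessing formula, right minus wrong divided by (k − 1), where k is the number of response options; the test and numbers are hypothetical:

```python
def corrected_score(num_right, num_wrong, choices_per_item):
    """Conventional correction for guessing: right - wrong / (k - 1).
    Omitted items are simply not counted as wrong."""
    return num_right - num_wrong / (choices_per_item - 1)

# Hypothetical 50-item, 4-option test: 38 right, 8 wrong, 4 omitted.
print(corrected_score(38, 8, 4))   # 38 - 8/3, about 35.33
```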
Tests are deemed to be due for revision if the following exist:
1. Stimulus materials look dated and current testtakers cannot relate to them. 2. Verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers. 3. As pop culture changes and words take on new meaning, certain words or expressions in the test items or directions may be perceived as inappropriate. 4. Test norms are no longer adequate as a result of group membership changes in the population of potential testtakers. 5. Test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, so an age extension of the norms is necessary. 6. Reliability or validity of the test, as well as the effectiveness of individual test items, can be significantly improved by revision. 7. Theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.
Multiple-choice format
3 elements: 1. a stem, 2. a correct alternative or option, 3. several incorrect alternatives or options variously referred to as distractors or foils.
Item-characteristic curve
A graphic representation of item difficulty and discrimination.
Item-discrimination index
A measure of item discrimination, symbolized by a lowercase italic d. This estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
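A minimal sketch with hypothetical upper- and lower-group responses (e.g., the top and bottom 27% of total scorers); d is the difference between the proportions passing the item in the two groups:

```python
# Hypothetical: scored responses (1 = correct) to one item from testtakers in the
# upper and lower regions of the total-score distribution (10 people per group).
upper = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # 8 of 10 correct
lower = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # 3 of 10 correct

n = len(upper)                       # group size (equal groups assumed here)
d = (sum(upper) - sum(lower)) / n    # item-discrimination index
print(f"d = {d:.2f}")                # 0.50: the item separates high from low scorers
```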
Binary-choice item
A multiple-choice item that contains only two possible responses.
Anchor protocol
A test protocol scored by a highly authoritative scorer that is designated as a model for scoring and a mechanism for resolving scoring discrepancies.
Scaling methods
A test taker is presumed to have more or less of the characteristics measured by a (valid) test as a function of the test score. The higher or lower the score, the more or less of the characteristics he or she presumably possesses.
Likert scale
A type of summative rating scale. Usually used to scale attitudes. Each item presents the testtaker with 5 alternative responses. Usually reliable.
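A minimal sketch of summative (Likert-style) scoring with hypothetical 5-point responses; which items are reverse-keyed is assumed for illustration:

```python
# Hypothetical responses (1 = strongly disagree ... 5 = strongly agree) to a
# 6-item attitude scale; two negatively worded items are reverse-scored.
responses = [4, 2, 5, 3, 1, 4]
reverse_keyed = {1, 4}               # zero-based positions of the reversed items (assumed)

def summative_score(resp, reverse, points=5):
    """Sum the ratings across items, flipping the reverse-keyed ones."""
    return sum((points + 1 - r) if i in reverse else r
               for i, r in enumerate(resp))

print(summative_score(responses, reverse_keyed))   # total attitude score (25 here)
```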
Item analysis
After the first draft of the test has been administered to a representative group of examinees, the test developer analyzes test scores and responses to individual items. Item analysis refers collectively to the different types of statistical scrutiny that the test data can potentially undergo.
Differential item functioning (DIF)
An item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait.
"think aloud" test administration
A qualitative research tool, proposed by Cohen et al., designed to shed light on the testtaker's thought processes during the administration of a test.
True-false item
A good true-false item contains a single idea, is not excessively long, and is not subject to debate; the correct response must undoubtedly be one of the two choices.
Item format
Variables such as the form, plan, structure, arrangement, and layout of the individual test items. Two types: selected-response format and the constructed-response format.
Methods of evaluating item bias:
a) noting differences between item-characteristic curves, b) noting differences in item-difficulty levels, c) noting differences in item-discrimination indexes.
Criterion referenced tests
individuals' scores are given meaning by comparison to a standard or criterion. They are typically used in occupational licensing. Examples: driver's license exam; SAT; academic skills assessments.
Item-difficulty index
is obtained by calculating the proportion of the total number of test takers who got the item right. An item with a mid-range difficulty level is likely to be "good."
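A minimal sketch with hypothetical item scores; the optimal-difficulty line reflects the textbook's rule of thumb of aiming roughly halfway between chance success and 1.00:

```python
# Hypothetical 0/1 scores on one item for 20 testtakers.
item_scores = [1] * 13 + [0] * 7

p = sum(item_scores) / len(item_scores)   # item-difficulty index
print(f"p = {p:.2f}")                     # 0.65: mid-range, likely a "good" item

# Rule of thumb for a 4-option multiple-choice item: optimal average difficulty
# is about halfway between the chance success rate (0.25) and 1.00.
optimal = (1.00 + 0.25) / 2
print(f"optimal p (4-option item) ~ {optimal:.3f}")
```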
Pilot work
preliminary research surrounding the creation of a prototype of the test. The items may be pilot studied to evaluate whether they should be included in the final form of the instrument. Test developer typically attempts to determine how best to measure a targeted construct.
Scoring items
the most commonly used model is the cumulative model, due in large part to its simplicity and logic. Typically, the rule in a cumulatively scored test is that the higher the score on the test, the higher the test taker is on the ability, the trait, or some other characteristic the test purports to measure.
Scaling
the process of setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured. L. L. Thurstone was at the forefront of efforts to develop methodologically sound scaling methods.
Test tryout
the test should be tried out on people who are similar in critical respects to the people for whom the test was designed. It should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered (all instructions, etc.).
Qualitative methods
Techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Guttman scale
A scaling method that yields ordinal-level measures. Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured, so that, for example, all respondents who agree with the strongest item (a) should also agree with the milder items (b, c, and d).
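A minimal sketch, with hypothetical response patterns, of the cumulative property a perfect Guttman scale requires:

```python
# Hypothetical Guttman-style responses: items ordered weakest -> strongest,
# 1 = endorses the statement. In a perfect Guttman pattern, endorsing a stronger
# item implies endorsing every weaker one.
patterns = [
    [1, 1, 1, 0],   # consistent
    [1, 1, 0, 0],   # consistent
    [1, 0, 1, 0],   # inconsistent: endorses a stronger item but skips a weaker one
]

def is_guttman_consistent(pattern):
    # Once the responses drop from 1 to 0, they must stay 0.
    return "01" not in "".join(map(str, pattern))

for p in patterns:
    print(p, is_guttman_consistent(p))
```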
Test revision
A stage in New Test Development. Act judiciously on all information and mold the test into its final form. Some items from the original item pool will be eliminated and others will be rewritten.
Item-validity index
A statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
Factor analysis
A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).
Designing an item bank
1. Items (a. acquisition and development, b. classification, c. management). 2. Tests (a. assembly, b. administration, scoring, and reporting, c. evaluation). 3. System (a. acquisition and development, b. software and hardware, c. monitoring and training, d. access and security). 4. Use and acceptance (a. general, b. instructional improvement, c. adaptive testing, d. certification of competence, e. program and curriculum evaluation, f. testing and reporting requirements imposed by external agencies). 5. Costs (a. cost feasibility, b. cost comparisons).
Some preliminary questions in test conceptualization
1. What is the test designed to measure? 2. What is the objective of the test? 3. Is there a need for this test? 4. Who will use this test? 5. Who will take this test? 6. What content will the test cover? 7. How will the test be administered?
Scoring drift
A discrepancy between scoring in an anchor protocol and the scoring of another protocol.
Comparative scaling
One method of sorting that entails judgments of a stimulus in comparison to every other stimulus on the scale.
Completion item
Requires the examinee to provide a word or phrase that completes a sentence. AKA short-answer item.
Scalogram analysis
The item-analysis procedure and approach to test development through which the data from a Guttman scale are analyzed; it involves a graphic mapping of a testtaker's responses.
Norm referenced tests
individuals' scores are given meaning by comparison to normative sample. Examples: ACT, GRE, WAIS III, Iowa Tests of Basic Skills
Item-reliability index
provides an indication of the internal consistency of a test. the higher this index, the greater the test's internal consistency. This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
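A minimal sketch using hypothetical item and total test scores, with the same dependency-free correlation helper as in the item-validity sketch above:

```python
from statistics import pstdev, mean

# Hypothetical 0/1 scores on one item and total test scores for ten testtakers.
item  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
total = [41, 30, 45, 44, 33, 50, 29, 47, 43, 31]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Item-reliability index = item-score SD (s) times item-total correlation (r).
item_reliability_index = pstdev(item) * pearson_r(item, total)
print(f"item-reliability index = {item_reliability_index:.3f}")
```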
Ceiling effect
the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured.
Cross validation
the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
DIF items
those items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership.
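A minimal sketch of one simple way to screen for this, using hypothetical records: match testtakers on the underlying trait by stratifying on total score, then compare endorsement rates across groups within each stratum. Operational DIF analyses use more formal methods (e.g., comparing group-by-group item response curves):

```python
from collections import defaultdict

# Hypothetical records: (group, total_test_score, item_response), 1 = endorsed.
records = [
    ("A", 10, 1), ("A", 10, 1), ("A", 20, 1), ("A", 20, 1), ("A", 30, 1),
    ("B", 10, 0), ("B", 10, 1), ("B", 20, 0), ("B", 20, 1), ("B", 30, 1),
]

# Stratify on total score, then compare the item's endorsement rate by group.
strata = defaultdict(lambda: defaultdict(list))
for group, total, resp in records:
    strata[total][group].append(resp)

for total, by_group in sorted(strata.items()):
    rates = {g: sum(r) / len(r) for g, r in by_group.items()}
    print(f"total score {total}: endorsement rates {rates}")
# Large, consistent gaps between groups at the same total score flag a potential DIF item.
```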