Psych Testing Chapter 8: Test Development
Item Development in Criterion-Referenced Tests
-Ideally, each item on a criterion-referenced test addresses the issue of whether the respondent has met certain criteria. -Development of a criterion-referenced test may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it.
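A minimal sketch of that two-group comparison, using made-up mastery/non-mastery data: items that the mastery group tends to pass and the non-mastery group tends to fail are the ones worth keeping.

```python
# Hypothetical data: rows = testtakers, columns = items (1 = correct, 0 = incorrect)
masters = [[1, 1, 0, 1],
           [1, 1, 1, 1],
           [1, 0, 1, 1]]
non_masters = [[0, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 0, 1]]

def proportion_correct(group, item):
    """Proportion of a group answering the given item correctly."""
    return sum(person[item] for person in group) / len(group)

for item in range(4):
    p_m = proportion_correct(masters, item)
    p_n = proportion_correct(non_masters, item)
    # A large positive difference suggests the item separates mastery from non-mastery.
    print(f"item {item}: masters {p_m:.2f}, non-masters {p_n:.2f}, diff {p_m - p_n:+.2f}")
```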
Test Construction Writing Items
-Item Pool -Item Format -Multiple choice -Computer Administration
Writing Items for Computer Administration
-Item bank: a relatively large and easily accessible collection of test questions -computerized adaptive testing (CAT)
Revision in New Test Development
-Items are evaluated as to their strengths and weaknesses - some items may be eliminated. -Some items may be replaced by others from the item pool. -The revised test is then administered under standardized conditions to a second sample. -Once the test has been finalized, norms may be developed from the data and the test is said to be standardized.
Stimulus of Test Conceptualization
-The stimulus could be knowledge of psychometric problems with other tests, a new social phenomenon, or any number of things. -There may be a need, for example, to assess mastery in an emerging occupation.
Scoring Items of Test Construction
-cumulatively scored test -class scoring -ipsative scoring
Other Considerations in Item Analysis
-guessing -item fairness -a biased test item -speed tests
What is a "Good Item" in Test Tryout
-reliable and valid -discriminates among testtakers: high scorers on the test overall answer the item correctly, while low scorers answer it incorrectly
Test Revision
-revision in new test development -revision in the life cycle of a test -cross-validation -co-validation -quality assurance -the use of IRT in building and revising tests
Item Development in Tests
-test items may be pilot studied to evaluate whether they should be included in the final form of the instrument
Test Tryout
-The test should be tried out on people similar to the population for which it was designed. -A minimum of 5 to 10 respondents per test item is recommended. -The tryout should be administered in the same manner, and with the same instructions, as the final product.
Test Conceptualization
-the impetus for developing a new test is some thought that "there ought to be a test for..."
3 Possible Applications of IRT
1) evaluating existing tests for the purpose of mapping test revisions, 2) determining measurement equivalence across testtaker populations, and 3) developing item banks
The Item-Validity Index
Allows test developers to evaluate the validity of items in relation to a criterion measure.
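A minimal sketch, on hypothetical data, of one common computation of this index: the item-score standard deviation multiplied by the item-criterion correlation (statistics.correlation requires Python 3.10+).

```python
import statistics

item = [1, 0, 1, 1, 0, 1]           # dichotomous item scores (hypothetical)
criterion = [12, 5, 10, 14, 6, 9]   # scores on an external criterion measure

p = sum(item) / len(item)           # item difficulty
s_item = (p * (1 - p)) ** 0.5       # standard deviation of a dichotomous item
r_ic = statistics.correlation(item, criterion)  # item-criterion (point-biserial) correlation
print(f"item-validity index = {s_item:.3f} * {r_ic:.3f} = {s_item * r_ic:.3f}")
```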
Cross-Validation
Cross-validation refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion. -Item validities inevitably become smaller when the test is administered to a second sample, a decrease known as validity shrinkage.
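A minimal illustration with fabricated numbers: the test-criterion correlation is recomputed on an independent second sample, and the smaller coefficient there is the validity shrinkage the entry describes (Python 3.10+ for statistics.correlation).

```python
import statistics

# sample on which the test was originally validated (hypothetical)
test_1 = [10, 14, 9, 16, 12, 18]
crit_1 = [22, 31, 20, 30, 27, 37]

# new, independent sample of testtakers (hypothetical)
test_2 = [11, 15, 9, 17, 13, 18]
crit_2 = [28, 27, 24, 33, 22, 35]

r1 = statistics.correlation(test_1, crit_1)
r2 = statistics.correlation(test_2, crit_2)   # typically smaller: validity shrinkage
print(f"original validity r = {r1:.2f}, cross-validated r = {r2:.2f}")
```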
Likert Scale
Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree/disagree or approve/disapprove continuum. -typically reliable
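A minimal scoring sketch under the usual summative convention; the items and the reverse-keyed flag are hypothetical.

```python
responses = [4, 5, 2, 4]                      # one testtaker's 1-5 ratings on four items
reverse_keyed = [False, False, True, False]   # item 3 is worded in the opposite direction

def likert_score(ratings, flags, points=5):
    """Sum the ratings, flipping reverse-keyed items so a higher total
    always means more of the attitude being measured."""
    return sum((points + 1 - r) if flip else r for r, flip in zip(ratings, flags))

print(likert_score(responses, reverse_keyed))  # 4 + 5 + (6 - 2) + 4 = 17
```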
Comparative Scaling
Entails judgments of a stimulus in comparison with every other stimulus on the scale.
Revision in the Life Cycle of a Test
Existing tests may be revised if the stimulus or verbal material is dated, some outdated words have become offensive, the norms no longer represent the current population, the psychometric properties could be improved, or the underlying theory behind the test has changed. -In test revision the same steps are followed as with new tests (i.e., test conceptualization, construction, tryout, item analysis, and revision).
Item Development in Norm-Referenced Tests
Generally, a good item on a norm-referenced achievement test is an item that high scorers on the test as a whole answer correctly and low scorers answer incorrectly.
Item Format
Includes variables such as the form, plan, structure, arrangement, and layout of individual test items. -selected-response format -constructed-response format
The Item-Discrimination Index
Indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test. -a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
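A minimal sketch of d = (U - L) / n on hypothetical data; the extreme groups here are the top and bottom thirds by total score (the top and bottom 27% is another common choice).

```python
# (total test score, 1 if this item was answered correctly else 0) - hypothetical
testtakers = [(95, 1), (90, 1), (88, 1), (70, 0), (65, 1),
              (60, 0), (40, 0), (35, 1), (30, 0)]

testtakers.sort(key=lambda t: t[0], reverse=True)
n = len(testtakers) // 3                  # size of each extreme group
upper, lower = testtakers[:n], testtakers[-n:]

U = sum(correct for _, correct in upper)  # high scorers passing the item
L = sum(correct for _, correct in lower)  # low scorers passing the item
d = (U - L) / n                           # ranges from -1 to +1; higher discriminates better
print(f"d = ({U} - {L}) / {n} = {d:.2f}")
```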
Speed Tests
Item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be.
The Use of IRT in Building and Revising Tests
Items are evaluated by means of item characteristic curves (ICCs), in which performance on an item is related to the testtaker's underlying ability. -3 possible applications of IRT in building and revising tests
Guttman Scale
Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. -All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
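A minimal sketch (hypothetical responses) of checking that cumulative pattern; a coefficient of reproducibility of about .90 or higher is conventionally taken to indicate a scalable item set.

```python
# Items ordered weakest to strongest; 1 = agree, 0 = disagree (hypothetical)
responses = [[1, 1, 1, 0],   # fits the cumulative pattern
             [1, 1, 0, 0],   # fits the cumulative pattern
             [1, 0, 1, 0]]   # agrees with a stronger item but not a weaker one

def guttman_errors(row):
    """Deviations from the ideal pattern with the same number of
    endorsements (endorse the first k items, reject the rest)."""
    k = sum(row)
    ideal = [1] * k + [0] * (len(row) - k)
    return sum(a != b for a, b in zip(row, ideal))

errors = sum(guttman_errors(r) for r in responses)
total = sum(len(r) for r in responses)
print(f"coefficient of reproducibility = {1 - errors / total:.2f}")
```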
Scaling Methods of Test Construction
Numbers can be assigned to responses to calculate test scores using a number of methods -Rating Scales -Likert Scale -Method of Paired Comparisons -Comparative Scaling -Categorical Scaling -Guttman Scale -Method of Equal-Appearing Intervals
Qualitative Methods
Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Types of Scales
Scales are instruments to measure some trait, state, or ability. May be categorized in many ways (e.g. multidimensional, unidimensional, etc.). -L. L. Thurstone was influential in the development of sound scaling methods
Categorical Scaling
Stimuli (e.g. index cards) are placed into one of two or more alternative categories.
5 Stages of Test Development
Test Conceptualization --> Test Construction --> Test Tryout --> Item Analysis --> Test Revision --> [back to Test Tryout]
Guessing
Test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory solution to the problem of guessing has been achieved.
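One classic (and, as the entry notes, imperfect) approach is the formula corrected score = R - W / (k - 1), which deducts the number of right answers a blind guesser would be expected to accumulate. A minimal sketch:

```python
def corrected_score(right, wrong, options):
    """Classic correction for guessing: each wrong answer implies some
    lucky guesses elsewhere, so deduct wrong / (options - 1).
    Omitted items count as neither right nor wrong."""
    return right - wrong / (options - 1)

# 60 right and 20 wrong on 5-option multiple-choice items (hypothetical):
print(corrected_score(60, 20, 5))   # 60 - 20/4 = 55.0
```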
Quality Assurance
Test developers employ examiners who have experience testing members of the population targeted by the test. Examiners follow standardized procedures and undergo training. -anchor protocols are used
Analysis of Item Alternatives
The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
Item Pool
The reservoir or well from which items will or will not be drawn for the final version of the test. -comprehensive sampling provides a basis for content validity of the final version of the test.
Think Aloud Test Administration
Think aloud test administration - respondents are asked to verbalize their thoughts as they occur during testing.
Qualitative Item Analysis
a general term for various nonstatistical procedures designed to explore how individual test items work. -qualitative methods -think aloud test administration -expert panels -sensitivity review
Item Characteristic Curves (ICC)
a graphic representation of item difficulty and discrimination
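A minimal sketch of one common ICC model, the three-parameter logistic, with hypothetical parameter values: a sets the slope (discrimination), b the location (difficulty), and c the lower asymptote (pseudo-guessing).

```python
import math

def icc(theta, a=1.2, b=0.0, c=0.20):
    """3PL model: P(correct | theta) = c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# probability of a correct response rises with underlying ability theta
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: P(correct) = {icc(theta):.2f}")
```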
Rating Scales
a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. -all rating scales result in ordinal level data -some are unidimensional, others are multidimensional
Anchor Protocols
a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies. -A discrepancy between scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.
Co-Validation
a test validation process conducted on two or more tests using the same sample of testtakers. -economical for test developers
Computerized Adaptive Testing (CAT)
an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items. -able to provide economy in testing time and number of items presented -tends to reduce floor effects and ceiling effects
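A minimal sketch of the adaptive idea only, not a production algorithm: administer the unadministered item closest to the current ability estimate, then move the estimate up after a correct answer and down after an incorrect one. The item bank, step size, and responses are hypothetical; operational CAT systems use IRT-based estimation.

```python
item_bank = [-1.5, -0.5, 0.0, 0.5, 1.5, 2.0]   # item difficulties (hypothetical)
answers = [True, True, False, True]             # simulated responses

theta, step = 0.0, 1.0                          # ability estimate and step size
for correct in answers:
    # administer the remaining item closest to the current estimate
    item = min(item_bank, key=lambda b: abs(b - theta))
    item_bank.remove(item)
    theta += step if correct else -step
    step /= 2                                   # shrink steps as the estimate settles
    print(f"item b = {item:+.1f}, correct = {correct}, theta -> {theta:+.2f}")
```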
Biased Test Item
an item that favors one particular group of examinees in relation to another when differences in group ability are controlled
Cumulatively Scored Test
assumption that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.
Method of Equal-Appearing Intervals
can be used to obtain data that are interval in nature
Ipsative Scoring
comparing a testtaker's score on one scale within a test to another scale within that same test.
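A minimal sketch with hypothetical scales: each scale score is expressed relative to the same testtaker's own average, not relative to other people's scores.

```python
profile = {"dominance": 18, "affiliation": 24, "autonomy": 12}  # one testtaker (hypothetical)

own_mean = sum(profile.values()) / len(profile)
for scale, score in profile.items():
    # positive = stronger than this person's other scales, regardless of norms
    print(f"{scale}: {score - own_mean:+.1f} relative to the testtaker's own mean")
```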
Test Construction
consists of -scaling -types of scales -scaling methods -writing items -scoring items
Method of Paired Comparisons
Testtakers are presented with pairs of stimuli and asked to choose between them. Ex: select the behavior that you think would be more justified: a) cheating on taxes if one has a chance, b) accepting a bribe in the course of one's duties. -For each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges. -The test score reflects the number of times the testtaker's choices agreed with those of the judges.
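A minimal scoring sketch on hypothetical pairs: the score is the number of choices agreeing with the option the judges deemed more justifiable.

```python
pairs = [("cheating on taxes", "accepting a bribe"),
         ("jaywalking", "shoplifting")]                 # hypothetical stimulus pairs
judges_choice = ["cheating on taxes", "jaywalking"]     # majority view of the judges
testtaker_choice = ["cheating on taxes", "shoplifting"] # one testtaker's selections

score = sum(t == j for t, j in zip(testtaker_choice, judges_choice))
print(f"score: {score} of {len(pairs)} choices agree with the judges")
```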
Expert Panels
experts may be employed to conduct a qualitative item analysis
Multiple-Choice
has 3 elements: 1) a stem, 2) a correct alternative or option, 3) distractors/foils -stem --> A psychological test, an interview, and a case study are: -correct alt. --> a) psychological assessment tools -distractors --> b) standardized behavioral samples; c) reliable assessment instruments; d) theory-linked measures
Item Reliability Index
indication of the internal consistency of the scale -Factor analysis can also provide an indication of whether items that are supposed to be measuring the same thing load on a common factor.
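A minimal sketch on hypothetical data of one common computation: the item standard deviation multiplied by the item-total correlation (statistics.correlation requires Python 3.10+).

```python
import statistics

item = [1, 1, 0, 1, 0, 0]            # dichotomous item scores (hypothetical)
total = [48, 45, 30, 50, 28, 35]     # total test scores of the same testtakers

p = sum(item) / len(item)            # item difficulty
s_item = (p * (1 - p)) ** 0.5        # standard deviation of a dichotomous item
r_it = statistics.correlation(item, total)   # item-total correlation
print(f"item-reliability index = {s_item:.3f} * {r_it:.3f} = {s_item * r_it:.3f}")
```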
Sensitivity Review
items are examined in relation to fairness to all prospective testtakers, with a check for offensive language, stereotypes, etc.
Selected-Response Format (Item Format)
items require testtakers to select a response from a set of alternative responses. -multiple choice -matching -true-false
Constructed-Response Format (Item Format)
items require testtakers to supply or to create the correct answer, not merely to select it.
Multidimensional Rating Scales
more than one dimension is thought to underlie the ratings
Unidimensional Rating Scales
only one dimension is presumed to underlie the ratings
Preliminary Questions of Test Conceptualization
regarding the test: -what is it designed to measure? -what is the objective? -is there a need for it? -who will take/use it? -what content will it cover? -how will it be administered? -what is the ideal format of it? -should more than one form be developed? -what special training will be required of users for administering or interpreting it? -what types of responses will be required of testtakers? -who benefits from an administration? -is there any potential harm as a result of administration? -how will meaning be attributed to scores on the test?
Class Scoring
responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).
Test Development
test development is an umbrella term for all that goes into the process of creating a test
Item Fairness
the degree, if any, to which a test item is biased
Item Analysis
the nature of the item analysis will vary depending on the goals of the test developer -among the tools test developers might employ to analyze and select items are: an index of the item's difficulty, reliability, validity and discrimination
Scaling
the process of setting rules for assigning numbers in measurement
Item-Difficulty Index
the proportion of respondents answering an item correctly -For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
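A minimal sketch on hypothetical responses; p is just the proportion answering correctly.

```python
item_responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 1 = correct, 0 = incorrect (hypothetical)

p = sum(item_responses) / len(item_responses)
print(f"p = {p:.2f}")   # 0.70: on the easy side of the ~.5 optimum for discrimination
```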