Psych Testing Ch. 8: Test Development

Writing Items for Computer Administration

-Item-bank: a relatively large and easily accessible collection of test questions -computerized adaptive testing (CAT)

Scoring Items of Test Construction

-cumulatively scored test -class scorings -ipsative scoring

What is a "Good Item" in Test Tryout

-reliable and valid -discriminates among testtakers: high scorers on the test overall tend to answer the item correctly, while low scorers tend to answer it incorrectly

Test Revision

-revision in new test development -revision in the life cycle of a test -cross-validation -co-validation -quality assurance -the use of IRT in building and revising tests

Item Development in Tests

-test items may be pilot studied to evaluate whether they should be included in the final form of the instrument

Test Conceptualization

-the impetus for developing a new test is some thought that "there ought to be a test for..."

3 Possible Applications of IRT

1) evaluating existing tests for the purpose of mapping test revisions, 2) determining measurement equivalence across testtaker populations, and 3) developing item banks

The Item-Validity Index

Allows test developers to evaluate the validity of items in relation to a criterion measure.

Cross-Validation

Cross-validation refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion. -Item validities inevitably become smaller when the test is administered to a second sample; this decrease is called validity shrinkage.

Likert Scale

Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree/disagree or approve/disapprove continuum. -typically reliable

Comparative Scaling

Entails judgments of a stimulus in comparison with every other stimulus on the scale.

Qualitative Methods

Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.

Types of Scales

Scales are instruments used to measure some trait, state, or ability. They may be categorized in many ways (e.g., multidimensional, unidimensional, etc.). -L. L. Thurstone was influential in the development of sound scaling methods

Categorical Scaling

Stimuli (e.g. index cards) are placed into one of two or more alternative categories.

5 Stages of Test Development

Test Conceptualization --> Test Construction --> Test Tryout --> Item Analysis --> Test Revision --> [back to Test Tryout]

Guessing

Test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory method of correcting for guessing has been found.

Analysis of Item Alternatives

The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
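A minimal sketch of how the comparison of upper and lower scorers might be tabulated for one multiple-choice item. The function name, the option labels, and the response data are hypothetical and only for illustration; how the upper and lower groups are formed is left outside the sketch.

```python
from collections import Counter

def alternative_counts(upper_choices, lower_choices, options="abcd"):
    """Tally how often each alternative was chosen by upper and lower scorers.

    Returns {option: (count_in_upper_group, count_in_lower_group)}.
    """
    upper = Counter(upper_choices)
    lower = Counter(lower_choices)
    return {opt: (upper[opt], lower[opt]) for opt in options}

# Hypothetical item whose correct alternative is "a".
upper_group = ["a", "a", "a", "a", "b"]
lower_group = ["a", "b", "c", "c", "d"]

for option, (n_upper, n_lower) in alternative_counts(upper_group, lower_group).items():
    print(option, n_upper, n_lower)
```

In this made-up data the correct alternative draws mostly upper scorers, and the distractors c and d draw only lower scorers, which is the pattern a well-functioning item would be expected to show.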

Think Aloud Test Administration

Think aloud test administration - respondents are asked to verbalize their thoughts as they occur during testing.

Item Characteristic Curves (ICC)

a graphic representation of item difficulty and discrimination
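A rough sketch of the kind of curve an ICC plots. The card does not name a specific model; the two-parameter logistic form used here is an assumption chosen for illustration, with `a` standing for discrimination (steepness) and `b` for difficulty (location).

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response at ability level theta.

    Assumes a two-parameter logistic model (an illustrative choice):
    a = discrimination (slope), b = difficulty (location on the ability axis).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating, average difficulty.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(icc_2pl(theta, a=1.5, b=0.0), 3))
```

Under this assumed model, a testtaker whose ability equals the item's difficulty has a .5 probability of answering correctly, and a larger `a` makes the curve rise more steeply around that point.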

Method of Equal-Appearing Intervals

can be used to obtain data that are interval in nature

Expert Panels

experts may be employed to conduct a qualitative item analysis

Sensitivity Review

items are examined for fairness to all prospective testtakers, with checks for offensive language, stereotypes, etc.

Multidimensional Rating Scales

more than one dimension is thought to underlie the ratings

Unidimensional Rating Scales

only one dimension is presumed to underlie the ratings

Class Scoring

responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).

Item-Difficulty Index

the proportion of respondents answering an item correctly -For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
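The index is a simple proportion, which can be sketched as follows. The function name and the response data are hypothetical; items are assumed to be scored 1 for correct and 0 for incorrect.

```python
def item_difficulty(item_scores):
    """Proportion of testtakers who answered the item correctly.

    item_scores: list of 1 (correct) / 0 (incorrect), one entry per testtaker.
    """
    if not item_scores:
        raise ValueError("need at least one response")
    return sum(item_scores) / len(item_scores)

# Hypothetical data: 10 testtakers, 6 of whom answered correctly.
scores = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(item_difficulty(scores))  # 0.6
```

Note that a higher value means an easier item, since more testtakers answered it correctly.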

Other Considerations in Item Analysis

-guessing -item fairness -a biased test item -speed tests

Test Tryout

-test should be tried out on the same population that it was designed for -5-10 respondents per item -should be administered in the same manner, and have the same instructions, as the final product
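The 5-10 respondents-per-item guideline translates into a tryout sample size by simple multiplication. A small sketch, with a hypothetical function name and a hypothetical 30-item test:

```python
def tryout_sample_range(n_items, per_item_min=5, per_item_max=10):
    """Return (minimum, preferred) tryout sample sizes for a test with n_items items."""
    return n_items * per_item_min, n_items * per_item_max

# A hypothetical 30-item test.
print(tryout_sample_range(30))  # (150, 300)
```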

Revision in the Life Cycle of a Test

Existing tests may be revised if the stimulus material or verbal material is dated, some outdated words have become offensive, norms no longer represent the population, psychometric properties could be improved, or the underlying theory behind the test has changed. -In test revision the same steps are followed as with new tests (i.e., test conceptualization, construction, tryout, item analysis, and revision).

Item Development in Norm-Referenced Tests

Generally, a good item on a norm-referenced achievement test is one that high scorers on the test answer correctly and low scorers answer incorrectly.

Item Format

Includes variables such as the form, plan, structure, arrangement, and layout of individual test items. -selected-response format -constructed response format

The Item-Discrimination Index

Indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test. -a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
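The difference in proportions described above can be sketched as follows. The function name and data are hypothetical, and the sketch assumes the testtakers have already been split into high-scoring and low-scoring groups on the basis of total test score; how that split is made is not covered here.

```python
def item_discrimination(upper_group, lower_group):
    """Difference between the proportions of high and low scorers passing an item.

    Each argument is a list of 1 (correct) / 0 (incorrect) item scores for
    the testtakers in that group.
    """
    p_upper = sum(upper_group) / len(upper_group)
    p_lower = sum(lower_group) / len(lower_group)
    return p_upper - p_lower

# Hypothetical data: 4 of 5 high scorers and 1 of 5 low scorers answer correctly.
upper = [1, 1, 1, 1, 0]
lower = [1, 0, 0, 0, 0]
print(round(item_discrimination(upper, lower), 2))  # 0.6
```

A value near +1 means almost only high scorers pass the item, a value near 0 means the item does not separate the groups, and a negative value means low scorers pass it more often than high scorers, which usually flags the item for review.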

Speed Tests

Item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be.

The Use of IRT in Building and Revising Tests

Items are evaluated on item-characteristic curves (ICC) in which performance on items is related to underlying ability. -3 possible applications of IRT in building and revising tests

Guttman Scale

Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. -All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
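The ordering property can be checked mechanically. A minimal sketch, assuming responses are coded 1 for agree and 0 for disagree and are listed from the mildest statement to the strongest; the function name and data are hypothetical.

```python
def is_guttman_consistent(responses):
    """True if agreeing with a stronger statement always implies agreeing with milder ones.

    responses: 1 (agree) / 0 (disagree), ordered from mildest to strongest item.
    A consistent pattern is a run of 1s followed by a run of 0s.
    """
    return all(milder >= stronger for milder, stronger in zip(responses, responses[1:]))

print(is_guttman_consistent([1, 1, 1, 0, 0]))  # True
print(is_guttman_consistent([1, 0, 1, 0, 0]))  # False
```

The second pattern fails because the respondent agrees with the third statement while rejecting the milder second one.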

Scaling Methods of Test Construction

Numbers can be assigned to responses to calculate test scores using a number of methods -Rating Scales -Likert Scale -Method of Paired Comparisons -Comparative Scaling -Categorical Scaling -Guttman Scale -Method of Equal-Appearing Intervals

Quality Assurance

Test developers employ examiners who have experience testing members of the population targeted by the test. Examiners follow standardized procedures and undergo training. -anchor protocols are used

Item Pool

The reservoir or well from which items will or will not be drawn for the final version of the test. -comprehensive sampling provides a basis for content validity of the final version of the test.

Qualitative Item Analysis

a general term for various nonstatistical procedures designed to explore how individual test items work. -qualitative methods -think aloud test administration -expert panels -sensitivity review

Rating Scales

a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. -all rating scales result in ordinal level data -some are unidimensional, others are multidimensional

Co-Validation

a test validation process conducted on two or more tests using the same sample of testtakers. -economical for test developers

Computerized Adaptive Testing (CAT)

an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items. -able to provide economy in testing time and number of items presented -tends to reduce floor effects and ceiling effects

Biased Test Item

an item that favors one particular group of examinees in relation to another when differences in group ability are controlled

Cumulatively Scored Test

assumption that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.

Ipsative Scoring

comparing a testtaker's score on one scale within a test to another scale within that same test.

Test Construction

consists of -scaling -types of scales -scaling methods -writing items -scoring items

Method of Paired Comparisons

Testtakers are presented with pairs of stimuli and asked to compare them, selecting one of each pair according to some rule.
ex: select the behavior that you think would be more justified: a) cheating on taxes if one has a chance b) accepting a bribe in the course of one's duties
-For each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.
-The test score would reflect the number of times the choices of a testtaker agreed with those of the judges.
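The scoring rule described above, counting agreements with the judges, might be sketched like this. The function name, the option labels, and the judges' key are hypothetical.

```python
def paired_comparison_score(choices, judge_key):
    """Number of pairs on which the testtaker chose the option favored by the judges.

    choices:   the testtaker's selection for each pair (e.g. "a" or "b").
    judge_key: the option the majority of judges selected for each pair.
    """
    if len(choices) != len(judge_key):
        raise ValueError("choices and judge_key must cover the same pairs")
    return sum(choice == key for choice, key in zip(choices, judge_key))

# Hypothetical 4-pair scale.
judge_key = ["a", "b", "b", "a"]
testtaker = ["a", "b", "a", "a"]
print(paired_comparison_score(testtaker, judge_key))  # 3
```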

Multiple-Choice

has 3 elements: 1) a stem 2) a correct alternative or option 3) distractors/foils
-stem --> A psychological test, an interview, and a case study are:
-correct alt. --> a) psychological assessment tools
-distractors --> b) standardized behavioral samples; c) reliable assessment instruments; d) theory-linked measures

Item Reliability Index

indication of the internal consistency of the scale -Factor analysis can also provide an indication of whether items that are supposed to be measuring the same thing load on a common factor.

Preliminary Questions of Test Conceptualization

regarding the test:
-what is it designed to measure?
-what is the objective?
-is there a need for it?
-who will take/use it?
-what content will it cover?
-how will it be administered?
-what is the ideal format of it?
-should more than one form be developed?
-what special training will be required of users for administering or interpreting it?
-what types of responses will be required of testtakers?
-who benefits from an administration?
-is there any potential harm as a result of administration?
-how will meaning be attributed to scores on the test?

Test Development

test development is an umbrella term for all that goes into the process of creating a test

Item Fairness

the degree, if any, to which a test item is biased

Item Development in Criterion-Referenced Tests

-Ideally, each item on a criterion-oriented test addresses the issue of whether the respondent has met certain criteria. -Development of a criterion-referenced test may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it.

Test Construction Writing Items

-Item Pool -Item Format -Multiple choice -Computer Administration

Revision in New Test Development

-Items are evaluated as to their strengths and weaknesses - some items may be eliminated. -Some items may be replaced by others from the item pool. -Revised tests will then be administered under standardized conditions to a second sample -Once a test has been finalized, norms may be developed from the data and it is said to be standardized.

Stimulus of Test Conceptualization

-The stimulus could be knowledge of psychometric problems with other tests, a new social phenomenon, or any number of things. -there may be a need to assess mastery in an emerging occupation

Anchor Protocols

a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies. -A discrepancy between scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.

Selected-Response Format (Item Format)

items require testtakers to select a response from a set of alternative responses. -multiple choice -matching -true-false

Constructed-Response Format (Item Format)

items require testtakers to supply or to create the correct answer, not merely to select it.

Item Analysis

the nature of the item analysis will vary depending on the goals of the test developer -among the tools test developers might employ to analyze and select items are: an index of the item's difficulty, reliability, validity and discrimination

Scaling

the process of setting rules for assigning numbers in measurement
