Chapter 8 - Test Development


Computerized adaptive testing (CAT)

- An interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items (item branching)
- Provides economy in testing time and in the number of items presented
- Tends to reduce floor effects and ceiling effects
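Item branching can be sketched with a simple up/down rule. This is a hedged illustration, not the chapter's algorithm: real CATs typically select items from IRT-based ability estimates, and the step size and difficulty levels here are invented.

```python
# Minimal sketch of item branching in a CAT session (illustrative only):
# a correct answer routes the testtaker to a harder item, an incorrect
# answer to an easier one. The difficulty scale (1-10) is made up.

def next_item_difficulty(current: int, answered_correctly: bool,
                         step: int = 1, floor: int = 1, ceiling: int = 10) -> int:
    """Return the difficulty level of the next item to administer."""
    if answered_correctly:
        return min(current + step, ceiling)
    return max(current - step, floor)

# A short simulated session starting at mid difficulty:
path = [5]
for correct in (True, True, False, True):
    path.append(next_item_difficulty(path[-1], correct))
print(path)  # [5, 6, 7, 6, 7]
```

Because the branching clamps at the floor and ceiling, the administered items stay in range even for extreme performers, which is one way CAT reduces floor and ceiling effects.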

Cumulatively scored

- Assumption that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure

Ipsative scoring

- Comparing a testtaker's score on one scale within a test to another scale within that same test
- It would not be appropriate to draw inter-individual comparisons using this type of scoring

Pilot work

- Create a prototype and receive feedback --> focus groups; expert panels
- Participants can think out loud and speak freely about what they are thinking during the test

Cross-validation and Co-validation

- Cross-validation: the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
  --> Be aware of validity shrinkage
- Co-validation: a test validation process conducted on two or more tests using the same sample of testtakers
  --> Co-validation is economical for test developers
  --> Minimizes sampling error
  --> Co-norming: when a fixed battery of tests was normed on the same group or population
  --> Saves time and money

Ceiling effect

- Diminished ability to distinguish testtakers at the high end of the continuum
- Example: if we measured integrity in a sample of nuns, most would score at the high end of the spectrum

Floor effect

- Diminished ability to distinguish testtakers at the low end of the continuum
- Example: when measuring integrity in prisoners, you are looking at the low end of the entire population

Comparative scaling

- Entails judgment of a stimulus in comparison with another stimulus on the scale
- Example: comparing one brand or product against another

Revision in the Life Cycle of a Test

- Existing tests may be revised if the stimulus or verbal material is dated, some previously acceptable words have become offensive, norms no longer represent the population, psychometric properties could be improved, or the underlying theory behind the test has changed
- Test revision follows the same steps as new test development (i.e., test conceptualization, construction, tryout, item analysis, and revision)
- Cross-validation and co-validation
- The use of IRT in building and revising tests

Other considerations in item analysis

- Guessing: test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory correction for guessing has been achieved
- Item fairness: the degree, if any, to which a test item is biased
  --> A biased test item is one that favors one particular group of examinees in relation to another when differences in group ability are controlled
  --> Differential item functioning (DIF): a test item functions differently in one group of testtakers as compared to another group of testtakers with the same underlying trait level
- Speed tests: the closer an item is to the end of the test, the more difficult it may appear to be
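The classic (and, as the text notes, imperfect) correction for guessing subtracts a penalty for wrong answers. This sketch assumes the standard formula-scoring rule, which is not spelled out in this chapter:

```python
def corrected_score(num_right: int, num_wrong: int, options_per_item: int) -> float:
    """Classic correction-for-guessing (formula scoring): subtract a
    penalty for wrong answers on the assumption that every wrong answer
    was a blind guess among the alternatives. Omitted items are neither
    rewarded nor penalized."""
    return num_right - num_wrong / (options_per_item - 1)

# 40 right and 12 wrong on 4-option multiple-choice items:
print(corrected_score(40, 12, 4))  # 36.0
```

The formula is "entirely satisfactory" only if wrong answers really are blind guesses; testtakers with partial knowledge are over-penalized, which is one reason the guessing problem remains unsolved.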

Item formats

- Item format includes variables such as the form, plan, structure, arrangement, and layout of individual test items
- Selected-response format: items require testtakers to select a response from a set of alternative responses
  --> Includes all question types where testtakers pick the correct option(s) from a list: multiple-choice, true-false, or multiple-response questions
- Constructed-response format: items require testtakers to supply or create the correct answer, not merely select it
- One selected-response format is multiple-choice, which typically has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils
- Other commonly used selected-response formats include matching and true-false items

Item analysis

- Item-difficulty index: the proportion of respondents answering an item correctly
- Item-endorsement index: the percentage of agreement, as opposed to percentage correct (used when items have no "correct" answer)
- Item-reliability index: an indication of the internal consistency of the scale
  --> Factor analysis can also indicate whether items that are supposed to be measuring the same thing load on a common factor
- Item-validity index: allows test developers to evaluate the validity of items in relation to a criterion measure
- Item-discrimination index: indicates how adequately an item separates or discriminates between high scorers and low scorers
  --> d-value: the proportion of high scorers answering an item correctly minus the proportion of low scorers answering the item correctly

Revision in New Test Development

- Items are evaluated as to their strengths and weaknesses; some items may be eliminated
- Some items may be replaced by others from the item pool
- The revised test is then administered under standardized conditions to a second sample
- Once the test has been finalized, norms may be developed from the data, and the test is said to be standardized

The use of IRT in Building and Revising Tests

- Items are evaluated using item-characteristic curves (ICCs), in which performance on items is related to underlying ability
- Three possible applications of IRT in building and revising tests: (1) evaluating existing tests for the purpose of mapping test revisions, (2) determining measurement equivalence across testtaker populations, and (3) developing item banks

Guttman scale

- Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
- All respondents who agree with the stronger statements of the attitude will also agree with the milder statements
- Focuses on the set of statements the respondent agrees with on a particular subject
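A perfect Guttman response pattern is cumulative: with items ordered from mildest to strongest, agreement looks like a run of 1s followed by a run of 0s. A small sketch of checking one respondent's pattern (the data and function name are invented for illustration):

```python
def is_guttman_consistent(endorsements):
    """Check one respondent's 0/1 pattern against the Guttman ideal.
    Items are ordered mildest -> strongest, so a perfect pattern is a
    run of 1s followed by a run of 0s: agreeing with a stronger
    statement implies agreeing with all milder ones."""
    seen_zero = False
    for e in endorsements:
        if e == 0:
            seen_zero = True
        elif seen_zero:        # a 1 after a 0 breaks the cumulative pattern
            return False
    return True

print(is_guttman_consistent([1, 1, 1, 0, 0]))  # True  (perfect scale type)
print(is_guttman_consistent([1, 0, 1, 0, 0]))  # False (a reversal/error)
```

In practice, scalogram analysis tallies such reversals across all respondents to judge how well the item set forms a Guttman scale.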

Qualitative item analysis

- Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
- "Think aloud" test administration: respondents are asked to verbalize their thoughts as they occur during testing
- Expert panels: experts may be employed to conduct a qualitative item analysis
- Sensitivity review: items are examined in relation to fairness to all prospective testtakers; check for offensive language, stereotypes, etc.

Scaling

- Quantifying or calibrating the measure
- Types of scales: unidimensional, multidimensional, categorical, ordinal, etc.
- Key question: how are we going to understand the responses we are getting? Is the item dichotomous (true/false)?
- Likert scales are common in psychology and are typically reliable
- Some rating scales are unidimensional, meaning that only one dimension is presumed to underlie the ratings (e.g., height)
- Other rating scales are multidimensional, meaning that more than one dimension is thought to underlie the ratings
  --> Multidimensional scaling represents perceived similarities among stimuli by arranging similar stimuli in spatial proximity to one another, while disparate stimuli are placed far apart from one another

Class scoring

- Responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing)
- Also called category scoring
- Used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis

Test construction

- Scaling
- Writing items
- Scoring
- Pilot work

Categorical scaling

- Stimuli (e.g., index cards) are placed into one of two or more alternative categories
- Comes up mostly in personality tests
- Example: a categorical scale for the political party affiliation of a group of Americans might use 1 to denote Republican, 2 to denote Democrat, and 3 to denote Independent

Test tryout

- The test should be tried out on the same population it was designed for
- Rule of thumb: 5-10 respondents per item
- The tryout should be administered in the same manner, and with the same instructions, as the final product
- What is a good item?
  --> A good item is reliable and valid
  --> A good item discriminates among testtakers: high scorers on the test overall answer the item correctly
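The 5-10 respondents-per-item rule of thumb translates directly into a target sample-size range for the tryout; a trivial sketch (the function name is mine):

```python
def tryout_sample_range(num_items: int, low: int = 5, high: int = 10):
    """Rule-of-thumb tryout sample size: 5-10 respondents per item."""
    return num_items * low, num_items * high

# A 30-item draft test calls for roughly 150-300 tryout respondents:
print(tryout_sample_range(30))  # (150, 300)
```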

Test conceptualization

- The impetus for developing a new test is some thought that "there ought to be a test for..."
- The motivation could be a desire to remedy psychometric problems with existing tests, a new social phenomenon, or a new population of interest
- There may be a need to assess mastery in an emerging occupation

Item pool

- The set of items from which the final version of the test will be derived
- Comprehensive sampling of the construct provides a basis for the content validity of the final version of the test

Rating scale

- Words, statements, or symbols on which testtakers can indicate the strength of a particular trait, attitude, or emotion
- Example: Likert scale

Item characteristic curves

- An item-characteristic curve (ICC) is a graphic representation of item difficulty and discrimination
  --> The steeper the slope, the greater the item discrimination
  --> An item may also vary in difficulty level: an easy item shifts the ICC to the left along the ability axis (many people will likely get the item correct), while a difficult item shifts the ICC to the right (fewer people will answer the item correctly)
- The a parameter indicates the relatedness (the slope) of the item to the latent construct (e.g., marital distress)
- The b parameter indicates the point on the latent construct where the probability of endorsing the item equals .50, while controlling for mean differences along the continuum (e.g., of marital distress)
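The a and b parameters described above correspond to the two-parameter logistic (2PL) IRT model, P(θ) = 1 / (1 + e^(−a(θ − b))). A minimal sketch (the specific a and b values are arbitrary choices for illustration):

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: probability of endorsing (or
    correctly answering) the item at ability/trait level theta, with
    discrimination a (slope) and difficulty b (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b, the endorsement probability is exactly .50:
print(icc_2pl(0.0, a=1.5, b=0.0))  # 0.5

# A harder item (larger b) shifts the curve right, so the same person
# has a lower probability of endorsing it:
print(icc_2pl(0.0, a=1.5, b=1.0) < icc_2pl(0.0, a=1.5, b=0.0))  # True
```

Plotting icc_2pl over a range of theta values reproduces the left/right shifts with b and the slope changes with a described in the bullets above.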

Five stages of test development

1. Test conceptualization
2. Test construction
3. Test tryout
4. Analysis
5. Revision (then back to step 3 as needed)

The "a" parameter tells us something about the ____1____ of the item, whereas the "b" parameter tells us something about the ____2____ of the item.

1. relatedness / slope / discrimination
2. difficulty / probability of endorsement at .50 / severity of endorsement

Which one of the following is NOT a part of the test construction phase of test development?

Defining your construct of interest
(Parts of the test construction phase ARE: writing a test prototype, piloting the test, creating an item pool, determining the scaling method, and selecting your response format.)

One of the advantages of computerized adaptive testing (CAT) is that

Floor effects are reduced

Test development

The process of creating a test

In item analyses, the d-value is an indication of item discrimination. Which of these would be most concerning if we were evaluating the d-value for our items?

a negative value (low scorers outperform high scorers on the item)

Item branching refers to

administering certain test items on a test depending on the testtakers' responses to previous test items.

The percentage of agreement on a particular item is referred to as the item-___________ index.

endorsement

In the context of psychological test development, pilot work refers to the

preliminary research conducted prior to the stage of test construction.

The idea for a new test may come from

a social need, a review of the available literature, or common-sense appeal (i.e., all of these)

Guttman scales:

typically are constructed so that agreement with one statement may predict agreement with another statement

Preliminary questions for test conceptualization

• What is the test designed to measure?
• What is the objective of the test?
• Is there a need for this test?
• Who will use this test?
• Who will take this test?
• What content will the test cover?
• How will the test be administered?
• What is the ideal format of the test?
• Should more than one form of the test be developed?
• What special training will be required of test users for administering or interpreting the test?
• What types of responses will be required of testtakers?
• Who benefits from an administration of this test?
• Is there any potential for harm as the result of an administration of this test?
• How will meaning be attributed to scores on this test?

