EDP301 Midterm

bias review panel

1. Assemble a panel of individuals from subgroups that might be adversely affected by the test 2. Give panel members thorough orientation and guided practice 3. Discussions of illustrative practice items enhance panel members' understanding of assessment bias

Statistical Analysis

1. Evidence is gathered on high-stakes tests to be given to large groups of students 2. Potential bias is detected through differential item functioning (DIF) procedures • Items are identified for which subgroup differences in performance exist • Items with DIF that are also judged to be biased are removed from the test

Education for All Handicapped Children Act

Enacted in 1975; recognized the growing need to educate children with disabilities Judicial rulings required states to provide education for students with disabilities Also called Public Law 94-142 (P.L. 94-142)

1. Thou shalt not provide opaque directions to students regarding how to respond. 2. Thou shalt not employ ambiguous statements. 3. Thou shalt not provide students with unintentional clues regarding appropriate responses. 4. Thou shalt not employ complex syntax in your items. 5. Thou shalt not use vocabulary that is more advanced than required.

5 item-writing commandments for selected response

Criterion-referenced measurement

• Reflects the degree to which curricular aims have been mastered • Absolute interpretation • Hinges on the quality of curricular aims

presentation, response, setting, timing and schedule

Accommodation Types

Content-related evidence of validity

Adequacy of the assessment's content to measure curricular aims or content standards—knowledge, skills, or attitudes •Concern is the representativeness and relevance of assessment content to content domain •Most important type of evidence for classroom assessments

presentation accommodations

Alternate presentation of material: auditory, multi-sensory, tactile, and visual

Analytic Scoring

Assigns points to each factor • Identifies students' strengths and weaknesses • Ignores overall quality of response

Criterion-referenced measurements

Desire high p values for all items: post-instruction difficulties (p) should approach 1.0 •Desire low discrimination (D) values •Two different general approaches to item analysis are used, similar to those used for norm-referenced tests
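
A minimal sketch (not from the source) of the pretest/posttest idea above: compute an item's p value before and after instruction from hypothetical 0/1 response data; for a well-taught criterion-referenced item, the post-instruction p should approach 1.0.

```python
# Hypothetical pretest/posttest responses to one item (1 = correct, 0 = incorrect).
pretest_responses = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
posttest_responses = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

def difficulty(responses):
    """p value: proportion of students answering the item correctly."""
    return sum(responses) / len(responses)

p_pre = difficulty(pretest_responses)
p_post = difficulty(posttest_responses)

# For an effectively taught criterion-referenced item, p_post should approach 1.0;
# the pre-to-post gain suggests the item is sensitive to instruction.
print(f"p (pretest)  = {p_pre:.2f}")
print(f"p (posttest) = {p_post:.2f}")
print(f"gain         = {p_post - p_pre:.2f}")
```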

response accommodations

Complete activities, assignments, and assessments in different ways Use assistive devices

Standards for Educational and Psychological Testing

First published in 1966 by the American Educational Research Association (AERA) •Provide detailed guidelines for test development, evaluation of tests, and appropriate uses of tests •Often invoked in legal proceedings •Updated in 2014

matching

Consists of two parallel lists of words or phrases that students match according to some specified association Entries in the list for which a match is sought are referred to as premises Entries in the list from which matches are made are referred to as responses

High Stakes Testing

Defined as an assessment where important consequences ride on the results Federally required accountability tests are high-stakes tests Decisions regarding students and staff are influenced by the results NCLB does not require diploma denial or holding back a student; states make those decisions

Internal consistency

Degree of homogeneity among test items Administer the assessment Calculate Kuder-Richardson 20 or Cronbach's coefficient alpha
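
A minimal sketch, not from the source, of computing Cronbach's coefficient alpha from hypothetical 0/1 item scores; for dichotomous items scored this way, the value coincides with KR-20.

```python
import statistics

# Hypothetical item-level scores: rows are students, columns are items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(item_scores[0])                          # number of items
    item_columns = list(zip(*item_scores))           # transpose: one tuple of scores per item
    sum_item_vars = sum(statistics.pvariance(col) for col in item_columns)
    totals = [sum(row) for row in item_scores]       # each student's total score
    return (k / (k - 1)) * (1 - sum_item_vars / statistics.pvariance(totals))

print(f"alpha = {cronbach_alpha(scores):.2f}")
```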

Criterion-related evidence of validity

Degree to which scores predict performance on some criterion variable Compare scores with another measure of performance (criterion variable) E.g., How well does the SAT predict how well a student will do in college? Often a correlation coefficient is used to measure the strength of the association between variables
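
A brief, hypothetical illustration of criterion-related evidence: correlating test scores with a later criterion variable. The data are made up, and statistics.correlation requires Python 3.10+.

```python
import statistics

# Hypothetical predictor scores (e.g., an admissions test) and criterion values (later college GPA).
test_scores = [1200, 1050, 1340, 980, 1110, 1270]
college_gpa = [3.4, 2.9, 3.8, 2.5, 3.0, 3.6]

# The Pearson correlation coefficient summarizes how strongly the test predicts the criterion.
r = statistics.correlation(test_scores, college_gpa)
print(f"r = {r:.2f}")
```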

Content Standards

Describe the knowledge or skills students need to learn • Most useful standards • Also called "academic content standards"

To construct and evaluate classroom assessments To use and interpret assessments developed by others To plan instruction based on instructionally illuminating assessments To learn the vocabulary of assessment

What education students must know

Item-discrimination indices procedure: Order papers from high to low by total score Divide papers into two groups—a high group and a low group Calculate the p value for each item for the high group and the low group • Divide the # of students in the high group who answered the item correctly by the # of students in the high group • Repeat for the low group Subtract the low group's p value from the high group's p value to obtain the discrimination index (D)

Empirically Based Item Improvement
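
A minimal sketch of the high/low-group discrimination procedure described in the card above, using made-up papers; the discrimination index D is taken as p(high) minus p(low).

```python
# Hypothetical records: (total test score, 1/0 for whether the student answered this item correctly).
papers = [
    (48, 1), (45, 1), (44, 1), (41, 1), (40, 0),
    (30, 1), (28, 0), (25, 0), (22, 0), (20, 0),
]

# 1. Order papers from high to low by total score.
papers.sort(key=lambda record: record[0], reverse=True)

# 2. Divide papers into a high group and a low group.
half = len(papers) // 2
high_group, low_group = papers[:half], papers[half:]

# 3. Calculate the item's p value for each group.
p_high = sum(correct for _, correct in high_group) / len(high_group)
p_low = sum(correct for _, correct in low_group) / len(low_group)

# 4. The discrimination index D is the difference between the two p values.
D = p_high - p_low
print(f"p(high) = {p_high:.2f}, p(low) = {p_low:.2f}, D = {D:.2f}")
```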

alternate-forms

Equivalency of two forms of an assessment Administer two forms of an assessment to the same group Calculate the correlation coefficient between the two sets of responses

•Advantages: Measure complex learning outcomes; measure more than one outcome •Disadvantages: Difficult to write the item properly; difficult to score reliably

Essay Pro Con

Evidence Based on Response Process

Evidence typically comes from analyses of different test-takers' responses during the test If measuring logical reasoning abilities, it's important to know if the students are relying on logical reasoning processes as they complete the test • E.g., Questions could be posed at the conclusion of a test to determine the procedures employed by the test takers

Distractor Analysis

Examination of the incorrect options (distractors) for multiple-choice or matching items Determines how high- and low-group students are responding to an item's distractors Calculate difficulty (p) and discrimination (D) for each response alternative Review values for distractors • Are any p values too high or too low? • Are any D values negative? • Are there any patterns in responses that indicate modifications should be made?
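
A hypothetical sketch of a distractor analysis for one multiple-choice item: for each option, compare the proportions of high-group and low-group students choosing it; a distractor that attracts more high-group than low-group students deserves a second look.

```python
# Hypothetical responses to one multiple-choice item whose correct answer is "B".
high_group = ["B", "B", "B", "A", "B", "C", "B", "B"]   # students with high total scores
low_group  = ["A", "B", "C", "A", "D", "B", "A", "C"]   # students with low total scores
options, correct = ["A", "B", "C", "D"], "B"

for option in options:
    p_high = high_group.count(option) / len(high_group)
    p_low = low_group.count(option) / len(low_group)
    d = p_high - p_low   # for a distractor, a positive d is a warning sign
    label = "correct" if option == correct else "distractor"
    print(f"{option} ({label}): p_high={p_high:.2f}, p_low={p_low:.2f}, D={d:+.2f}")
```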

IEP

Federally prescribed in P.L. 94-142 •Developed by parents, teachers, and specialists •Describes how a particular child with disabilities should be educated Identifies annual curricular aims and specifies how they will be achieved Specifies services needed by the child Identifies assessment modifications

Testing

Final exams, midterms, quizzes • How much did students learn? • Historically paper-and-pencil

1. Be clear about the nature of the intended interpretation of scores as they relate to the decisions 2. Come up with propositions that must be supported if those interpretations are going to be accurate 3. Collect relevant evidence 4. Synthesize the whole works into a convincing validity argument

Generating a compelling validity argument

1. Score responses holistically and/or analytically. 2. Prepare a tentative scoring key in advance of judging students' responses. 3. Make decisions regarding the importance of the mechanics of writing prior to scoring. 4. Score all responses to one item before scoring responses to the next item. 5. Insofar as possible, evaluate responses anonymously.

Guidelines for Scoring Essays

Must rely on human judgment to determine instructed and uninstructed groups If the two groups are very different, e.g., in intellectual ability, they may differ for reasons other than instruction Can be difficult to isolate the two groups

Instructed/uninstructed group pros and cons

Performance Standards

Identify the desired level of proficiency for a content standard • Also called "academic achievement standards"

mean

measure of central tendency, Arithmetic average of a set of scores

median

measure of central tendency, Midpoint of a set of scores

Same group—posttest/pretest analysis disadvantages: Instruction must be completed before securing item analysis Pretest may be reactive • Students can be sensitized to items on the posttest from their experience on the pretest • Posttest becomes a function of the instruction and the pretest

Item Analysis for criterion-referenced measurements

1. If any of the items seemed confusing, which ones were they? 2. Did any items have more than one correct answer? If so, which ones? 3. Did any items have no correct answers? If so, which ones? 4. Were there words in any items that confused you? If so, which ones? 5. Were the directions for the test, or for particular subsections, unclear? If so, which ones?

Item Improvement Questionnaire for students

1. Convey to students a clear idea regarding the extensiveness of the response desired. 2. Construct items so the student's task is explicitly described. 3. Provide students with the approximate time to be expended on each item as well as each item's value. 4. Do not employ optional items. 5. Precursively judge an item's quality by composing, mentally or in writing, a possible response.

Item Writing Guidelines for Essays

1. The stem should consist of a self-contained question or problem. 2. Avoid negatively stated stems. 3. Do not let the length of the alternatives supply unintended clues. 4. Randomly assign correct answers to alternative positions. 5. Never use "all-of-the-above" alternatives, but do use "none-of-the-above" alternatives to increase item difficulty.

Item writing for multiple choice

1. Usually employ direct questions rather than incomplete statements, particularly for young students. 2. Structure the item so that a response should be concise. 3. Place blanks in the margin for direct questions or near the end of incomplete statements. 4. For incomplete statements, use only one or, at most, two blanks. 5. Make sure blanks for all items are equal in length.

Item-Writing Guidelines for Short-Answer Items

1. Phrase items so that a superficial analysis by the student suggests a wrong answer. 2. Rarely use negative statements, and never use double negatives. 3. Include only one concept in each statement. 4. Have an approximately equal number of items representing the two categories being tested. 5. Keep item length similar for both categories being tested.

Item-writing binary response

Provide a colleague with a brief description of the previously mentioned criteria Describe the key inference intended by the test Particularly useful with performance and portfolio assessments Offer to return the favor

Judgement by Colleague

Can be biased • Allow time between test construction & review Consider these five review criteria: 1. Do the items adhere to the guidelines and rules specified by the text? 2. Do the items contribute to score-based inferences? 3. Is the content still accurate? 4. Are there gaps in the content (lacunae)? 5. Is the test fair to all?

Judgement by self

Typically an overlooked group of reviewers Students review after completing the test Particularly useful for identifying • Problems with items and directions • Problems with the time allowed for completion of the test Use a questionnaire to collect data Expect carping from low-scoring students

Judgement by students

test-retest, alternate forms, internal consistency

Measuring Reliability

Elementary and Secondary Education Act (ESEA)

Most significant federal education legislation •Enacted in 1965 •Reauthorized every two to eight years •Focused on evaluating the progress of underserved students

range

measure of variability, Highest score in the set of scores minus the lowest score

group focused test interpretation

Necessary to describe the performance of a group Measures of central tendency and variability: • Mean • Median • Range • Standard deviation
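
A short sketch, with made-up scores, of the group-focused descriptive statistics named above, using Python's statistics module.

```python
import statistics

# Hypothetical set of class test scores.
scores = [72, 85, 91, 68, 77, 85, 90, 60, 83, 79]

print(f"mean   = {statistics.mean(scores):.1f}")     # arithmetic average of the scores
print(f"median = {statistics.median(scores):.1f}")   # midpoint of the set of scores
print(f"range  = {max(scores) - min(scores)}")       # highest score minus lowest score
print(f"SD     = {statistics.pstdev(scores):.1f}")   # population standard deviation
```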

Performance on accountability tests influences the public's perceptions of educational effectiveness, Federal initiative to use students' scores to evaluate teachers, Tests should not be instructional afterthoughts

New Reasons for Assessment

Curricular Aims

Play a prominent role in assessment choice Curriculum = the sought-for ends of instruction Instruction = the means teachers employ Teachers need to know the curricular labels of their locale Curricular aims need to be stated clearly and measure what's really important

external reviews

Possible developmental activities to enhance a high-stakes chemistry test's content representativeness 1. External experts (judges) review content 2. Focus is on the match of the assessment to the content standards 3. E.g., State dept. of education officials construct a statewide assessment 4. A panel of 20 content reviewers (subject- matter experts) considers the test's items

developmental care

Possible developmental activities to enhance a high-stakes chemistry test's content representativeness 1. Panel of content experts makes recommendations 2. The proposed content is systematically contrasted with 5 leading textbooks 3. A group of high school chemistry teachers provides suggestions 4. A group of college professors and state/national associations review the content and offer recommendations and modifications

Accommodations

Procedures to minimize assessment bias in students with disabilities •Practice that permits students with disabilities to have equitable access to instruction and assessment •Goal is to reduce or eliminate distortions in score inferences due to the disability •Must not fundamentally alter the skills or knowledge being assessed

Assessment Bias

Qualities of an assessment that distort students' performance because of characteristics of the students •Characteristics generally refer to group-defining characteristics: gender, ethnicity, socioeconomic status, religion •Takes two forms: offensiveness and unfair penalization

Elicited responses more closely approximate "real-world" behavior • Seldom is one asked in real life to choose responses from 4 nicely arranged alternatives or give a true-false judgement to a statement! Items typically measure higher-level knowledge and skills

Reasons for Constructed Response

percentiles

Relative interpretation •Compares a student's score with those of other students in the norm group Nationally normed groups Locally normed groups Indicates % of students in the norm group that the student outperformed A percentile of 60 indicates the student performed better than 60% of students in norm group •Most frequently used relative score •Easy to understand •However, their usefulness relies on the quality of norm group
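
A minimal, hypothetical sketch of a percentile rank: the percentage of norm-group scores the student outperformed.

```python
# Hypothetical norm-group scores and one student's score.
norm_group = [45, 52, 58, 60, 63, 66, 70, 74, 78, 85]
student_score = 71

# Percentile rank: percentage of students in the norm group the student outperformed.
outperformed = sum(1 for score in norm_group if score < student_score)
percentile = 100 * outperformed / len(norm_group)
print(f"percentile = {percentile:.0f}")   # 70 -> performed better than 70% of the norm group
```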

grade-equivalent form

Relative interpretation •Estimates student performance using grade level and months of school year It is a developmental score as it represents a continuous range of grade levels The score is an estimate of the grade level of a student who would obtain that score on that particular test •Most appropriate for basic skills •Two assumptions are implied: The subject area tested is equally emphasized at each grade level Student mastery increases at a constant rate

IDEA (Individuals with Disabilities Education Act)

Required curricular expectations for special education students to be consonant with expectations of all students Required inclusion of students with disabilities in assessment programs and public reporting of results Few negative consequences for noncompliance, therefore most states failed to comply

Instruction must be completed before securing item analysis Pretest may be reactive • Students can be sensitized to items on the posttest from their experience on the pretest • Posttest becomes a function of the instruction and the pretest

Same-group pretest/posttest disadvantages

relative interpretation

Score represents students' relative standing within a norm group (a norm group is a group of students who have taken a particular test)

absolute interpretation

Score represents what student can do Score represents degree of mastery

1. Binary-choice items 2. Multiple binary-choice items 3. Multiple-choice items 4. Matching items

Selected Response Types

Common Core State Standards (CCSS)

Set of identical curricular aims for the nation's schools: English language arts and mathematics •Federal aid is available to states that adopt the standards •Adopted by most states; however, identical curricular aims are unlikely due to state-level revisions

•Advantages: Because responses are produced by the student, partial knowledge is not sufficient •Disadvantages: Scoring can be difficult; scoring may result in inaccurate representations of students' abilities

Short Answer Pro Con

test-retest

Stability over time, same test administered twice

•Judgmental item-improvement procedures: human judgment is chiefly used •Empirical item-improvement procedures: based on students' responses

Strategies for Item Improvement

Constructed Response

Student constructs the response • Short answer, essay, speeches, soufflé

Selected Response

Student selects a response from a set of responses provided • True-false, matching, multiple-choice

Validity Evidence

Teachers make scads of decisions on a minute-by-minute basis Classroom assessment can provide reasonably accurate evidence upon which to base those decisions • Observation-based judgments are often off the mark Three types of evidence: content related, criterion related, construct related

Difficulty indices, discrimination indices, distractor analysis

Techniques for Evaluating Items

standardized test

Test administered, scored, & interpreted in a standard, predetermined manner •Designed to yield either norm-referenced or criterion-referenced inferences •Constructed primarily of selected-response items •Staggering differences in the level of effort associated with it compared to a classroom test

Classification-Consistency

Test results are used to classify test-takers into categories (e.g., Pass-fail) •A type of test-retest reliability •Measures consistency of the classification
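
A brief sketch, with hypothetical scores and cut score, of one simple way to gauge classification consistency: the proportion of students given the same pass/fail classification on two administrations.

```python
# Hypothetical scores from two administrations of the same test, with a pass/fail cut score.
cut_score = 70
first_admin  = [82, 65, 71, 90, 68, 74, 55, 77]
second_admin = [80, 62, 69, 93, 72, 75, 58, 79]

first_class  = [score >= cut_score for score in first_admin]    # True = pass
second_class = [score >= cut_score for score in second_admin]

# Classification consistency: proportion of students classified the same way both times.
agreements = sum(a == b for a, b in zip(first_class, second_class))
consistency = agreements / len(first_class)
print(f"classification consistency = {consistency:.2f}")
```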

response processes validity

The degree to which the cognitive processes test-takers employ during a test support an interpretation for a specific test use

Computer-Adaptive Assessment

The difficulty of the next item a student is given depends upon whether the student answered the previous item correctly Student's mastery can be determined with fewer items Provides a general fix on student status • Precludes the possibility of providing student-specific diagnostic data

test content validity

The extent to which an assessment procedure adequately represents the content of the curricular aims being measured

internal structure validity

The extent to which the internal organization of a test confirms an accurate assessment of the construct supposedly being measured

Measure students' current status, monitor student progress to determine if instructional changes need to be made, assign grades, determine instructional effectiveness

Traditional Reasons for Assessment

Judgemental item improvement

Use human judgment to improve test items •Three sources of judgments: self, colleagues, and students

Holistic Scoring

• Focuses on the entire response as a whole • Reflects overall performance For scoring a composition intended to reflect students' writing prowess: • Organization • Communicative Clarity • Adaptation to Audience • Word Choice • Mechanics (spelling, capitalization, punctuation)

1. Decision focus—Are clearly explicated decision options directly linked to a test's results? 2. Number of assessment targets—Is the number of proposed targets for a test sufficiently small so they represent an instructionally manageable number? 3. Assessment domain emphasized—Will the assessments to be built focus on the cognitive domain, the affective domain, or the psychomotor domain? 4. Norm-referencing and/or criterion-referencing—Will the score-based inferences based on students' test performances be norm-referenced, criterion-referenced, or both? 5. Selected versus constructed response mode—Can students' responses be selected responses, constructed responses, or both? 6. Relevant curricular configurations—Will students' performance on the assessment contribute to mastery of the state's officially approved curricular aims? 7. National subject-matter organization recommendations—Are the knowledge, skills, and/or affect assessed in line with the curricular recommendations of national subject-matter organizations? 8. NAEP assessment frameworks—If applicable, is what the assessment measures similar to what is measured by NAEP? 9. Collegial input—If available, has a knowledgeable colleague reacted to the proposed assessment?

What to Assess Considerations

Affective assessments

attitudes, interests, values

distractor analysis

calculated by examining the proportions of students choosing the incorrect options (distractors)

Item difficulties

calculated for each item using proportion correct

item discrimination

calculated for each item as the difference between the proportion correct in the high-ability group and the proportion correct in the low-ability group

Norm-referenced measurement

• Interpret performance in relation to the group, the norm • Relative interpretation • Used with aptitude or standardized achievement tests

Disparate Impact

does not necessarily = assessment bias If a test has a disparate impact on a particular racial, gender, or religious subgroup, then close scrutiny is warranted Not biased if true differences in ability exist Bias is present if the test offends or unfairly penalizes members of a subgroup

That a student is above or below their actual grade level; the score just describes the level of the test they are performing to. For example, a 3rd grader who gets a score of 5.5 in reading: Does not mean the student can do work at a 5th grade level Does not mean the student should be promoted to 5th grade Does mean the 3rd grader understands the reading skills covered on the test as well as a fifth grader at midyear

grade equivalent scores do not mean

cognitive assessments

intellectual operations

Psychomotor assessments

large and small muscles

the larger the Standard Error of Measurement

the larger the standard deviation

Validity

the most fundamental consideration in developing & evaluating tests, should be determined by the intended use of the test

Reliability

General notion of score consistency across instances of testing •Traditional reliability coefficient, represented by r •Indicators of classification consistency •Standard error of measurement A correlation coefficient (r) indicates the strength of a linear relationship and ranges from -1 to 1

matching pros and cons

•Advantages: compact, efficient, easy to construct, easy to score •Disadvantages: encourage memorization of low-level factual information; students only have to select the correct response

multiple choice

•Advantages: very common type of item; can measure knowledge or skill at a higher level; answers can differ by relative correctness •Disadvantages: students only need to recognize the correct answer

Alignment

•Do the tests "properly measure" students' status with respect to the curricular targets? Groups of assessment/curricular specialists have sprung up since the No Child Left Behind Act

No Child Left Behind (NCLB)

•Enacted in 2002 •Eighth reauthorization of ESEA •Focuses on evaluating the progress of all students Dominant function is accountability Changed the way teachers view assessment •Scheduled for revision since 2009 Focus has shifted

Evidence based on a test's internal structure

•Evidence that a test's items measure the number of constructs it intends to measure A construct is an underlying trait that is responsible for some observable behavior E.g., A test designed to measure one construct, overall mathematical ability, measures that one construct Most often used by those who create and use psychologically-focused tests

consequential validity

•Information about the consequences of assessment use •Are the uses of the assessment results valid? •Evaluate the effects of use of assessment results on teachers and students

Standard Error of Measurement

•Measures consistency of an individual's score The lower the SEM, the more consistent the scores Interpreted in a manner similar to sampling error •Estimates the amount of variability in an individual's score if the test were administered to the individual many times Calculated using the standard deviation and the reliability of the test
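
A short sketch of the classical test theory relationship SEM = SD x sqrt(1 - reliability); the scores and the reliability coefficient below are hypothetical.

```python
import math
import statistics

# Hypothetical test scores and an assumed reliability coefficient (e.g., coefficient alpha).
scores = [72, 85, 91, 68, 77, 85, 90, 60, 83, 79]
reliability = 0.88

sd = statistics.pstdev(scores)            # standard deviation of the observed scores
sem = sd * math.sqrt(1 - reliability)     # SEM = SD * sqrt(1 - reliability)
print(f"SD = {sd:.1f}, SEM = {sem:.1f}")
# A band of roughly +/- 1 SEM around an observed score gives a rough consistency interval.
```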

scale scores

•Relative interpretation •Arbitrarily chosen scale to represent student performance •Converted raw scores •Often used to describe group test performances at the state, district, and school level •Can be used to make direct comparisons between groups •Useful for developing equally difficult forms of the same test

Essay

•Response is of a paragraph or more in length •Measure student's ability to synthesize, evaluate, and compose •Typically used to measure complex learning outcomes

Short Answer

•Student responds using a word, phrase or sentence •Can be a response to either a direct question or incomplete statement •Typically measure relatively simple learning outcomes

Assessment

•Systematic ways to get a fix on students' status Embraces alternative ways to evaluate learning outcomes •Broad, nonrestrictive label for the kinds of testing and measuring teachers must do

Decision-Driven Assessment

•Teachers make interpretations about students' status Score-based interpretations Interpretations lead to decisions Decisions should influence instruction •Classroom instruction should focus on decisions made based on assessment

