EDP301 Midterm
bias review panel
1. Assemble a panel of individuals from subgroups that might be adversely affected by the test 2. Give panel members thorough orientation and guided practice 3. Discussions of illustrative practice items enhance panel members' understanding of assessment bias
Statistical Analysis
1. Evidence is gathered on high-stakes tests to be given to large groups of students 2. Potential bias is detected through differential item functioning (DIF) procedures • Items are identified for which subgroup differences in performance exist • Items with DIF that are also judged to be biased are removed from the test
Education for All Handicapped Children Act
Enacted in 1975 in response to the growing need to educate children with disabilities • Judicial rulings required states to provide education for students with disabilities • Also called Public Law 94-142 (P.L. 94-142)
1. Thou shall not provide opaque directions to students regarding how to respond. 2. Thou shall not employ ambiguous statements. 3. Thou shall not provide students with unintentional clues regarding appropriate responses. 4. Thou shall not employ complex syntax in your items. 5. Thou shall not use vocabulary that is more advanced than required.
5 item-writing commandments for selected response
Criterion-referenced measurement
• Reflects the degree to which curricular aims have been mastered • Absolute interpretation • Hinges on the quality of curricular aims
presentation, response, setting, and timing/scheduling
Accommodation Types
Content-related evidence of validity
Adequacy of the assessment's content to measure curricular aims or content standards—knowledge, skills, or attitudes •Concern is the representativeness and relevance of assessment content to content domain •Most important type of evidence for classroom assessments
presentation accommodations
Alternate presentation of material: auditory, multi-sensory, tactile, and visual
Analytic Scoring
Assigns points to each factor • Identifies students' strengths and weaknesses • Ignores overall quality of response
Criterion-referenced measurements
Desire high values of p for all items: post-instruction difficulties (p) approach 1 •Desire low discriminations (D) •Two different general approaches to item analysis are used, similar to those used for norm-referenced tests
response accommodations
Complete activities, assignments, and assessments in different ways • Use assistive devices
Standards for Educational and Psychological Testing
First published in 1966 by the American Educational Research Association (AERA) •Provide guidelines for test development •Detailed mandates for the evaluation of tests and the appropriate uses of tests •Often invoked in legal proceedings •Updated in 2014
matching
Consists of two parallel lists of words or phrases that students match according to some specified association • Entries in the list for which a match is sought are referred to as premises • Entries in the list from which matches are made are referred to as responses
High Stakes Testing
Defined as an assessment where important consequences ride on the results • Federally required accountability tests are high-stakes tests • Decisions regarding students and staff are influenced by the results • NCLB does not require diploma denial or holding back a student; states make those decisions
Internal consistency
Degree of homogeneity among test items • Administer the assessment • Calculate Kuder-Richardson 20 or Cronbach's coefficient alpha
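A minimal sketch (in Python, with hypothetical scores) of how Cronbach's coefficient alpha could be computed from a students-by-items score matrix; with 0/1 item scores the same formula gives KR-20:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's coefficient alpha for a students x items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                               # number of items
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 5 students x 4 items scored 0/1
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 1, 0, 0]])
print(round(cronbach_alpha(scores), 2))
```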
Criterion-related evidence of validity
Degree to which scores predict performance on some criterion variable Compare scores with another measure of performance (criterion variable) E.g., How well does the SAT predict how well a student will do in college? Often a correlation coefficient is used to measure the strength of the association between variables
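A small illustration (hypothetical predictor and criterion scores) of gathering criterion-related evidence by correlating test scores with a criterion variable:

```python
import numpy as np

# Hypothetical data: admission-test scores and first-year college GPA (criterion variable)
predictor = np.array([1100, 1250, 980, 1400, 1180, 1320])
criterion = np.array([3.1, 3.4, 2.6, 3.8, 3.0, 3.5])

# Pearson correlation coefficient: strength of the linear association
r = np.corrcoef(predictor, criterion)[0, 1]
print(round(r, 2))
```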
Content Standards
Describe the knowledge or skills students need to learn • Most useful standards • Also called "academic content standards"
To construct and evaluate classroom assessments • To use and interpret assessments developed by others • To plan instruction based on instructionally illuminating assessments • To learn the vocabulary of assessment
Education Students must know
Item-discrimination indices procedure Order papers from high to low by total score Divide papers into two groups—a high group and a low group Calculate the p value for each item for the high group and low group • Divide # of students in the high group who answered the item correctly by the # of students in the high group • Repeat for the low group • The discrimination index D is the high group's p minus the low group's p
Empirically Based Item Improvement
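A minimal sketch of the procedure above, splitting hypothetical 0/1 response data into upper and lower halves; discrimination D is the high-group p minus the low-group p:

```python
import numpy as np

def difficulty_and_discrimination(responses):
    """responses: students x items matrix of 0/1 scores."""
    totals = responses.sum(axis=1)                 # each student's total score
    order = np.argsort(totals)                     # student indices, low to high total
    half = len(responses) // 2
    low, high = responses[order[:half]], responses[order[-half:]]
    p = responses.mean(axis=0)                     # item difficulty: proportion correct overall
    d = high.mean(axis=0) - low.mean(axis=0)       # item discrimination: p(high) - p(low)
    return p, d

# Hypothetical data: 6 students x 3 items, 1 = correct
responses = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 1, 1],
                      [0, 0, 0],
                      [1, 1, 1],
                      [0, 1, 0]])
p, d = difficulty_and_discrimination(responses)
print(p.round(2), d.round(2))
```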
alternate-forms
Equivalency of two forms of an assessment Administer two forms of an assessment to the same group Calculate the correlation coefficient between the two sets of responses
•Advantages Measure complex learning outcomes Measure more than one outcome •Disadvantages Difficult to write the item properly Difficult to score reliably
Essay Pro Con
Evidence Based on Response Process
Evidence typically comes from analyses of different test-takers' responses during the test If measuring logical reasoning abilities, it's important to know if the students are relying on logical reasoning processes as they complete the test • E.g., Questions could be posed at the conclusion of a test to determine the procedures employed by the test takers
Distractor Analysis
Examination of the incorrect options (distractors) for multiple-choice or matching items Determines how high- and low-group students are responding to an item's distractors Calculate difficulty (p) and discrimination (D) for each response alternative Review values for distractors • Are any p values too high or too low? • Are any D values negative? • Are there any patterns in responses that indicate modifications should be made?
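A brief sketch of tallying choices for one hypothetical multiple-choice item (correct answer C); a distractor chosen more often by the high group than the low group would warrant review:

```python
# Hypothetical responses of high- and low-scoring groups to one item (correct answer: "C")
high_group = ["C", "C", "B", "C", "C", "A"]
low_group = ["B", "A", "C", "B", "D", "B"]

for option in ["A", "B", "C", "D"]:
    p_high = high_group.count(option) / len(high_group)
    p_low = low_group.count(option) / len(low_group)
    d = p_high - p_low
    # For the keyed answer, D should be positive; a distractor with positive D needs review
    print(f"{option}: p_high={p_high:.2f} p_low={p_low:.2f} D={d:+.2f}")
```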
IEP
Federally prescribed in P.L. 94-142 •Developed by parents, teachers, and specialists •Describes how a particular child with disabilities should be educated •Identifies annual curricular aims and specifies how they will be achieved •Specifies services needed by the child •Identifies assessment modifications
Testing
Final exams, midterms, quizzes • How much did students learn? • Historically paper-and-pencil
1. Be clear about the nature of the intended interpretation of scores as they relate to the decisions 2. Come up with propositions that must be supported if those interpretations are going to be accurate 3. Collect relevant evidence 4. Synthesize the whole works into a convincing validity argument
Generating a compelling validity argument
1. Score responses holistically and/or analytically. 2. Prepare a tentative scoring key in advance of judging students' responses. 3. Make decisions regarding the importance of the mechanics of writing prior to scoring. 4. Score all responses to one item before scoring responses to the next item. 5. Insofar as possible, evaluate responses anonymously.
Guidelines for Scoring Essays
Must rely on human judgment to determine instructed and uninstructed groups If the two groups are very different (e.g., in intellectual ability), they may differ for reasons other than instruction Can be difficult to isolate the two groups
Instructed/uninstructed group pro con
Performance Standards
Identify the desired level of proficiency for a content standard • Also called "academic achievement standards"
mean
measure of central tendency, Arithmetic average of a set of scores
median
measure of central tendency, Midpoint of a set of scores
Same group—Posttest/Pretest analysis disadvantages Instruction must be completed before securing item analysis Pretest may be reactive • Students can be sensitized to items on the posttest from their experience on the pretest • Posttest becomes a function of the instruction and the pretest
Item Analysis for criterion-referenced measurements
1. If any of the items seemed confusing, which ones were they? 2. Did any items have more than one correct answer? If so, which ones? 3. Did any items have no correct answers? If so, which ones? 4. Were there words in any items that confused you? If so, which ones? 5. Were the directions for the test, or for particular subsections, unclear? If so, which ones?
Item Improvement Questionnaire for students
1. Convey to students a clear idea regarding the extensiveness of the response desired. 2. Construct items so the student's task is explicitly described. 3. Provide students with the approximate time to be expended on each item as well as each item's value. 4. Do not employ optional items. 5. Precursively judge an item's quality by composing, mentally or in writing, a possible response.
Item Writing Guidelines for Essays
1. The stem should consist of a self-contained question or problem. 2. Avoid negatively stated stems. 3. Do not let the length of the alternatives supply unintended clues. 4. Randomly assign correct answers to alternative positions. 5. Never use "all-of-the-above" alternatives, but do use "none-of-the-above" alternatives to increase item difficulty.
Item writing for multiple choice
1. Usually employ direct questions rather than incomplete statements, particularly for young students. 2. Structure the item so that a response should be concise. 3. Place blanks in the margin for direct questions or near the end of incomplete statements. 4. For incomplete statements, use only one or, at most, two blanks. 5. Make sure blanks for all items are equal in length.
Item-Writing Guidelines for Short-Answer Items
1. Phrase items so that a superficial analysis by the student suggests a wrong answer. 2. Rarely use negative statements, and never use double negatives. 3. Include only one concept in each statement. 4. Have an approximately equal number of items representing the two categories being tested. 5. Keep item length similar for both categories being tested.
Item-writing binary response
Provide a colleague with a brief description of the previously mentioned criteria • Describe the key inference intended by the test • Particularly useful with performance and portfolio assessments • Offer to return the favor
Judgement by Colleague
Can be biased • Allow time between test construction & review Consider these five review criteria 1. Do the items adhere to the guidelines and rules specified by the text? 2. Do the items contribute to score-based inferences? 3. Is the content still accurate? 4. Are there gaps in the content (lacunae)? 5. Is the test fair to all?
Judgement by self
Typically an overlooked group of reviewers Students review after completing the test Particularly useful with • Problems with items and directions • Time allowed for completion of the test Use a questionnaire to collect data Expect carping from low-scoring students
Judgement by students
test-retest, alternate forms, internal consistency
Measuring Reliability
Elementary and Secondary Education Act (ESEA)
Most significant federal legislation •Enacted in 1965 •Reauthorized every two to eight years •Focused on evaluating the progress of underserved students
range
measure of variability, Highest score in the set of scores minus the lowest score
group focused test interpretation
Necessary to describe the performance of a group Measures of central tendency and variability: • Mean • Median • Range • Standard deviation
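A quick sketch of the four group-focused statistics, using a hypothetical set of class scores:

```python
import statistics

# Hypothetical test scores for one class
scores = [72, 85, 91, 68, 77, 85, 90, 60, 88, 79]

print("mean:", statistics.mean(scores))                # central tendency: arithmetic average
print("median:", statistics.median(scores))            # central tendency: midpoint of the scores
print("range:", max(scores) - min(scores))             # variability: highest minus lowest score
print("std dev:", round(statistics.stdev(scores), 2))  # variability: spread around the mean
```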
Performance on accountability tests influences the public's perceptions of educational effectiveness • Federal initiative to use students' scores to evaluate teachers • Tests should not be instructional afterthoughts
New Reasons for Assessment
Curricular Aims
Play a prominent role in assessment choice • Curriculum = the sought-for ends of instruction • Instruction = the means teachers employ • Teachers need to know the curricular labels of their locale • Curricular aims need to be stated clearly and measure what's really important
external reviews
Possible developmental activities to enhance a high-stakes chemistry test's content representativeness 1. External experts (judges) review content 2. Focus is on the match of the assessment to the content standards 3. E.g., State dept. of education officials construct a statewide assessment 4. A panel of 20 content reviewers (subject- matter experts) considers the test's items
developmental care
Possible developmental activities to enhance a high-stakes chemistry test's content representativeness 1. Panel of content experts makes recommendations 2. The proposed content is systematically contrasted with 5 leading textbooks 3. A group of high school chemistry teachers provides suggestions 4. A group of college professors and state/national associations review the content and offer recommendations and modifications
Accommodations
Procedures to minimize assessment bias in students with disabilities •Practice that permits students with disabilities to have equitable access to instruction and assessment •Goal is to reduce or eliminate distortions in score inferences due to the disability •Must not fundamentally alter the skills or knowledge being assessed
Assessment Bias
Qualities of an assessment that distort students' performance because of characteristics of the students •Characteristics generally refer to group-defining characteristics: gender, ethnicity, socioeconomic status, religion •Two forms of bias: offensiveness and unfair penalization
Elicited responses more closely approximate "real-world" behavior • Seldom is one asked in real life to choose responses from 4 nicely arranged alternatives or give a true-false judgement to a statement! Items typically measure higher-level knowledge and skills
Reasons for Constructed Response
percentiles
Relative interpretation •Compares a student's score with those of other students in the norm group Nationally normed groups Locally normed groups Indicates % of students in the norm group that the student outperformed A percentile of 60 indicates the student performed better than 60% of students in norm group •Most frequently used relative score •Easy to understand •However, their usefulness relies on the quality of norm group
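A tiny illustration (hypothetical norm-group scores) of how a percentile reflects the percent of the norm group a student outperformed:

```python
# Hypothetical norm-group scores and one student's score
norm_group = [48, 52, 55, 57, 60, 61, 63, 66, 70, 75]
student_score = 64

# Percentile: percent of norm-group students the student outperformed
percentile = 100 * sum(s < student_score for s in norm_group) / len(norm_group)
print(f"{percentile:.0f}th percentile")  # 70th: outperformed 7 of 10 norm-group students
```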
grade-equivalent form
Relative interpretation •Estimates student performance using grade level and months of the school year •It is a developmental score as it represents a continuous range of grade levels •The score is an estimate of the grade level of a student who would obtain that score on that particular test •Most appropriate for basic skills •Two assumptions are implied: the subject area tested is equally emphasized at each grade level, and student mastery increases at a constant rate
IDEA
Required curricular expectations for special education students to be consonant with the expectations for all students •Required the inclusion of students with disabilities in assessment programs and the public reporting of results •Few negative consequences for noncompliance, therefore most states failed to comply
Instruction must be completed before securing item analysis Pretest may be reactive • Students can be sensitized to items on the posttest from their experience on the pretest • Posttest becomes a function of the instruction and the pretest
Same group pre and post test disadvantages
relative interpretation
Score represents students' relative standing within a norm group (Norm group is a group of students that have taken a particular test)
absolute interpretation
Score represents what student can do Score represents degree of mastery
1. Binary-choice items 2. Multiple binary-choice items 3. Multiple-choice items 4. Matching items
Selected Response Types
Common Core State Standards (CCSS)
Set of identical curricular aims for nation's schools English language arts and mathematics •Federal aid is available to states that adopt the standards •Adopted by most states, however, identical curricular aims are unlikely due to state-level revisions
•Advantages Because responses are produced by the student, partial knowledge is not sufficient •Disadvantages Scoring can be difficult Scoring may result in inaccurate representations of students' abilities
Short Answer Pro Con
test-retest
Stability over time, same test administered twice
Judgmental item-improvement procedures • Human judgment is chiefly used Empirical item-improvement procedures • Based on students' responses
Strategies for Item Improvement
Constructed Response
Student constructs the response • Short answer, essay, speeches, soufflé
Selected Response
Student selects a response from a set of responses provided • True-false, matching, multiple-choice
Validity Evidence
Teachers make scads of decisions on a minute-by-minute basis Classroom assessment can provide reasonably accurate evidence upon which to base those decisions • Observation-based judgements are often off the mark • Three types of evidence: content related, criterion related, construct related
Difficulty indices Discrimination indices Distractor analysis
Technique for Evaluating Items
standardized test
Test administered, scored, & interpreted in a standard, predetermined manner •Designed to yield either norm-referenced or criterion-referenced inferences •Constructed primarily of selected-response items •Staggering differences in the level of effort associated with it compared to a classroom test
Classification-Consistency
Test results are used to classify test-takers into categories (e.g., Pass-fail) •A type of test-retest reliability •Measures consistency of the classification
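A minimal sketch (hypothetical pass/fail results) of measuring classification consistency as the proportion of test-takers classified the same way on both administrations:

```python
# Hypothetical pass/fail classifications from two administrations of the same test
first = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
second = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

# Classification consistency: proportion of test-takers placed in the same category both times
agreement = sum(a == b for a, b in zip(first, second)) / len(first)
print(f"classification consistency = {agreement:.2f}")  # 6 of 8 agree -> 0.75
```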
response processes validity
The degree to which the cognitive processes test-takers employ during a test support an interpretation for a specific test use
Computer-Adaptive Assessment
The difficulty of the next item a student is given depends upon the student's ability to answer the previous question correctly Student's mastery can be determined with fewer items Provides a general fix on student status • Precludes the possibility of providing student-specific diagnostic data
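A simplified sketch of the adaptive idea described above: after a correct answer the next item is harder, after an incorrect one it is easier. The item bank, step rule, and answer pattern here are hypothetical simplifications, not a real CAT algorithm:

```python
# Hypothetical item bank: item difficulties ordered from easy to hard
item_bank = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

def run_adaptive_test(answers_correct):
    index = len(item_bank) // 2  # start with a medium-difficulty item
    for correct in answers_correct:
        if correct:
            index = min(index + 1, len(item_bank) - 1)  # harder item next
        else:
            index = max(index - 1, 0)                   # easier item next
    return index  # rough fix on the student's level with relatively few items

print(item_bank[run_adaptive_test([True, True, False, True])])
```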
test content validity
The extent to which an assessment procedure adequately represents the content of the curricular aims being measured
internal structure validity
The extent to which the internal organization of a test confirms an accurate assessment of the construct supposedly being measured
Measure students' current status, monitor student progress to determine if instructional changes need to be made, assign grades, determine instructional effectiveness
Traditional Reasons for Assessment
Judgemental item improvement
Use human judgment to improve test items •Three sources of judgments: self, colleagues, and students
Holistic Scoring
• Focuses on the entire response as a whole • Reflects overall performance For scoring a composition intended to reflect students' writing prowess: • Organization • Communicative Clarity • Adaptation to Audience • Word Choice • Mechanics (spelling, capitalization, punctuation)
1. Decision focus—Are clearly explicated decision options directly linked to a test's results? 2. Number of assessment targets—Is the number of proposed targets for a test sufficiently small so they represent an instructionally manageable number? 3. Assessment domain emphasized—Will the assessments to be built focus on the cognitive domain, the affective domain, or the psychomotor domain? 4. Norm-referencing and/or criterion-referencing—Will the score-based inferences to be based on students' test performances be norm-referenced, criterion referenced, or both? 5. Selected versus constructed response mode—Can students' responses be selected-responses, constructed-responses, or both? 6. Relevant curricular configurations—Will students' performance on the assessment contribute to mastery of state's officially approved curricular aims? 7. National subject-matter organization recommendations—Are the knowledge, skills, and/or affect assessed in line with the curricular recommendations of national subject-matter organizations? 8. NAEP assessment frameworks—If applicable, is what the assessment measures similar to what is measured by NAEP? 9. Collegial input—If available, has a knowledgeable colleague reacted to the proposed assessment?
What to Assess Considerations
Affective assessments
attitudes, interests, values
distractor analysis
calculated by examining the proportions of students choosing the incorrect options (distractors)
Item difficulties
calculated for each item using proportion correct
item discrimination
calculated for each item as the difference between the proportions correct in the high- and low-ability groups
Norm-referenced measurement
• Interpret performance in relation to the group, the norm • Relative interpretation • Used with aptitude or standardized achievement tests
Disparate Impact
does not necessarily = assessment bias If a test has a disparate impact on a particular racial, gender, or religious subgroup, then close scrutiny is warranted Not biased if true differences in ability exist Bias is present if the test offends or unfairly penalizes members of a subgroup
That a student is above or below their actual grade level; it just describes the level of the test they are performing to. For example, a 3rd grader who gets a score of 5.5 in reading: Does not mean the student can do work at a 5th grade level Does not mean the student should be promoted to 5th grade Does mean the 3rd grader understands the reading skills covered on the test as well as a fifth grader at midyear
grade equivalent scores do not mean
cognitive assessments
intellectual operations
Psychomotor assessments
large and small muscles
the larger the Standard Error of Measurement
the larger the standard deviation
Validity
the most fundamental consideration in developing & evaluating tests, should be determined by the intended use of the test
Reliability
• General notion of score consistency across instances of testing • Traditional reliability coefficient, represented by r • A correlation coefficient indicates the strength of a linear relationship, ranging from -1 to 1 • Other indicators: classification consistency and the standard error of measurement
matching pro's and con's
•Advantages Compact Efficient Easy to construct Easy to score •Disadvantages Encourage memorization of low-level factual information Only have to select correct response
multiple choice
•Advantages Very common type of item Measure knowledge or skill at a higher level Answers can differ by relative correctness •Disadvantages Only need to recognize the correct answers
Alignment
•Do the tests "properly measure" students' status with respect to the curricular targets? Groups of assessment/curricular specialists have sprung up since the No Child Left Behind Act
No Child Left Behind (NCLB)
•Enacted in 2002 •Eighth reauthorization of ESEA •Focuses on evaluating the progress of all students Dominant function is accountability Changed the way teachers view assessment •Scheduled for revision since 2009 Focus has shifted
Evidence based on a test's internal structure
•Evidence that a test's items measure the number of constructs it intends to measure A construct is an underlying trait that is responsible for some observable behavior E.g., A test designed to measure one construct, overall mathematical ability, measures that one construct Most often used by those who create and use psychologically-focused tests
consequential validity
•Information about the consequences of assessment use •Are the uses of the assessment results valid? •Evaluate the effects of use of assessment results on teachers and students
Standard Error of Measurement
•Measures consistency of an individual's score The lower the SEM the more consistent the scores Interpreted in a manner similar to sampling error •Estimates the amount of variability in an individual's score if the test were administered to the individual many times •Calculated using the test's standard deviation and reliability
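A short worked example using the standard formula SEM = SD * sqrt(1 - reliability); the numbers are hypothetical:

```python
import math

# Hypothetical values for one test
reliability = 0.91           # e.g., an internal-consistency estimate
standard_deviation = 12.0    # standard deviation of observed scores

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = standard_deviation * math.sqrt(1 - reliability)
print(round(sem, 2))         # 3.6: lower reliability or a larger SD would increase the SEM

# Roughly 68% of the time an observed score falls within 1 SEM of the student's true score
observed_score = 75
print(f"band: {observed_score - sem:.1f} to {observed_score + sem:.1f}")
```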
scale scores
•Relative interpretation •Arbitrarily chosen scale to represent student performance •Converted raw scores •Often used to describe group test performances at the state, district, and school level •Can be used to make direct comparisons between groups •Useful for developing equally difficult forms of the same test
Essay
•Response is of a paragraph or more in length •Measure student's ability to synthesize, evaluate, and compose •Typically used to measure complex learning outcomes
Short Answer
•Student responds using a word, phrase or sentence •Can be a response to either a direct question or incomplete statement •Typically measure relatively simple learning outcomes
Assessment
•Systematic ways to get a fix on students' status Embraces alternative ways to evaluate learning outcomes •Broad, nonrestrictive label for the kinds of testing and measuring teachers must do
Decision-Driven Assessment
•Teachers make interpretations about students' status Score-based interpretations Interpretations lead to decisions Decisions should influence instruction •Classroom instruction should focus on decisions made based on assessment