Intro to Assessment


ITEM RELIABILITY

The extent to which we can generalize to different samples of items.

Alternate-Form Reliability: represents the correlation between scores for the same individuals on two different forms of a test.
● The forms measure the same trait or skill to the same extent and are standardized on the same population.
● Offers equivalent (but not identical) tests.
● The means and variances for the alternate forms are assumed to be (or should be) the same.
● Potential problems:
○ Can be difficult to create alternate forms.
○ If the interval between administrations is too long, error will be inflated.
● How it's done: A large sample of students is tested with two forms. Half the subjects receive form A then form B; the other half receive form B then form A. Scores from the two forms are correlated, and that correlation is the reliability coefficient.

Internal Consistency: a measure of the extent to which items in a test correlate with one another.
● Does not require two or more test forms.
● Limitations:
○ Should not be used with timed tests.
○ Does not inform us about stability over time.

To what extent do both halves measure the same trait?
● Split-Half Reliability Estimate: involves creating two equivalent halves of an assessment instrument.
○ How it's done: Suppose a test has 10 items. After giving the test, we divide it into two 5-item tests by summing the even-numbered items and the odd-numbered items for each student. This creates two alternate forms of the test, each containing one half of the total number of items. We can then correlate the sums of the odd-numbered items with the sums of the even-numbered items to obtain a reliability estimate.
○ Pearson's r will underestimate the reliability of the full test, so r is corrected using the Spearman-Brown formula.

To what extent do all the items of the instrument (and/or subscale or subtest) measure the same trait?
● Coefficient (Cronbach's) Alpha: the average split-half correlation based on all possible divisions of a test into two parts.
○ In practice there is no need to compute all possible correlation coefficients; coefficient alpha can be computed from the variances of the individual test items and the variance of the total test scores.
○ Can be used when test items are scored pass/fail or when more than one point is awarded for a correct response.
○ Helpful because you examine each item individually, so you can see if one item in particular is not measuring the trait well and remove that item to increase reliability.
● Kuder-Richardson (KR-20): coefficient alpha for dichotomously scored test items (items that can only be scored right or wrong).
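To make the mechanics concrete, here is a minimal Python sketch of the split-half estimate with the Spearman-Brown correction, and of coefficient alpha (which equals KR-20 for pass/fail items). The student-by-item data are invented for illustration:

```python
# A minimal sketch of the internal-consistency estimates described above,
# assuming `scores` is a (students x items) array of item scores.
import numpy as np

scores = np.array([  # hypothetical pass/fail data: 6 students, 10 items
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
])

# Split-half: correlate odd-item sums with even-item sums...
odd = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10
r_half = np.corrcoef(odd, even)[0, 1]

# ...then correct with Spearman-Brown, because Pearson's r for two
# half-tests underestimates the reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)

# Coefficient alpha from item variances and total-score variance
# (for dichotomous items like these, this is also KR-20).
k = scores.shape[1]
item_var = scores.var(axis=0, ddof=1)
total_var = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

print(f"split-half r = {r_half:.2f}, "
      f"Spearman-Brown corrected = {r_full:.2f}, alpha = {alpha:.2f}")
```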

Nominal

● Assignment of a number or a name to designate mutually exclusive categories that represent different characteristics.
● The weakest and least precise scale of measurement.
● If numbers are used, adjacent values have no inherent relationship with one another; there is no implied rank ordering in the numbers.
● Because of these limitations, nominal scales are not often used in assessment instruments.
● Use them when we have no choice (e.g., gender, diagnostic categories) or when a more precise measurement is not needed (e.g., a student's preference for a reward).
● All we can do with nominal scales is determine the frequency of each value; they do not indicate level.
○ Ex. Favorite flavor of ice cream: we can count frequencies within these categories, but it is nonsensical to calculate a statistical mean within or across categories. We may know that someone prefers strawberry ice cream, but that alone does not tell us how much he/she likes it.

RELIABILITY COEFFICIENT

● Item Reliability
○ Alternate-Form Reliability
○ Internal Consistency
■ Split-Half Reliability Estimate
■ Coefficient (Cronbach's) Alpha
■ Kuder-Richardson (KR-20)
● Stability
○ Test-Retest Reliability
● Inter-observer Agreement
○ Correlational Approach
○ Percentage of Agreement Approach
■ Simple Agreement
■ Point-To-Point Agreement
■ Agreement of Occurrence
■ Cohen's Kappa

Ratio Scales

● Have all the characteristics of interval scales and also have an absolute zero.
○ Ex. distance, speed, weight, height.
● An absolute zero permits the calculation of a ratio.
○ Ex. someone is twice as tall as someone else.
● Psychology and education rarely make use of ratio scales; it is difficult to determine when someone has a complete lack of something. There is no absolute zero in intelligence tests.
○ Ex. if person X scored 60 on an intelligence test and person Y scored 120, we cannot infer that person Y is twice as intelligent as person X.

Assessment as a Problem Solving Process

● Problem-solving is a systematic decision-making process for understanding and intervening with problems.
● The process as applied to the practice of psychology has adapted its basic concepts from the process of scientific research.
○ In research, the scientific method has been useful for improving our understanding of "the world" and developing new technologies.
○ It is a way of thinking about problems and an important habit of mind to cultivate.
○ As professionals (not technicians), we are obligated to adapt the process and findings of research to practice.
○ What is the student need that we are trying to address? Everything follows from how you see the initial problem.
● Scientific Method:
○ Recognize that there is a problem.
○ Clarify the problem: what questions should be answered?
○ Develop a hypothesis and a plan.
○ Based on the data collected, decide on the accuracy of the hypothesis.
○ Interpret and generalize the findings (results).
● Example of the problem-solving process as applied to psychological assessments:
○ Problem clarification: analyze the referral and, if needed, reframe the presenting problem. Are there clear questions? What information do I need to answer the questions? If you don't clearly understand the concerns, you can't pick the method.
○ Planning: develop possible hypotheses about the concerns/needs. Specify the goals of the assessment.
○ Development: identify data collection methods and resources for implementing the assessment.
○ Implementation: conduct the assessment.
○ Outcome determination: did I obtain the information to answer the referral questions?
○ Dissemination: communicate useful information to those concerned with the assessment. Procedures are also specified for a follow-up.

The Ecological Context of Assessment: in interpreting and reporting the results of assessments, to what extent has the assessor taken into account the influences of:
● Family
● Culture
● Language
● Community
● School setting
● Classroom setting
● Curriculum
● Teacher's skills: teaching, behavior management, interpersonal
● Peers
● Developmental history
● Physical abilities and overall health
● Prescription and non-prescription drugs

Concerns of Assessors:
● Accuracy
○ Did the assessor correctly categorize the behavior?
○ Can be compromised when:
■ The behavior is difficult to observe
■ We attempt to observe too many behaviors at one time
■ Decision rules are too complex or are unclear
● Generalizability
○ Is the behavior being assessed representative of the student's functioning?
○ Three types of generalizability:
■ To a larger domain
■ To other times
■ To other settings
● Meaning
○ What interpretations/inferences are valid?
○ How much of the performance should be attributed to culture as opposed to ability?
○ Is there a mismatch between the curriculum and the content on the test?
○ Did a disability interfere with the student's performance?
○ How diverse was the standardization sample?
● Utility
○ How useful is the assessment information in making decisions?
○ Efficiency: speed and cost of data collection.
■ Example: group vs. individually administered tests
● Sensitivity
○ The ability of the instrument to detect small differences between students, and within the same student at different times.
○ The latter aspect of sensitivity is particularly important in monitoring a student's progress.

Social Validity of Assessment: access to and satisfaction with the assessment.
● Parents and students: access and disposition
● Assessors: ease of administration and utility
● Administrators: money and risk management
● Teachers: utility in the classroom and time

Important Assessment Concepts to Understand (ch. 1):
● Level vs. rate of progress
○ Instructional decision making is best informed by knowing both a student's current level of performance and their rate of progress.
○ Determination of performance level can be important for making decisions about what to teach, as well as deciding whether a student has mastered a skill. But information on rate of progress is needed to know whether instruction is actually effective.
● Different decisions often require different data
○ Decisions may have major implications for students. They should be informed by data that are collected carefully over time and that have strong evidence of reliability (measure consistently) and validity (measure what they propose to measure). Although data with strong technical characteristics are desirable, they are not always necessary, especially when time is important.
● Different methods may be needed for different students
○ Characteristics of how the test is administered, how those being tested are expected to provide their responses, and characteristics of the norm group to whom students are compared may influence the extent to which a given test is appropriate for a particular student.
○ One must be careful to either identify and use tests that are more appropriate for students with the given characteristics, consider accommodations that might make the test more appropriate for the given student, or use alternative methods of assessment.
● Different skills often require different methods
○ Four primary methods are used for collecting data on students' academic and social-emotional skills: record review, interview, observation, and testing.
● Only present behavior is observed
○ When students take tests, we only observe what they do. We never observe what they can do.
● High- and low-inference decision making
○ Inference making is particularly evident and potentially problematic when (a) there are only a few items or tasks that sample a particular behavior or skill of interest, and (b) the skills needed to complete the items and tasks do not adequately reflect the skills targeted for measurement.
○ One should avoid assessment tools that require a high level of inference making; they may misrepresent the student's actual skills and lead to conclusions that are not helpful for informing instruction.
● Accuracy in collecting, scoring, interpreting, and communicating assessment information
○ Even before data are collected, it is important to clarify how they will be used and to ensure that the use of the given data for the given purpose is justified.
● Fairness is paramount
○ Choose tests that are technically adequate and relevant to improved instructional outcomes, always taking into account the nature of students' social and cultural backgrounds, learning histories, and opportunities to learn, and always being sensitive to individual differences and disabilities.
● Assessment that matters
○ Assessment that is related to and supports the development of effective interventions is worthwhile and clearly in the best interests of individuals, families, schools, and communities.
● Assessment practices are dynamic.

E) Progress Monitoring

● Progress monitoring asks: is the student making adequate progress toward individual goals and state or Common Core standards?
● Regardless of tier, we must ask: is the student making progress, and how do we adapt our teaching?
● Helps us determine whether the student needs to catch up.
● Data are collected for the purpose of making decisions about what to teach and the level at which to teach it.
● Individualized education programs (IEPs) provide the individual goals.
● Common Core State Standards apply to everyone (students with significant disabilities might have separate goals set by the state).
● If tests are used, there should be a close relationship between the test items (content) and what is being taught (curriculum).
● A distinct feature of progress-monitoring methods is that educators evaluate student performance on material that represents the skills or general outcome that students should achieve. Teachers should be sure that the measures represent overall student proficiency in the curriculum.
○ Ex. Oral reading fluency, the number of words read correctly in 1 minute, serves as a robust indicator of overall reading achievement.

TECHNICAL CONSIDERATIONS

● Sampling Plan: finding communities of specific sizes within geographic regions.
○ Cluster Sampling: urban areas and the surrounding suburban and rural communities are selected. Such sampling plans have the advantage of requiring fewer testers and less travel.
○ Selection of Representative Communities: a representative community is usually defined as one in which the mean demographic characteristics (such as educational level and income) of residents are approximately the same as the national or regional average.
○ Neither of the above guarantees that the participants as a group are representative of the population.
● Test authors may adjust norms to make them representative:
○ Oversample subjects (select many more subjects than are needed) and then drop subjects until a representative sample has been achieved.
○ Weight subjects within the normative sample differentially. Subjects with underrepresented characteristics may be counted as more than one subject, and subjects with overrepresented characteristics may be counted as fractions of a person (see the sketch after this section).
○ Assessment developers often use "minor" smoothing of norms to approximate a normal distribution. Minor smoothing is believed to result in better estimates of derived scores, means, and standard deviations. It can involve:
■ Dropping outliers
■ Ensuring consistent progression of means based on age
■ Transforming areas of the distribution by assigning a standard score on the basis of the relationship between standard scores and percentiles found in normal distributions
● It is time consuming and expensive to establish norms.
● Most tests have multiple normative samples that are collectively referred to as the norm sample.
○ The sample in each norm group should be large enough to guarantee stability, represent infrequent characteristics (e.g., specific learning disabilities), and have a full range of derived scores.
○ 100 participants in each normative group are considered the minimum.
○ It is crucial that the various kinds of people in each normative sample be included in the same proportion as they occur in the general population.
■ Ex. a test might have 2,600 students, but then divide those students by grade, age, gender, etc., so a 4th grader is not compared with all 2,600 students, just the ones in his or her norm group.
● Age norms must be updated:
○ The sample must represent the current population, since levels of skill and ability change over time.
○ Intellectual and educational performances have increased from generation to generation, although these increases are neither steady nor linear.
○ Old norms tend to mistakenly overestimate a student's relative standing in the population, because the old norms are too easy.
○ Norms should be updated at least:
■ Every 7 years for achievement and intelligence tests
■ Every 15 years for other assessments
○ Updating norms without updating content can be problematic, since general knowledge changes from generation to generation.
■ Ex. Massive IQ gains over time: "Data on IQ trends now exist for 30 nations. Gains differ as a function of the degree of modernity that characterizes different nations. For nations that were fully modern by the beginning of the 20th century, IQ gains have been on the order of 3 points per decade. Nations that have recently begun to modernize, such as Kenya and the Caribbean nations, show extremely high rates of gain."
● Crystallized Intelligence (gC): the individual's store of knowledge about the nature of the world and learned operations, such as arithmetical ones, which can be drawn on in solving problems.
● Fluid Intelligence (gF): the ability to solve novel problems that depend relatively little on stored knowledge, as well as the ability to learn.
● In recent decades, IQ point gains have been much greater for tasks requiring fluid abilities (about 18 points) than for tasks requiring crystallized abilities (about 10 IQ points).
● Specialized Norms: refers to all comparisons that are not national.
○ Sometimes educators are interested in comparing a test taker to a particular subgroup of the national population.
○ Local Norms: based on an entire state, school district, or even a classroom.
○ Special Population Norms: based on personal characteristics or attainment.
○ Growth Norms: used to assign percentiles and standard scores to differences in scores from one test to another.
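As a rough illustration of the differential-weighting idea above, here is a hedged Python sketch. The scores and weights are entirely invented, and real norming programs use far more elaborate procedures; the point is only that weights make some subjects count as more or less than one person:

```python
# Hypothetical norm-sample weighting: underrepresented subjects get
# weights > 1, overrepresented subjects get weights < 1.
import numpy as np

raw_scores = np.array([95, 102, 88, 110, 99, 105, 92, 101])
weights    = np.array([1.4, 0.8, 1.4, 0.8, 1.0, 0.8, 1.4, 1.0])

weighted_mean = np.average(raw_scores, weights=weights)
weighted_var = np.average((raw_scores - weighted_mean) ** 2, weights=weights)
print(f"weighted mean = {weighted_mean:.1f}, "
      f"weighted SD = {weighted_var ** 0.5:.1f}")
```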

RELEVANT CHARACTERISTICS:

● Sex: On most psychological and educational tests, gender differences are small, and the distributions of scores of males and females overlap considerably.
○ Females and males differ in terms of physical development, maturation, and gender role expectations.
○ Norm groups should contain males and females in the proportion found in the general U.S. population (approximately 48% male and 52% female).
○ Separate norms for males and females are preferred when the use of combined norms leads to misinterpretations and poor decisions.
● Age: Chronological age is clearly related to maturation for a number of abilities and skills, and norms frequently use age groups of one year (however, at younger ages we want norms for narrower age ranges, such as months, because developmentally young children change faster).
● Grade: All achievement tests should measure learned facts and concepts that have been taught in school.
○ The most useful norm comparisons are usually made to students in the same grade, regardless of their ages (students of different ages can be within the same grade).
● Geography: There are systematic differences in the attainment of individuals living in different geographic regions.
○ Community size, population density (urban, suburban, rural communities), and gains or losses of population have also been related to academic and intellectual development.
○ E.g., average scores of individuals living in the southeastern US (excluding Florida) are often lower than the average scores of individuals living in other regions.
○ Test norms should include individuals from all geographic regions, as well as from urban, suburban, and rural communities.
● Race/Ethnicity: particularly important because:
○ The scientific and educational communities have often been insensitive and occasionally blatantly racist (e.g., the Stanford-Binet intelligence scale once excluded non-white individuals).
○ Persistent racial differences in tested achievement and ability remain, although these differences continue to narrow.
○ Genetics, environment, interactions between genes and environment, and biased test construction all influence these differences.
● Acculturation of Parents: Acculturation is an imprecise concept that refers to an understanding of the language, history, values, and social conventions of society at large.
○ Typically, test authors use the socioeconomic status of the parents (usually a combination of their education and occupation) as a very general indication of the level of acculturation of the home.
○ SES of parents is strongly related to scores on all sorts of tests, including intelligence, achievement, adaptive behavior, and social functioning.
■ Ex. Of all the characteristics of the WPPSI norm sample, parent education had the strongest relationship to all three IQ indexes.
● Intelligence: certainly related to achievement, because most intelligence tests were actually developed to predict school success. Correlations are generally positive but decline as students age.
○ Because language development and facility are often considered an indication of intellectual development, intelligence tests are often verbally oriented. They tend to correlate with scores on tests of linguistic or psycholinguistic ability.

STAGES OF ASSESSMENT

1. Screening
2. Pre-referral
3. Eligibility for Special Education
4. Instructional/Intervention Planning
5. Progress Monitoring/Evaluation
6. Accountability

Scores of Relative Standing:

Facilitate interpretation of assessment results (with respect to a norm group) because their units of measurement retain their meaning across tests and ages.

Percentile Ranks: (percentage of individuals with scores below a particular score) + (1/2 of the percentage of individuals with scores at that score).
● Can be used with either ordinal or interval data.
● Ex. an individual who has a percentile rank of 50 on an academic test has attained a score equal to or better than 50 percent of the individuals who have taken the test.
● Not possible to have a rank of 0 or 100.
● Possible to have decimals.
● Median = 50th percentile rank.
● Deciles:
○ Each decile comprises 10% of the norm group.
○ The first decile ranges from 0.1 to 9.9, and the last from 90.0 to 99.9.
● Quartiles:
○ Each quartile comprises 25% of the norm group.
○ The first quartile ranges from 0.1 to 24.9, and the last from 75.0 to 99.9.

Standard Scores: raw scores that have been transformed so that they have a set (standard) mean and standard deviation.
● Can be used only with interval data.
● Using the frequency distribution of a normal curve, raw scores are transformed into a z score, which is a standard score with a mean of zero and a standard deviation of one.
● Advantages:
○ Common metric for comparison purposes: one standard score can be converted into another.
○ Equal-interval data: standard scores can be used in mathematical computations (e.g., subtraction).
● Disadvantages:
○ Consumers may have difficulty interpreting standard scores.
● Frequently used standard scores:
○ Z-scores:
■ On many assessment instruments, z scores range from −3 to +3.
■ Often transformed into other standard scores in order to eliminate negative numbers.
■ Disadvantages: computations, communication of results.
■ Linear transformation of a z score into a standard score: X' = SD(z) + M, where X' is the person's standard score, and SD and M are, respectively, the standard deviation and mean of the new standard score.
■ Ex. Franklin attained a raw score of 25 in "Attention Problems" on a scale with a mean of 20 and an SD of 5, and a raw score of 30 in "Hyperactivity" on a scale with a mean of 40 and an SD of 10. Convert his two raw scores into standard scores that have a mean of 50 and an SD of 10 (worked in the sketch below).
○ T-scores: standard scores with a mean of 50 and an S of 10.
○ Deviation IQs: have a mean of 100 and an S of 15 (or, less frequently, 16).
○ Normal-curve equivalents: have a mean of 50 and an S of 21.06 (divides the normal curve into 100 equal intervals).
○ Stanines (standard nines): divide the distribution into nine bands.
■ First and ninth bands: everything beyond ±1.75 S.
■ Second through eighth bands: 0.5 S wide.
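Here is the Franklin example worked as a short Python sketch. The helper function name is ours, but the transformation is exactly the X' = SD(z) + M formula from the notes:

```python
# Convert a raw score to a z score, then linearly transform it onto a new
# metric (here mean 50, SD 10, i.e., T scores).
def to_standard(raw, scale_mean, scale_sd, new_mean=50.0, new_sd=10.0):
    z = (raw - scale_mean) / scale_sd  # z score: mean 0, SD 1
    return new_sd * z + new_mean       # X' = SD(z) + M

attention = to_standard(25, scale_mean=20, scale_sd=5)      # z = +1.0
hyperactivity = to_standard(30, scale_mean=40, scale_sd=10) # z = -1.0
print(attention, hyperactivity)  # 60.0 40.0
```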

STABILITY

The consistency of test scores over time.
● When a student learns information and behaviors, we want to be confident that they can access that information and demonstrate those behaviors at times other than when they are assessed.
● We want to generalize today's test results to other times in the future.
● Devices used to assess traits and characteristics must produce sufficiently consistent and stable results if those results are to have practical meaning for making educational decisions.
● The notion of stability excludes changes that occur because the student has been taught new information.
● How it's done: a large number of students are tested and then retested after a short period of time (e.g., two weeks). Scores are then correlated, and the obtained correlation coefficient is the stability coefficient.
● Test-Retest Reliability:
○ The interval between administrations is critical, preferably two weeks.
○ If the interval is too long, error will be inflated and the coefficient will underestimate actual reliability, because students may have learned new things in the meantime.
○ Sensitization problem: if the interval is too short on achievement or cognitive assessments, reliability might be inflated, because students are already used to taking the test and might do better simply from having taken it before.

Reliability:

The extent to which it is possible to generalize from an observation or test score made at a specific time by a specific person to a similar performance at a different time, or by a different person.
● Reliability is a necessary condition for validity. If you are not measuring in a reliable way, you don't even know what you are measuring.
● It is usually impossible to assess an entire domain of information; we attempt to estimate the true score by using a sample of items and by taking into account an estimate of the error present in the measurement.
● Two components of obtained scores:
○ True Score: never known for certain; it is an estimate.
■ What would the individual's score be if all items in the domain were sampled?
■ It is not practical to use all items in a domain. Therefore, we resort to some type of sampling strategy, which can be influenced by error.
○ Error (random noise): lack of generalizability of results.
■ If the sample we obtain is not representative of what we are trying to measure, the score could be mistakenly high or low.

Developmental Scores

Age Equivalent: the student's raw score is the average (e.g., mean) performance for an age group.
● Expressed in years and months (e.g., 5 years and 2 months).

Grade Equivalent: the student's raw score is the average (e.g., mean) performance for a particular grade level (school calendar year).
● Expressed in grades and tenths of a grade (e.g., 6.1 grade level).
● How grade equivalents are determined:
○ A test is given to a large group of students, a norming group, in a specific month of the school year. For example, a test was given in October to a norming group consisting of fifth-grade students. The raw-score average is calculated for these students, and that average is said to be equivalent to the year and month at which students in this grade are functioning. For example, if the average raw score is 175 for the fifth-grade norming group, 175 is set equivalent to fifth grade, first month (October). Any student who subsequently scores 175 will be said to be functioning at the fifth grade, first month (5.1).
○ The same test is also given to two other groups of students, one a year above the grade and one a year below, both in the same month and year as the first group. Averages are then calculated for these groups. For example, the fourth-grade average might be 164 while the sixth-grade average might be 184. The average difference between these two scores and the grade 5 score is then calculated. In this example the average difference is 10 points (the difference between 164 and 175 is 11 points; the difference between 175 and 184 is 9 points; the average of 11 and 9 is 10). It is then concluded that this average difference represents the increase in student scores over one year. Since 175 represents the fifth year, first month, anyone scoring 185 would be said to be functioning at the sixth year, first month; anyone scoring 195 at the seventh year, first month; and so on. This process is referred to as extrapolation: estimating outside of the range for which there is information. This is problematic because growth might not be linear, and extrapolation by itself does not let us make comparisons for students whose scores fall between the above averages.
● Scores are also interpolated, which means estimating what students' grade levels would be between the averages (estimating what their scores would be during the other nine months of the year). Using the example, it would mean determining what grade level is associated with a score of 178, which falls between 175 and 185. In this case it is relatively easy, since an increase of one point is equivalent to an increase of one month (there is a ten-point difference across ten months): the grade level associated with a raw score of 178 would be fifth grade, fourth month (5.4). But it is generally not the case that one point equals one month; that holds only in this example.
● Why do developers of assessment instruments use interpolation and extrapolation instead of measuring a large group of students at every year and month?
○ It would be extremely expensive to recruit a group of children at every year and month to find averages, and usually only school psychologists use this information, so the group of people who need it is not large enough to justify the cost.
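A short Python sketch of the interpolation/extrapolation arithmetic in the example above. The anchor values (175 = grade 5.1, 10 points per school year) come from the text; the function itself is only a toy linearization, which is precisely the weakness the notes warn about:

```python
# Linear grade-equivalent lookup from the worked example: 175 anchors at
# grade 5.1, and 10 raw-score points are assumed to equal one school year
# (ten school months), so one point equals one month here.
def grade_equivalent(raw_score, anchor_score=175.0, anchor_grade=5.1,
                     points_per_year=10.0):
    years = (raw_score - anchor_score) / points_per_year
    return round(anchor_grade + years, 1)

print(grade_equivalent(178))  # 5.4  (interpolated: 3 points = 3 months)
print(grade_equivalent(195))  # 7.1  (extrapolated beyond the tested groups)
```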

FOUR APPROACHES TO ASSESSMENT:

Observations, extant information, standardized tests, recollections.

Variance

A single number that describes the dispersion of scores around the mean.
● Derived from all scores in the distribution.
● Variance is equal to the average squared distance of the scores from the mean:
○ S² = Σ(X − X̄)² / N
○ S² is the variance, Σ is the sum, X is a score, X̄ is the mean, and N is the total number of scores.
● Distances are squared so that scores below the mean don't cancel out scores above the mean (no negatives, no zero sum).
● Variability is helpful when you have a lot of subtests; it is also useful when there is a lot of variance in a course.
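A quick Python check of the definition, using invented scores (this matches the population formula above, dividing by N):

```python
# Variance as defined above: the average squared distance from the mean.
import numpy as np

scores = np.array([85, 90, 100, 100, 110, 115])  # hypothetical test scores
mean = scores.mean()
variance = ((scores - mean) ** 2).mean()  # S^2 = sum((X - mean)^2) / N
sd = variance ** 0.5                      # standard deviation (next section)
print(mean, round(variance, 2), round(sd, 2))  # 100.0 108.33 10.41
```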

Severe Discrepancy Model

Attempts to operationalize the gap between a student's potential and his/her current achievement. If this discrepancy is severe and other causes can be ruled out (e.g., visual impairment), then the student can be said to be learning disabled.
● The 2004 reauthorization of IDEA abolished the requirement that a "severe discrepancy" exist between achievement and ability in order to be classified as learning disabled.
● Concerns about the severe discrepancy approach:
○ IQ is an imperfect indicator of intelligence.
○ IQ and achievement are not independent abilities.
○ The approach is disconnected from what the student is learning in the classroom and has no clear implications for instructional planning.
○ Additional help is provided only when the problem is "severe."
● "Special education is reserved for students who have disabilities that cause difficulty in learning. Therefore, the first area that must be assessed is the area of the suspected disability. Neither federal nor state law is prescriptive on what type of evaluator is qualified to make certain assessments; however, it is clear that evaluators must be trained and knowledgeable in addition to having appropriate certification or license in their field."
○ If the child is given the intervention for the specific difficulty and still doesn't respond, then LD.
● Most education problems begin as discrepancies between our expectations for students and their actual performance. Students may be discrepant academically (not learning as fast as expected), behaviorally (not acting as expected), or physically (not able to sense or respond as expected). (ch. 2)
● The crossover point between a discrepancy and a problem is a function of many factors: the importance of the discrepancy (not knowing how to print a letter vs. forgetting to dot your i), the intensiveness of the discrepancy (a throat-clearing tic vs. shouting obscenities), etc. (ch. 2)

INTEROBSERVER AGREEMENT

The consistency among test scorers.
● We want to assume that if any other comparably qualified examiner were to give the test or make the observation, the results would be the same. Two ways to check this:
● Correlational Approach: similar to estimating reliability with alternate forms. Two testers score a set of tests independently; the scores obtained by each tester for the set are then correlated.
● Percentage of Agreement Approach: more common in classrooms and applied behavior analysis. Instead of a correlation between two scorers' ratings, a percentage of agreement between raters is computed. Three ways (all worked in the sketch below):
○ Simple Agreement: using a time table, you and someone else count how many times an observed behavior occurred over a certain number of observations.
■ Calculated by dividing the smaller number of occurrences by the larger number of occurrences and multiplying the quotient by 100.
■ Tells you how similar the counts are, but not when the behavior happened.
■ Should not be used, since it is not based on each observation.
■ Ex. You and Lou observe a child 20 times to see when he clapped; you said he clapped 13 times, Lou said 11 times. Simple agreement is 100 × (11/13) = 85%.
○ Point-To-Point Agreement: using a time table, you and someone else record when an observed behavior occurred and did not occur, and then look at the observations on which you agree.
■ Calculated by dividing the number of observations for which both observers agree (occurrence and nonoccurrence) by the total number of observations and multiplying the quotient by 100.
■ Can range from 0% to 100%; 80% is usually considered minimally acceptable.
■ More precise because it takes into account each data point.
■ More accurate than simple agreement, but it can overestimate agreement when occurrences are either very high or very low.
■ Ex. You and Lou observe a child 20 times to see when he clapped; there are 14 occasions on which your and Lou's observations were the same (for occurrence and nonoccurrence). 100 × (14/20) = 70%.
○ Agreement of Occurrence: used when occurrences and nonoccurrences differ substantially.
■ Of the observations you and Lou made, this calculation takes into account only the occasions on which the behavior was recorded as occurring; agreements on nonoccurrence are dropped.
■ More accurate than point-to-point when occurrences are either very high or very low.
○ Cohen's Kappa (κ): adjusts for chance agreements better than occurrence agreement. However, it can sometimes underestimate agreement.
■ Two observers can agree because they actually agree or by the chance of guessing when they weren't sure, so kappa corrects for the probability of chance agreement.
■ Answers the question: did these agreements occur at a level beyond mere chance?
■ Kappa can range from −1.0 to +1.0; 0 = chance agreement.
■ The minimally acceptable level for an assessment is usually κ = .60.
■ Most statistical software programs (e.g., SPSS) have a command for computing kappa.
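The sketch below works all four indices in Python on invented interval-by-interval data, constructed so the counts match the You-and-Lou examples in the text (13 vs. 11 occurrences; 14 matching intervals):

```python
# Two observers score 20 intervals: 1 = child clapped, 0 = did not.
import numpy as np

you = np.array([1] * 13 + [0] * 7)                      # 13 occurrences
lou = np.array([1] * 9 + [0] * 4 + [1] * 2 + [0] * 5)   # 11 occurrences

# Simple agreement: smaller count / larger count (ignores WHEN it happened).
simple = 100 * min(you.sum(), lou.sum()) / max(you.sum(), lou.sum())

# Point-to-point: proportion of intervals on which the observers match.
agree = you == lou
point_to_point = 100 * agree.mean()

# Agreement of occurrence: only intervals where at least one observer
# recorded the behavior; agreements on nonoccurrence are dropped.
either = (you == 1) | (lou == 1)
occurrence = 100 * (agree & either).sum() / either.sum()

# Cohen's kappa: observed agreement corrected for chance agreement.
p_o = agree.mean()
p_e = you.mean() * lou.mean() + (1 - you.mean()) * (1 - lou.mean())
kappa = (p_o - p_e) / (1 - p_e)

print(f"simple = {simple:.0f}%, point-to-point = {point_to_point:.0f}%, "
      f"occurrence = {occurrence:.0f}%, kappa = {kappa:.2f}")
# simple = 85%, point-to-point = 70%, as in the worked examples
```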

STANDARD-REFERENCED ASSESSMENT

Interpretation of an individual's performance based on comparison to state standards. Four components:
● Levels of performance: the entire range of possible student performances is divided into a number of bands or ranges.
● Objective criteria
● Examples of the student's work
● Cut scores: quantitative criteria that clearly delineate the student performance levels.

NORM-REFERENCED ASSESSMENT

Involves comparing a student's score to a normative sample, usually of students with similar demographics.
● Assumes that the characteristic being measured is normally distributed in the population.
● In order to make this comparison, the student's scores are turned into derived scores.
● Two types of comparative (derived) scores:
○ Developmental scores (boo)
■ Age equivalents
■ Grade equivalents
○ Scores of relative standing (yay!)
■ Percentiles

Normative Sample

The sample of individuals to whom an individual is compared when one obtains a derived score.
● Data (e.g., mean score) about a group's performance (the standardization sample) on a particular assessment.
● Particularly important for norm-referenced assessment instruments (authors should provide it).
● Should refer to clearly described populations. These populations should include individuals or groups with whom test users will ordinarily wish to compare their own examinees.
● Standardization Sample:
○ The group to which the individual is being compared.
○ Must be clearly described.

Standard Deviation

The square root of the variance: S = √S² = √(Σ(X − X̄)² / N). It has greater utility in understanding assessment results than variance.
● S has particular value in conjunction with a normal curve distribution:
○ We can determine the proportion (percentage) of cases that occur between the mean and a specific standard deviation.
○ S facilitates comparison of scores across assessment instruments.
● Regardless of the assessment instrument, if scores are distributed normally, we can determine the percentage of cases falling within a particular S.
● Standard deviation units are often referred to as "z-scores."
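A short sketch of the normal-curve property described above, using SciPy's normal distribution (these are the well-known 68-95-99.7 proportions):

```python
# Proportion of cases within k standard deviations of the mean on a
# normal curve; fixed regardless of which instrument produced the scores.
from scipy.stats import norm

for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} SD of the mean: {proportion:.1%}")
# within ±1 SD: 68.3%, ±2 SD: 95.4%, ±3 SD: 99.7%
```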

■ Reliability Coefficient

Of the instrument and/or of the relevant subscale or subtest of the instrument. Reliability can be generalized across:
● Similar content (items):
○ Alternate-Form Reliability
○ Internal Consistency
● Time:
○ Stability
○ Test-Retest Reliability
● Scorers or observers (will two people observing the same child agree?):
○ Inter-Rater: behavior rating systems
○ Inter-Observer: direct observation
○ Inter-Scorer: achievement tests

CORRELATION

Refers to the quantification of the relationship between two variables.
● A correlation coefficient is a numerical representation of this relationship.
● Bivariate correlations occur between two variables.
● Correlations can be used to measure both the reliability and the validity of assessment instruments.
● Can vary from −1.00 to +1.00:
○ +1.00 = a perfect positive relationship
○ −1.00 = a perfect negative (inverse) relationship
○ 0 = no relationship
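A minimal Python sketch computing a bivariate correlation coefficient; the two score sets are invented (think of them as two administrations of the same test):

```python
# Pearson correlation between two sets of scores for the same individuals.
import numpy as np

time1 = np.array([55, 62, 70, 71, 80, 84, 91])
time2 = np.array([58, 60, 73, 69, 82, 88, 90])

r = np.corrcoef(time1, time2)[0, 1]
print(f"r = {r:.2f}")  # close to +1.00: a strong positive relationship
```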

CRITERION-REFERENCED ASSESSMENTS

There must be a clear, objective criterion for each of the correct responses to each question, or to each portion of the question if partial credit is to be awarded.
● Used when we are testing a student's knowledge of a single fact or skill.
● Single-Skill Scores (e.g., drinking from a cup):
○ Dichotomous: e.g., pass/fail.
○ Multiple points along a continuum: e.g., almost always to never.
● Multiple-Skill Scores (e.g., oral reading):
○ Percentage: based on the number of possible correct responses and the number of responses attempted.
○ Retention: percentage recalled of what was learned, with verbal labels for percentage bands (e.g., independent level (>95%), instructional (85-95%), frustration (<85%)).
○ Rate: the number of correct responses within a time limit.
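A small Python sketch of the multiple-skill percentage and retention-label rules above; the cut points come from the text, while the function name and the boundary handling are ours:

```python
# Percentage correct plus the verbal retention label for that percentage.
def score_reading(correct, attempted):
    pct = 100.0 * correct / attempted
    if pct > 95:
        level = "independent"
    elif pct >= 85:
        level = "instructional"
    else:
        level = "frustration"
    return pct, level

print(score_reading(49, 50))  # (98.0, 'independent')
print(score_reading(44, 50))  # (88.0, 'instructional')
print(score_reading(40, 50))  # (80.0, 'frustration')
```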

6) Accountability

● Accountability decisions: those in which assessment information is used to decide the extent to which school districts, schools, and individual teachers are making adequate progress with the students they teach.
○ Decisions like: is the country, state, or school district making progress? Should the student receive a diploma?
● If tests are used, there should be a close relationship between the test items (content) and what is being taught (curriculum).
● By law, states, districts, and schools must demonstrate that the students they teach are making Adequate Yearly Progress (AYP). If not, sanctions are applied; if progress still is not made, students may move to different districts.

3) Recollections

● Advantages:
○ Broad: you learn about much more than one behavior.
○ Time efficient and convenient: can assess a broad range of areas in a relatively brief time and can be completed without the presence of a trained examiner.
○ Facilitates multi-method, multi-source assessment: complements direct observation and other assessments.
○ Can target the same area more than once.
● Disadvantages:
○ Self-report related problems: social desirability, memory (the longer the interval between the event and the reporting, the less trustworthy the data).
○ Can reflect the biases of teachers, etc.
○ Often not sufficiently sensitive to assess progress.

4) Using Extant Information

● Advantages:
○ By definition, the information already exists, so it is time efficient.
○ Consistent with a multi-method, multi-source approach.
○ Can complement other sources of information.
● Disadvantages:
○ Valuable information about the context of the behavior or performance might be missing.
○ No useful extant information may exist for specific areas of concern.
○ Some extant information may have unknown reliability and validity or suffer from biases.

2) Standardized Tests

● Advantages:
○ Results can be compared to norm groups, and tests can be administered to large groups.
○ Substantial research has been done with many tests.
○ Many tests have known (and strong) psychometric characteristics.
○ We know the conditions/context to which an individual is responding, which facilitates interpretation of the quantitative results:
■ Performance standard / level of attainment
■ Comparison standard / group (age, grade, etc.)
○ All interpretation should be placed within an ecological context.
● Disadvantages:
○ Psychometrically sound tests are not available for all purposes and populations.
○ Commercial viability.
○ A test might not be equally valid for all cultural groups.
○ The test needs to be in the examinee's native language.
○ Test results can be misinterpreted by the administrator and the consumer.
○ Extraneous factors, such as anxiety, can interfere with performance.
○ Many tests are not designed to provide clear and specific implications for interventions.
○ Low scores can have social and psychological consequences.
○ "The intelligence test can be turned into an engine of cruelty... it could turn into a method of stamping a permanent sense of inferiority upon the soul of a child."

1) Observations

● Advantages:
○ Can be very accurate/valid.
○ Can provide information about the context; there may be variables that trigger the problems in multiple contexts.
○ Can reduce self-report related problems such as social desirability and memory.
○ Can be sufficiently sensitive to assess progress; even small changes in behavior can be monitored.
○ Cognitive biases, such as expectancy and halo effects, can be minimized by carefully and specifically defining the behavior.
○ You see the child in the natural environment of the classroom.
● Disadvantages:
○ Narrow.
○ Time consuming, in both preparation and observation.
○ It is difficult to reliably and accurately observe a broad array of behaviors; the narrower the band of behavior, the better the reliability.
○ Can have a reactive effect on behavior: the student or teacher might act differently when you are in the room.

Ordinal Scales

● Ordered values from worse to better, providing a rank.
○ Ex. poor - ok - good - better - best.
● Assignment of numbers (or names) for the purpose of determining the ordered relations within a characteristic.
● The second weakest scale of measurement.
● The ranking does not specify the interval or distance between the ranks, and the raw scores that correspond to the ranks cannot be assumed to be equal distances apart.
○ Ex. ranking of colleges or students by class rank, or "please rank order your preferences for the following rewards, beginning with the best."
● Precludes the use of many simple statistical summaries, such as the mean. Since the differences between ratings are unknown and probably unequal, scores cannot be added, multiplied, divided, or averaged.
● Many test scores are converted into an ordinal scale because ordinal scales are easy to understand. Ex.:
○ Age equivalents (almost like a ranking, but hard to tell where the student actually stands)
○ Grade equivalents
○ Percentiles
○ Rankings

C) Eligibility for Special Education

● Eligibility decisions: use of assessment information to decide whether a student meets the state criteria for a disability condition and needs special education services to be successful in school.
○ The student must be shown to be exceptional (have a disability) AND to have special learning needs.
○ States may differ in what they consider a disability; state/federal guidelines are used to determine eligibility.
○ States can provide special services for students beyond those listed in IDEA (e.g., gifted students).
● Multiple sources of information must be used to determine eligibility.
● The IEP team is established at this level.
● Parents can resist or deny a referral, and school psychologists can ask teachers to reconsider one.

Equal Interval Scales

● Have ordered and equal intervals, but no absolute zero point.
● Equal intervals are a fundamental assumption of many basic mathematical operations, such as adding, subtracting, multiplying, and dividing.
○ Example of an arbitrary zero point: if someone fails all items on an intelligence test, it does not mean that the person has zero intelligence.
● Interval scales are often used in psychological and educational assessment.
● All assessment instruments using "standard scores" are based on an interval level of measurement.
● Because equal interval scales lack an absolute zero, we cannot construct ratios with data measured on these scales.
● A given difference is the same regardless of where it falls on the scale.

D) Instructional Planning and Modification

● Instructional planning and modification decisions: involve the collection of assessment information for the purpose of planning individualized instruction or making changes in the instruction students are receiving.
● Three types of decisions are made:
○ What to teach: content decisions, usually made on the basis of a systematic analysis of the skills that students do and do not have.
○ How to teach it: decided by trying different methods of teaching and monitoring students' progress toward instructional goals.
○ What expectations are realistic: always inferences, based largely on observations of performance in school settings and performance on tests.
● These decisions are embedded in the IEP.
○ No Child Left Behind Act: the major federal law governing delivery of elementary and secondary education; it states that schools are to use "evidence-based" instructional practices.
● Program Evaluation: gauging the effectiveness of the curriculum in meeting the goals and objectives of the school; seeing which programs work for the students.
● Resource Allocation: use of assessment information for the purpose of deciding what kinds of resources and supports individual students need in order to be successful in school.
● Selecting assessment methods should be driven by:
○ Referral information
○ Background information
○ Hypotheses about what might be causing the problems that people are concerned about (what in the environment might be causing this)
● Considerations in choosing assessment methods:
○ Multi-method: using more than one method and getting the same result gives you more confidence. Too time consuming for screening, but helpful when analyzing one child.
○ Multi-source: you want to hear information from many different sources: teachers, parents, the student, etc.
○ Multi-setting approach: someone might act a certain way in one environment and differently in another.
○ Latest version of the instrument: sometimes hard because of cost, but children and expectations for children change, so instruments have to be updated (an ethical matter).
○ Psychometrics: reliability, validity, norms.
○ Valid for the purpose of the assessment (if you want to know someone's head size, you don't measure their hand).
○ Culture and language: especially important for ELLs. There is no such thing as a culture-free assessment; everything we do is embedded in culture. How much is culture impacting the performance, and what does that mean for how you interpret the results?
○ Development: age, maturity, and what the appropriate assessment is.

Problems with Developmental Scores

● Interpolation and extrapolation
○ It is often impractical to include a sufficient number of individuals at each age (or grade) level AND month in the normative sample, so average age and grade scores are estimated for groups of children who are never tested.
● Easy to misinterpret the meaning of age and grade equivalents.
○ Ex. a 12-year-old might attain a grade level of 5.1 by failing some relatively easy items and passing some relatively difficult items, whereas a 10-year-old attains the same score by passing all the easier items and none of the more difficult ones. This can be misleading, because parents might assume that their child is doing well or poorly relative to the "average" without understanding that there are different routes to the same score.
○ Ex. an age equivalent score for a preschool child that reflects a 6-month delay may have more practical significance than a 6-month delay for a 14-year-old.
● Promotes inappropriate stereotypical thinking.
○ The "average" child does not exist; it is a convenient abstraction. There is variation in performance at every grade and age level.
● Implies a false standard of performance.
○ Very few students are exactly at age (grade) level. Assessment instruments are constructed so that half the individuals in an age group will score above the mean of their group and half will score below.
● Age/grade equivalents cannot be assumed to have equal intervals.
○ Developmental patterns are not linear. A six-month delay at age 4 might result from missing several items, whereas the same delay at age 16 might result from failing two items.
● Age/grade equivalents are nevertheless intuitively appealing to the public and to parents.

REPRESENTATIVENESS:

● The norm sample should contain students with "relevant" characteristics.
● These characteristics should occur in the same proportion in the norm sample as they occur in the population.
● Including individuals from all groups is important because:
○ (a) to the extent that individuals from various backgrounds perform differently, test parameters (i.e., means and variances) are likely to be biased without the inclusion of individuals from different backgrounds;
○ (b) because test authors frequently base their final selection of items on the performance of the norm group, exclusion of individuals from diverse backgrounds can bias test content.
● Proportional Representation: with regard to relevant characteristics, the norm sample must be carefully compared to the overall population. Representativeness is judged with respect to a specific norm group that is appropriate for an individual.

B) Pre-Referral (Tier 1) :

● Pre-referral: before a student is referred for an eligibility (special education) assessment.
○ Ex. a teacher consults with a school psychologist or with an intervention assistance team (student support team).
○ Happens while a student is still in general education classes.
● Doesn't necessarily follow screening; sometimes problems are not caught in screening.
● The pre-referral process is mandatory in most states, but its formality varies from state to state.
● The referral question or problem should be used as a basis for conceptualizing the assessment case.
● A teacher or parent has to ask for this to happen before the student is placed in special education. Teachers don't always want or understand pre-referrals and may just want the student to go immediately to special education.
● "The psychologist must look at the referral statement not as a statement of the problem but as only one of many possible sources of data. For this reason, the assessor must approach requests for help with guarded skepticism, to avoid accepting implicit or explicit a priori assumptions inherent in the referral."

THE RELIABILITY COEFFICIENT

● Reliability Coefficient: indicates the proportion of variability in a set of scores that reflects true differences among individuals (symbolized as r with a double subscript, e.g., r_xx or r_aa).
○ Different individuals attain different scores. To what extent do these differences reflect true differences, or merely error?
● The reliability coefficient is a ratio: the variance of true scores over the variance of obtained scores for a distribution.
○ Subtracting the proportion of true-score variance from 1 yields the proportion of error variance in the distribution of scores.
○ Ex. if the variance of true scores = 900 and the variance of obtained scores = 1,000, then reliability = 900/1000 = .90 (and the proportion of error variance = .10).
● A perfect reliability coefficient = +1.0:
○ Obtained variance = true variance (no error).
○ For assessment instruments, a coefficient of at least .90 is usually considered minimally acceptable for eligibility decisions, and .80 for other decisions.
● Assessment manuals should indicate at least two types of reliability: time (test-retest) and content/items.

Three types of reliability:
● Item Reliability: the extent to which we can generalize to different samples of items.
● Stability: the consistency of test scores over time.
● Interobserver Agreement: the consistency among test scorers.

A) Screening (tier 1):

● Screening: the collection of assessment information for the purpose of deciding whether students have unrecognized problems.
● The initial stage of assessment.
● Asks whether the student is "at risk." The intent is to determine if more specific assessment or intervention is needed.
● Can screen for academic problems, behavioral problems, medical conditions, "school readiness," or specific conditions.
○ Universal screening: an entire school or a large group of students can all easily participate.
● Preventative: can catch problems early, before they become severe.

■ Standard Error of Measurement (SEM)

● When assessment results cannot be generalized, they are in error and the results are incorrect. The errors associated with each of the three dimensions (items, times, raters) are independent of each other (i.e. uncorrelated) and additive. Measurement error is the sum of the three kinds of error.
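The notes stop before giving the SEM formula itself. The standard formula (supplied here, not taken from these notes) is SEM = S√(1 − r), where S is the test's standard deviation and r its reliability coefficient; a short sketch of its most common use, building a confidence band around an obtained score:

```python
# Standard error of measurement: SEM = S * sqrt(1 - r). Values below are
# hypothetical (an IQ-style scale with SD 15 and reliability .90).
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

s = sem(sd=15, reliability=0.90)
low, high = 100 - 1.96 * s, 100 + 1.96 * s  # 95% band around a score of 100
print(f"SEM = {s:.1f}; 95% confidence band: {low:.0f}-{high:.0f}")
```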

