III: Assessments
Assessment Purposes
Assessment is a process by which teachers can evaluate student learning and achievement, the success of a lesson, or other elements of classroom experience. A teacher should view assessment as the criterion for determining whether an instructional goal has been achieved.
Assessment Measures
...
Commonly Used Standardized Tests
...
IIIA(4): Assessment Tools
...
IIIB-1: Types and Purposes of Standardized Tests
...
IIIA(1) - Types of Assessments
...
Performance Assessment (Testing Alternatives)
A performance assessment is a specific type of observation frequently used for assessment of procedural knowledge (e.g., skills). In some cases, students are assessed as they perform a particular procedure (e.g., mixing chemicals in the lab, playing piano); in other cases, the product is assessed (e.g., the color of the liquid in the test tube after mixing, the piano concerto composed by the student). Performance-based assessments tend to focus on procedural knowledge, not specific content knowledge. Performance assessment refers to the type of activity involved—students are evaluated based on their performance on a task—and can be used for formative, summative, or diagnostic purposes.
Testing Alternatives
Alternative assessment approaches include observations, performance evaluations, portfolios, and conferencing, among others. Used alone or in combination with testing, these qualitative assessments can provide a more complete picture of a student and his or her achievements and abilities.
Peer Evaluation
Although peer evaluation in the classroom is a meaningful method for students to gain feedback on their work, the teacher must provide specific criteria for peers to consider before allowing students to assess one another's work. Students, like teachers, need to be clear on the evaluation criteria before beginning any assessment; this supports reliability and some objectivity. A system for discussion/challenge of peer assessments allows the comparisons necessary for the reliability of scores to be tested. Reliability here refers to the consistency of scoring: for example, would students give the same scores that trained teachers, or different peers, would give?
Analytic Scoring
Analytic scoring provides sub-scores or individual scores for various parts of the assessment. Note that analytic and holistic scoring systems can be criterion referenced or norm referenced, depending on whether they are reported according to benchmarks/criteria or according to a comparison with others' performance.
Assessments in the Lesson Planning Process
Effective teachers consider assessment when first planning a lesson, and many make the assessment tools (e.g., rubrics, scoring systems) available to students at the start of the lesson. Doing so helps students focus their attention on the learning goals and identify the cognitive processes necessary to achieve those goals, and it provides ongoing feedback about what has and has not yet been learned. To be useful, assessments must be clearly tied to learning objectives and any content standards, but they should also be flexible enough that the teacher doesn't end up "teaching to the test." Wise teachers also realize that no single assessment measure should serve to determine a student's achievements, aptitudes, or abilities. What's critical is that the proper assessment be selected for the intended purpose. For example, if the instructional objective involves students' ability to use various scientific formulas successfully in hypothesis testing, then any assessment should measure those procedures, rather than mere recall of the formulas themselves. If the instructional goal includes students' use of proper spelling in an essay, the assessment should require students to write an essay, rather than identify misspelled words, either in or out of context.
Formative Assessments
Formative assessment is conducted during implementation, often to evaluate the need for change. Formative evaluations are designed to provide information regarding what students know and can do before or during instruction. For example, a geography teacher may interrupt a lecture to say, "Please make a list of all the state capitals you can think of right now," or an English teacher might say, "Take out your journals and summarize the main points we've been discussing today." Use of formative assessment helps to drive the direction of the lesson — allowing teachers and students to judge whether the lesson is going well, whether students need additional practice, or whether more information needs to be provided by the teacher. Teachers engaging in reflective practice often use formative assessments to evaluate their own achievements in light of student performance as well (e.g., how well am I getting the point across? am I providing enough opportunities for critical thinking? etc.).
Holistic Scoring
Holistic scoring involves summarizing a student's performance on an assessment with a single score reflecting an overall impression.
Central Tendency (Norm-Referenced)
Measures of central tendency are indicators of the score that is typical or representative of all test takers (i.e., the distribution of scores).
Percentile Rankings (Norm-Referenced)
Percentile rankings are one type of norm-referenced score. In percentile ranking, each student's individual score is compared with the individual scores of other students taking the same test at the same time. The percentile rank shows the percentage of students in the group who scored equivalent to or below a particular raw score — not the percentage of correct answers. For example, if a student accurately answered 83% of the questions on a test and had the highest score in the class, the student would have a percentile rank of 100% — 100% of the group scored at or below 83% correct. In this situation teachers would describe this student as "in the top 1% of the class." Other types of norm-referenced scoring require an understanding of basic descriptive statistics, including measures of central tendency and variance.
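The percentile-rank definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `percentile_rank` and the ten class scores are hypothetical, and the calculation follows the "at or below" wording used in these notes (published tests may use slightly different formulas).

```python
def percentile_rank(scores, raw_score):
    """Percentage of scores in the group that are at or below raw_score."""
    at_or_below = sum(1 for s in scores if s <= raw_score)
    return 100 * at_or_below / len(scores)


# Hypothetical class of ten students; values are percent-correct scores.
class_scores = [52, 58, 61, 64, 67, 70, 72, 75, 79, 83]

# The top scorer answered 83% correctly, yet the percentile rank is 100.
print(percentile_rank(class_scores, 83))  # 100.0
# A student with 67% correct has half the group at or below that score.
print(percentile_rank(class_scores, 67))  # 50.0
```

Note that the percentile rank says nothing about how many items were answered correctly; it only locates the score within the group.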
Authentic Assessments - Performance Assessments (Testing Alternatives)
Performance assessments are often described as "real-world" or "authentic" experiences because they assess a student's ability to transfer and apply learned skills to novel situations. Performance assessments are well suited for the arts and for laboratory sciences; they are also useful as authentic assessments that emphasize skills used outside the classroom in the "real world." For example, a chemistry performance assessment may include the prompt, "Is this sample of water safe to drink?" Because of their relevance and hands-on characteristics, performance assessments may increase student motivation, especially when used formatively.
Practicality
Practicality refers, broadly, to ease of use. For example, when evaluating practicality, teachers may ask, Is the measure affordable given the budget? Can it be administered by current staff, or with little training? Is special equipment needed? Can it be completed in the time allotted? Sometimes, measures that are standardized, reliable, and valid are simply impractical given the circumstances.
Achievement Tests
Purpose: To assess how much students have learned from what they have specifically been taught.
General Description: Test items are written to reflect the curriculum common to many schools. Test scores indicate achievement only in a very broad and (usually) norm-referenced sense: They estimate a student's general level of knowledge and skills in a particular domain relative to other students across the country.
Special Considerations: These tests are usually more appropriate for measuring general levels of achievement than for determining specific information and skills that students have and have not learned.
Formal Performance Assessment
Reliability: It is often difficult to score performance assessment tasks reliably. Teachers can enhance reliability by specifying scoring criteria in concrete terms.
Standardization: Some performance assessment tasks are easily standardized, whereas others are not.
Validity: Performance tasks may sometimes be more consistent with instructional goals than paper-pencil tasks. A single performance task may not provide a representative sample of the content domain; several tasks may be necessary to ensure content validity.
Practicality: Performance assessment is typically less practical than other approaches. It may involve special materials, and it can take a fair amount of class time, especially if students must be assessed one at a time.
Formal Paper-Pencil Assessment
Reliability: Objectively scorable items are highly reliable. Teachers can enhance the reliability of subjectively scorable items by specifying scoring criteria in concrete terms.
Standardization: In most instances paper-pencil instruments are easily standardized for all students. Giving students choices (e.g., regarding topics to write about or questions to answer) may increase motivation, but it reduces standardization.
Validity: Using numerous questions that require short, simple responses can make an assessment a more representative sample of the content domain. Tasks requiring lengthier responses may sometimes more closely match instructional goals.
Practicality: Paper-pencil assessment is usually practical. All students can be assessed at once, and no special materials are required.
Standardization
Standardization refers to uniformity in the content and administration of an assessment measure. In other words, standardized measures have similar content and format and are administered and scored in the same way for everyone. When tests are standardized, teachers have a way to compare the results from diverse populations or different age groups. For example, if a child takes the same standardized achievement test in both the third and fourth grades, a teacher (or the parents) can compare the results to determine how much the child learned in the intervening time. Using measures that are standardized reduces bias in testing and scoring.
Selected Response Tests (Objective Tests)
The element that distinguishes selected response from other assessment models is that there are right and wrong answers that can be impartially scored. Well-constructed selected-response tests (such as multiple choice) have clear correct and incorrect answers that can be easily identified by students with the proper knowledge base. Typically, they test recognition memory rather than recall and product more than process, although good selected-response tests can address some procedural knowledge and also higher-level learning such as critical thinking.
Rubrics
The main distinction between an analytical rubric and a holistic rubric is that an analytical rubric measures performance on each element of the assessment, whereas a holistic rubric measures performance on the assessment as a single entity. Holistic rubrics can use numerical values or descriptive terms and can be used for measuring tasks in any discipline. Additionally, both types of rubrics can be created entirely by the teacher or in cooperation with students or colleagues. The greatest impact of rubrics on student achievement comes from encouraging students to be more thoughtful.
Mode - Central Tendency (Norm-Referenced)
The mode, the score that occurs most often, is another measure of central tendency, although it's less often used to characterize student performance on an assessment measure.
Mean - Central Tendency (Norm-Referenced)
The most frequently used measure of central tendency is the mean, which is simply the arithmetic average of a group of scores.
Reliability
The reliability of an assessment instrument refers to its consistency in measurement. In other words, if the same person took the same test more than once under the same conditions and received a very similar score, the instrument is highly reliable. If an assessment instrument is not reliable, teachers cannot use the results to draw inferences about students' achievement or abilities.
Validity
The validity of an assessment instrument refers to how well it measures what it is intended to measure. For example, a final exam with only 10 multiple-choice questions is probably not a valid measure of the amount of information a student has learned in an entire term, nor is it likely a valid measure of the skills a student has learned during that same time period. Note that the validity of any measure depends on the purpose and context of its intended use. The same assessment instrument may be valid for some purposes and less valid for others. For example, a performance assessment may be a valid measure of laboratory skills but not a valid measure of content learned in science class. Measures that are not valid should not be used.
Validity Test Considerations
Validity refers to the extent to which an assessment instrument actually measures what it is intended to measure (typically, the content that has been taught). The validity of a test can generally be increased by including test items that measure only the content that has been taught.
Variance
Variance refers to the amount of spread among scores.
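As a rough illustration of "spread," the sketch below compares two hypothetical sets of scores that share the same mean but differ in variance. It assumes the population variance (`statistics.pvariance`) is the intended measure; the notes do not say whether population or sample variance is meant.

```python
from statistics import mean, pvariance

# Two hypothetical groups with the same average score.
narrow = [68, 69, 70, 71, 72]   # scores cluster tightly around 70
wide = [50, 60, 70, 80, 90]     # scores are spread far from 70

print(mean(narrow), mean(wide))            # both means are 70
print(pvariance(narrow), pvariance(wide))  # 2 versus 200
```

The identical means hide a large difference between the groups, which is why variance is reported alongside a measure of central tendency.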
Standard Scores and z-Scores
When instructors know both the mean and standard deviation for a group of scores, they can easily determine how any individual compares with the larger group. Sometimes, however, scores are reported as standard scores, which are derived from the standard deviation. A z score, for example, indicates how many standard deviations above or below the mean a particular score is. For example, if the mean of a test is 60 and the standard deviation is 5, a student scoring a 55 will have a z score of -1 and a student scoring a 70 will have a z score of 2.
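The z-score example above can be checked with a short sketch. The function name `z_score` is hypothetical; the numbers are the ones given in the example (mean of 60, standard deviation of 5).

```python
def z_score(raw_score, mean, std_dev):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw_score - mean) / std_dev


print(z_score(55, 60, 5))  # -1.0 (one standard deviation below the mean)
print(z_score(70, 60, 5))  # 2.0 (two standard deviations above the mean)
```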
Median - Central Tendency (Norm-Referenced)
When there are a few very high or low scores, the median may be a better representative of the central tendency of a group. The median is the middle score in a ranked list of scores. By definition, half the scores are larger than the median, and half are smaller.
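A small sketch, using hypothetical scores, shows why the median can represent a group better than the mean when there is an extreme score. It uses Python's standard `statistics` module to compute the three measures of central tendency described in these notes.

```python
from statistics import mean, median, mode

# Hypothetical scores: six students score in the 70s, one scores 30.
scores = [30, 70, 72, 72, 75, 78, 79]

print(mean(scores))    # 68, pulled down by the single low score
print(median(scores))  # 72, the middle score in the ranked list
print(mode(scores))    # 72, the score that occurs most often
```

Here the mean falls below every score but one, while the median still sits among the typical scores.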
III-A(6). Assessment Formats
Written tests may be the most commonly used assessment measures.
IIIB-2: Test Scoring
...
Age-Equivalent Scores (Norm-Referenced)
Age-equivalent scores are computed by comparing one person's performance to the average score for all individuals at the same age taking the same test.
Evaluating RSVP Considerations of Different Kinds of Assessments
Assessment measures, and the teachers who use them, need to be fair and unbiased. When creating or selecting assessments, teachers must look for bias in content (e.g., material known by only one culture) or in administration (e.g., assessing students with limited English-language skill is challenging). Teachers need to remain aware of the diversity of the student population when considering standardization, validity, and practicality.
Criterion-Referenced Scoring
Criterion-referenced scores indicate how well an individual met specific standards, such as percentage of correct responses or "advanced" responses on a rubric. State standards are criteria used in criterion-referenced scoring. Criterion-referenced scores specify how one student's raw score compares with an absolute standard based on the specific instructional objectives. For example, if a 100-point test is constructed to sample the content of one semester, then a student who has a raw score of 58 can be said to have mastered approximately 58% of the course material. Note that when a criterion-referenced scoring system is used, each student is evaluated against the standard (i.e., the criterion), not against other students. In many classrooms, teachers assign letter grades based on criterion-referenced scores (e.g., 90% or above = A). Note that criterion-referenced scoring systems need not be point totals; rubrics that provide detailed descriptions of expected performance at each scoring level are also criterion-referenced.
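Criterion-referenced letter grading can be sketched as a lookup against fixed cutoffs. Only the "90% or above = A" cutoff comes from the notes; the remaining cutoffs and the name `letter_grade` are assumptions made for illustration.

```python
# Only the 90 = A cutoff is from the notes; B, C, and D cutoffs are assumed.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def letter_grade(percent_correct):
    """Compare one student's score with fixed standards, not with other students."""
    for minimum, grade in CUTOFFS:
        if percent_correct >= minimum:
            return grade
    return "F"


print(letter_grade(92))  # A
print(letter_grade(58))  # F under the assumed cutoffs
```

Because each score is compared only with the standard, every student in a class could earn an A, or none could.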
Standardized Testing Criticisms
Critics of standardized testing caution against its use as the primary measure of student achievement and teacher accountability because this results in school districts' using the test content to dictate curriculum and instructional practice. Although curriculum, practice, and assessment should be linked, allowing standardized test content to dictate instructional practices (aka "teaching to the test") does not result in generalized learning.
Diagnostic Assessments
Diagnostic assessments are intended to identify what students know before instruction. Although sometimes used as pretests at the start of a term or unit, diagnostic tests are more commonly used to identify exceptionalities in learning, including disabilities and giftedness. IQ tests, for example, were originally designed for diagnostic purposes. Diagnostic assessments are frequently conducted outside the classroom by education specialists or school psychologists.
Direct Observations (Testing Alternatives)
Direct observations of what students say and do in the classroom can be recorded as anecdotal or running records, which qualitatively capture the flavor of the behaviors, or they can be guided by checklists or rating scales that allow teachers to quantify the observations. These techniques can be used individually or in combination. For example, a teacher may keep a detailed account of behavior as it occurs (e.g., a running record of playground activity that includes aggression among children) and may then use a rating scale to evaluate a particular behavior that occurs during that interval (e.g., very aggressive, moderately aggressive, relatively neutral). Direct observations can be especially useful when both verbal and nonverbal behaviors are recorded (e.g., one observation during free play can provide information about both physical coordination and social skills). Teachers can also review observational records to identify patterns of behavior over time. Note, however, that when observing, teachers must strive for objectivity, and that can sometimes be difficult when the teacher has formed expectations for and relationships with the students.
IIIB-3(a): Evaluating the Quality of Assessment Measures
Each assessment format has specific strengths and limitations; the choice of format ultimately depends on the specific educational context and instructional objectives. To make the determination, teachers rely on four characteristics, which also help determine the quality of any particular assessment tool. The acronym RSVP is used to help recall these characteristics: reliability, standardization, validity, and practicality.
IIIB(5): Interpreting and Communicating Test Scores
Effective teachers can accurately interpret assessment measures — they understand the connections between objectives and measures, and they can make valid inferences from the data regarding a student's ability, aptitude, or performance. Furthermore, effective teachers must explain results of assessments using language appropriate for the audience, whether that audience includes the students themselves, the parents, school administrators, or government officials.
Essay Tests
Essay tests, also known as free-response tests, are an alternative to objective tests; essays require students to create their own answers, rather than select from a set of possible responses. Essays can be quick to construct, although they can be challenging to grade fairly. They are, however, a good teaching tool as well as an assessment measure: When students create responses on essay tests, they are likely to engage in higher-level thinking skills and are better able to transfer their knowledge to other situations outside the testing environment.
Conferences (Testing Alternatives)
Assessments can also take the form of one-to-one conferences between a student and the teacher. Conferences need not be oral exams; they can be an informal method for learning more about what the student knows, thinks, or feels and how the student processes learning. Teachers should take care to ensure that conferences are nonthreatening to students, keeping in mind that they must also be focused to yield useful results. A conference may or may not include feedback: when a conference is used just for assessment, the teacher is collecting information about the student but not offering conclusions based on that information. Teachers can learn about students' feelings, conceptual understanding, and prior knowledge through other forms of assessment (e.g., surveys, tests, class discussion). Individual thought processes, however, require introspection and often guided questions, which are best done in conferences.
Formal Assessments
Formal assessments, in contrast to informal assessments, are planned and structured, although they can be used for formative as well as summative evaluation. For example, a pop quiz can be formative if the teacher uses the information not as an assessment of each student's final knowledge but rather to identify areas that need additional instruction, whereas the same quiz at the end of a unit may be summative. A variety of measures can be used for formal assessments.
Grade-Equivalent Scores (Norm-Referenced)
Grade-equivalent scores show the grade-level year before the decimal and the month of the school year after it. In a score of 5.4, for example, the 5 refers to fifth grade and the 4 refers to December, assuming school starts in September. This score reflects an estimate of the test performance of the average student in that grade at that particular time. Thus a score of 5.4 indicates that an average fifth grader would receive this score in December of the fifth-grade school year. Grade-equivalent scores are generally computed by comparing one person's performance to the average score for all students in the same grade taking the same test. For example, if the average (raw) score for all eighth graders taking a reading achievement test in the first semester is 72, then any student who scores a 72 would have an eighth-grade equivalent score; students scoring above 72 would be performing comparably to students in a later semester or in a higher grade.
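Reading a grade-equivalent score can be sketched as splitting it at the decimal point. The helper `read_grade_equivalent` is hypothetical, and the month table assumes that school starts in September and that September counts as month 1, which matches the example in these notes; individual test publishers may number months differently.

```python
# Assumed school calendar: September is month 1, so month 4 is December.
MONTHS = ["September", "October", "November", "December", "January",
          "February", "March", "April", "May", "June"]


def read_grade_equivalent(score):
    """Split a score written as text, such as "5.4", into (grade, month name)."""
    grade_text, month_text = score.split(".")
    month_number = int(month_text)
    if not 1 <= month_number <= len(MONTHS):
        raise ValueError("month digit is outside the assumed calendar")
    return int(grade_text), MONTHS[month_number - 1]


print(read_grade_equivalent("5.4"))  # (5, 'December')
```

The score is passed as text rather than as a number so that the digit after the decimal point is read exactly as written.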
Assessment Reporting Ethics
To be ethically responsible, a teacher should include the limitations of the assessment method when explaining what assessment data indicate about a student's abilities, aptitude, or achievement. Evaluations should be fair and unbiased, and thus ethics requires that the teacher present information about the limits of the assessment method. It is typically good practice to present information about the conditions under which it was given, although it's not unethical to omit that information (i.e., the teacher has no obligation to present this information). Nor is it unethical to omit information about the performance of the student's peers or the role of assessment in the learning process, although these pieces of information can be useful for parents or others trying to interpret the results.
General Reporting Guidelines
1. In the United States, test scores are confidential information under the Family Educational Rights and Privacy Act (FERPA). Teachers can share assessment results with the student, that student's parents or guardians, and any school personnel directly involved with the student's education. Teachers cannot post scores publicly or in a fashion that allows identification (e.g., by social security number), nor can teachers leave a stack of graded papers for students to pick up.
2. Teachers should be well informed about the test when communicating results to parents or students. Sometimes it's best to use general statements when communicating assessment results (e.g., "your child is on target for children of her age"), but if a parent asks for more detailed or specific information, FERPA requires that it be given. For example, if a student scores at the "proficient" level on a standardized achievement test, a teacher should be able to explain the various other levels, the percentage of students achieving this score, and the reliability of the test.
3. Be attentive to the feelings of the students and/or the families involved. "Your child scored significantly below the rest of the class" may be truthful, but an effective teacher should communicate in a positive, encouraging fashion.
4. Attend to differences in language and culture when discussing assessment results. Be sure that everyone understands the data and the implications.
Portfolios (Testing Alternatives)
A portfolio is a collection of a student's work systematically collected over a lengthy time period. Portfolios can include any number of different items: writing samples, constructions or inventions, photographs, audiotapes, videotapes, and so on. They also frequently include reflections, which are the students' own evaluations and descriptions of their work and their feelings about their achievements. Because of their diversity, portfolios can capture a broad picture of the student's interests, achievements, and abilities and are best used for summative purposes. Student selection of portfolio content and the reflection process both encourage critical thinking, self-regulation and self-evaluation, and metacognitive skills. In addition, students' pride in their work, when collected and displayed in their portfolios, may increase self-esteem and motivation. Portfolios are not standardized; the flexibility of these broad, in-depth alternative measures of student progress is perhaps their greatest benefit, as long as they can be evaluated fairly.
III - Assessments
Assessment refers to the process of drawing inferences about a student's knowledge and abilities based on a sample of the student's work. Assessment goes well beyond just assigning grades. When used successfully, assessment results can provide valuable information about students' achievements and motivations to teachers, parents, students, and educational administrators as well as information about the success of the teacher in meeting his or her personal and professional goals.
Informal Assessments
Informal assessments are spontaneous measures of student achievement. For example, teachers who listen to the types of questions students ask during a lesson are informally assessing the degree to which they comprehend the lesson. Similarly, when teachers observe children during daily tasks — at play, with peers, at their desks, during routines — they are informally assessing them. Informal assessments are not graded and are primarily used for formative purposes: Data collected via informal assessments offer continuous feedback regarding the daily lessons, classroom experience, student motivation, and so on.
Norm-Referenced Scoring
Norm-referenced scores reflect an individual's performance in comparison to other test takers, not against the benchmarks themselves. A norm-referenced score is determined by comparing a student's performance with the performance of others. For example, teachers using a norm-referenced scoring system may determine that the top 10% of scores earn As, the next 10% earn Bs, and so on — regardless of the students' raw scores. Many teachers (usually incorrectly) refer to norm-referenced scoring as "grading on a curve." Norm-referenced scoring is most common in standardized testing but can also be used in other classroom settings (e.g., some instructors grade holistically rather than using a rubric: "This essay is the best in the class and thus earns an A; these two are almost as good and thus earn A-"; note the subjectivity in this type of grading).
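The "top 10% earn As, the next 10% earn Bs" example can be sketched as grading by rank. The function `norm_referenced_grades`, the student names, and the scores are hypothetical; because the notes leave the remaining bands unspecified ("and so on"), students below the B band receive the placeholder "other" here.

```python
def norm_referenced_grades(scores_by_student, bands=(("A", 0.10), ("B", 0.10))):
    """Assign grades by rank; each band takes a fixed share of the group.

    Students below the listed bands get the placeholder "other".
    Ties are broken by the order in which students were entered.
    """
    ranked = sorted(scores_by_student, key=scores_by_student.get, reverse=True)
    n = len(ranked)
    grades = {}
    start = 0
    for grade, share in bands:
        end = start + round(share * n)
        for student in ranked[start:end]:
            grades[student] = grade
        start = end
    for student in ranked[start:]:
        grades[student] = "other"
    return grades


# Hypothetical low-scoring class of ten: the best raw score is only 62.
low_class = {"Ana": 62, "Ben": 58, "Cy": 55, "Di": 51, "Eli": 49,
             "Fay": 47, "Gus": 44, "Hal": 40, "Ida": 38, "Jo": 35}

grades = norm_referenced_grades(low_class)
print(grades["Ana"], grades["Ben"], grades["Cy"])  # A B other
```

A raw score of 62 earns an A in this group because the grade depends only on rank, which is the "regardless of the students' raw scores" point made above.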
Objective Tests
Objective tests (i.e., selected responses) include multiple-choice and matching tests. These tests are popular for many reasons: they can be scored easily and objectively, and they are efficient and usually inexpensive to administer. Because many objective tests require students to recognize a correct answer, they are best used to assess lesson content that is highly structured or concrete; however, well-designed objective tests can also be used to assess higher-level thinking, such as application or analogical reasoning.
General Scholastic Aptitude and Intelligence Tests
Purpose: To assess students' general capability to learn; to predict their general academic success over the short run.
General Considerations: Test items typically focus on what and how much students have learned and deduced from their general, everyday experiences. For example, the tests may include items that ask students to define words, draw logical deductions, recognize analogies between seemingly unrelated topics, analyze geometric figures, or solve problems.
Special Considerations:
• Test scores should not be construed as an indication of learning potential over the long run.
• Individually administered tests (in which the tester works one-on-one with a particular student) are preferable when students' verbal skills are limited or when exceptional giftedness or a significant disability is suspected.
Specific Aptitude and Ability Tests
Purpose: To predict how well students are likely to perform in a specific content domain.
General Considerations: Test items are similar to those in general scholastic aptitude tests, except that they focus on a specific domain (e.g., verbal skills, mathematical reasoning). Some aptitude tests, called multiple aptitude batteries, yield sub-scores for a variety of domains simultaneously.
Special Considerations:
• Test scores should not be construed as an indication of learning potential over the long run.
• Tests tend to have only limited ability to predict students' success in a particular domain and so should be used only in combination with other information about students.
Informal Assessment
Reliability: A single, brief assessment is not a reliable indicator of achievement. Teachers must look for consistency in a student's performance across time and in different contexts.
Standardization: Informal observations are rarely, if ever, standardized. Thus teachers should not compare one student to another on the basis of informal assessments alone.
Validity: Students' "public" behavior in the classroom is not always a valid indicator of their achievement (e.g., some may try to hide high achievement from peers, others may come from cultures that encourage listening more than talking).
Practicality: Informal assessment is definitely practical. It is flexible and can occur spontaneously during instruction.
Scoring Assessment Measures
Scoring selected-response measures can be quite easy, especially if the test is carefully constructed. In general, the person scoring the test must only identify whether the test taker selected the correct response for each item; scoring of this sort is objective and fast. Evaluating and/or grading the alternative assessments objectively can be much more challenging, especially when a holistic scoring system is used. Holistic scoring refers to an assessment method in which an overall score is determined based on the teacher's impression of the quality of work. Performance assessments, essays, and portfolios are frequently scored holistically. In contrast, an analytic scoring system is a quantitative measure in which individual components of a project, portfolio, performance, or essay are scored individually and then the scores are added together to form an overall score or grade. For example, rubrics are often used to score alternative assessment measures. In general, a rubric includes a list of characteristics that responses may include and that will be considered in the evaluation. More specifically, rubrics stipulate the scoring dimensions in terms of content or process (e.g., writing style, introduction, required facts) and a scale of values for evaluating each dimension (e.g., beginner, developing, advanced; some rubrics use a point scale or letter grade). Good rubrics also include clear explanations and examples of expected responses at each level of the scale. Additionally, the individual components in the rubric are often weighted — perhaps, for example, content is weighted more than writing style. Note that, as mentioned previously, instructors can give scoring rubrics to students at the start of a lesson to help students identify and work toward optimal performance.
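Analytic scoring with a weighted rubric can be sketched as a weighted sum of per-dimension ratings. The dimension names and scale labels are taken from the examples above, but the point values, the weights, and the function `analytic_score` are assumptions chosen only to illustrate the arithmetic.

```python
# Assumed point value for each level of the scale.
LEVELS = {"beginner": 1, "developing": 2, "advanced": 3}

# Assumed weights: required facts (content) count more than writing style.
WEIGHTS = {"required facts": 5, "introduction": 2, "writing style": 3}


def analytic_score(ratings, weights=WEIGHTS, levels=LEVELS):
    """Weighted sum of the ratings, reported as a percentage of the maximum."""
    earned = sum(weights[dim] * levels[ratings[dim]] for dim in weights)
    possible = max(levels.values()) * sum(weights.values())
    return 100 * earned / possible


essay = {"required facts": "advanced",
         "introduction": "developing",
         "writing style": "beginner"}

# (5*3 + 2*2 + 3*1) / (3 * 10) = 22 / 30
print(round(analytic_score(essay), 1))  # 73.3
```

A holistic score, by contrast, would be a single overall judgment with no per-dimension breakdown to add up.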
Standardized Tests
Standardized tests are developed by test construction experts and are used in many different schools or settings; in this context, standardized means that everyone takes the same test in the same way. Standardized tests can include both objective and essay components. The SAT, for example, has combined selected-response items (the verbal and math sections) with a free-response section (the writing test). In the era of high-stakes testing, standardized tests are becoming more common.
Student Self-Assessment
Students self-assess when they evaluate their own work, both process and product. Self-assessment questions are not aimed primarily at increasing motivation, whether intrinsic or extrinsic, or at developing oral expressive language; the focus is on student introspection and metacognition, which are elements of self-evaluation. If used correctly, student self-assessment is most valuable to educators because it provides information on students' strengths and weaknesses that is not revealed through other means. Students in a nonthreatening environment, which can be created through self-assessment, are honest about their abilities, and thus self-assessment can provide information that educators might not otherwise have. The ultimate goal of assessment is the accurate evaluation of students' achievement.
Summative Assessments
Summative assessment is conducted after instruction to assess students' final achievement. Whereas formative assessments generally address the question, "What are students learning?" summative evaluations address the question, "What have students learned?" In other words, summative assessments provide information regarding what students know or have achieved following instruction. Final exams are summative assessments; so are high-stakes achievement tests. As with all assessments, it's critical that any measure designed for summative assessment adhere closely to the learning objectives.
IIIA(5). Self-Assessment and Peer Assessment
Teachers who focus on self-directed learning often encourage students to engage in self-assessment, in which students have input in determining their grades, based on reflection and objective evaluation of their work. In other situations, students evaluate each other's work. In this case, students should have an opportunity to challenge or discuss a peer-assigned grade. In general, self-assessment and peer assessment allow students to serve as agents of their own learning and can lead to increased motivation for schoolwork. However, the teacher must guide the process, sometimes by providing standards for evaluation and other times by facilitating a discussion in which students come to agreement regarding those standards and the procedures to follow. Once standards are developed, students can use checklists, rubrics, rating scales, observations, or other tools to identify the extent to which they, or their peers, have met those standards. Journals can be particularly effective for encouraging less formal and more reflective, qualitative assessments.
Raw Scores
The most common type of score, used most frequently for classroom (non-standardized) tests, is the raw score. A raw score indicates the number of correct responses on a particular assessment measure. For example, on a quiz with 10 points, a student can earn 0, 1, 2, 3 points, and so on up to a raw score of 10. Interpreting a raw score requires knowledge of the test — for example, a score of 3 is only useful to someone who knows the total number of questions. For that reason, raw scores are often transformed into criterion-referenced or norm-referenced scores, especially when grades are attached.
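To see why a raw score needs context, the short Python sketch below converts a raw score into a percent-correct score and compares it with a mastery cutoff, which is one simple way a raw score might be given a criterion-referenced reading. The function names and the 80 percent cutoff are assumptions made for illustration only.

```python
def percent_correct(raw_score, total_items):
    """Convert a raw score (number correct) into a percentage."""
    if total_items <= 0 or not 0 <= raw_score <= total_items:
        raise ValueError("raw_score must be between 0 and total_items")
    return 100 * raw_score / total_items


def meets_criterion(raw_score, total_items, cutoff_percent=80):
    """Hypothetical mastery check: the 80 percent cutoff is an assumption."""
    return percent_correct(raw_score, total_items) >= cutoff_percent


# The same raw score of 3 means very different things on different tests.
print(percent_correct(3, 10))  # 30.0
print(percent_correct(3, 4))   # 75.0
print(meets_criterion(9, 10))  # True
```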
Standard Deviation
The most frequently used measure of variability is the standard deviation, a measure of how much the scores differ from the mean. The larger the standard deviation, the more spread out the scores are in the distribution. The smaller the standard deviation, the more the scores are clustered around the mean. For example, if everyone scores 50% on a test, the mean is 50 and the standard deviation is zero, because the scores do not vary at all. If one person scores 52 and one person scores 48 on a test with a mean of 50, the standard deviation is greater than zero but still small; if many students score above 65 and below 40, the standard deviation will be relatively large.
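The three situations described above can be checked with Python's standard `statistics` module. The score lists are made up to mirror the examples in the text, and the sketch assumes the population standard deviation (`pstdev`), which treats the listed scores as the whole group rather than a sample; each list has a mean of 50.

```python
from statistics import mean, pstdev

# Made-up score lists for illustration; each has a mean of 50.
identical = [50, 50, 50, 50]   # everyone scores 50
close = [48, 52]               # scores cluster near the mean
spread = [30, 35, 65, 70]      # scores fall well above and below the mean

for label, scores in [("identical", identical), ("close", close), ("spread", spread)]:
    # pstdev: square root of the average squared distance from the mean
    print(label, mean(scores), round(pstdev(scores), 2))
```

With these lists the standard deviation is 0 for the identical scores, 2 for the closely clustered scores, and about 17.68 for the widely spread scores, even though the mean is the same in every case.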
Achievement versus Ability Tests
The primary difference between an achievement test and an ability test or an aptitude test is that an achievement test measures the extent to which a student can perform certain skills after instruction or training. Both achievement and aptitude tests can measure ability to reason, solve problems, and respond to people, things, and events. However, only an achievement test measures performance after instruction, and ability tests, not achievement tests, typically measure the extent to which a student can develop proficiency in a particular area.
