Evaluation and Measurement Study Guide


positive scatterplot

/ (the data points slope upward from lower left to upper right: as X increases, Y increases)

Source of error

1. The context in which the testing takes place
2. The test taker
3. Specific characteristics of the test itself
Measurement error can be systematic and consistent, as well as random

What are the results of black/white mental testing?

1897: G. R. Stetson tested 500 Black and 500 White public school children in D.C. Four stanzas of poetry were read aloud by the experimenter, and the children were required to repeat them. In this exercise, the Black children outperformed the White children. Why wasn't this publicized? Instead, it was determined that the memory technique was not a valid measure of intelligence. In the early 1900s, Black children were divided into groups for testing based on skin tone "to determine the effect of White blood on intelligence." This testing provided "empirical support" for segregation.

Statistics

A branch of mathematics dedicated to organizing, depicting, summarizing, analyzing, and otherwise dealing with numerical data
- Descriptive statistics
- Inferential statistics
Other meaning of statistic/statistics: refers to measures derived from sample data (e.g., means, standard deviations, correlation coefficients) used to estimate population parameters

Raw scores

A number (X) that summarizes or captures some aspect of a person's performance in the carefully selected and observed behavior samples that make up psychological tests
Has no meaning by itself; a high score is not always the same as a favorable result; can be misleading; needs a frame of reference
The raw score is the observed score: the original, untransformed score

What does the standardization of a test look like?

A standardized test is a test that is administered and scored in a consistent, or "standard", manner. ... Because everyone gets the same test and the same grading system, standardized tests are often perceived as being fairer than non-standardized tests.

the more accurately we can make distinctions that need to be made among them

All other things being equal, the greater the amount of variability there is among individuals, in whatever characteristic we are attempting to assess, the more accurately we can make distinctions that need to be made among them

validity/reliability

Always remember that _____ wants to know what's being tested and ________ wants to know how consistently

validity

An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment, measures accuracy

constant

Anything that does not vary Very few constants in the world Example: Pi 𝝅

variables

Anything that varies Exist everywhere and can be classified in a multitude of ways Discrete or continuous

Who were group tests of intelligence first used for?

Army recruits during World War I. The Army Alpha had 8 subtests measuring verbal, numerical, and reasoning abilities as well as practical judgment and general information; its success led to the Army Beta, for people who could not read. Arthur Otis's invention of objective multiple-choice items led to the first group tests of general intelligence.

short answer

Assess lower-level thinking skills such as memorization and basic knowledge.
Pros
1. Completion or short-answer items are very flexible. They can be used to assess any content area, even if they are somewhat limited by the level of information that can be tested. So it's hard to think of a content area that is not fair game, from engineering to biomechanics to art history to culinary facts.
2. Guessing is minimized. With other types of items, such as true-false or multiple choice, it's much easier to guess. The likelihood of being right on a true-false item (two choices, right?) by chance alone is 50%. The probability of being correct on a completion item by chance alone is a big zero. With completion items, there are no predefined options to select from as a chance effort at getting the item right. This means that the test taker really has to at least have some idea about the content of the material.
3. Short answer and completion are both very attractive item formats for computational items. Using "5 + 5 = ___" is an efficient and straightforward way to find out if the test taker knows the sum of 5 and 5.
4. Short-answer and completion items effectively measure lower levels of cognitive ability. Sure, such items are limited to lower-level cognitive abilities (such as memorization), but that can be an advantage, because if that's what you want to measure, then this is the way to do it well.
5. Completion and short-answer items are relatively easy to write. When you want to assess how well 100 different facts can be recalled and want to write 50 questions about those facts, completion or short-answer items are quick and easy to write. And they can rather easily be written in a nonambiguous and straightforward way, making the test more reliable, since there is less room for error due to poorly written items.
6. Completion items allow for increased item sampling. Remember back in Chapter 3, we pointed out how test reliability increases as the number of items increases? Well, because completion items are easy to construct and test time is usually limited, more of these kinds of items can fit into a standard testing session and provide a broader sample of items that more accurately reflects the overall content being assessed. And with more items, you can sample more content.
Cons
1. There's no machine scoring here. You have to grade these kinds of items by hand, and it can be a tedious and time-consuming task.
2. Scoring can be subjective. Although you may try to create a completion item that has only one correct answer, this is hard, and other answers might (and we stress might) have some correct content. It's your call as the test scorer, and if the item is not well written, it can be a tough one. You also have to consider that spelling and grammatical errors in an answer influence the scoring, as does illegible handwriting. And (again) to complicate matters even more, you may consider giving partial credit for partially correct answers!
3. Advanced types of thinking skills cannot be assessed using short-answer items. As we have mentioned several times, short-answer and completion items work best when the learning objectives you are assessing are basic and focus on memorization and understanding of simple ideas.
4. One correct answer may be the requirement for the best of completion items, but it's tough to write those kinds of completion items. Items such as "1.) The United States Secret Service was created in 1862 to ____. (counteract counterfeiting)," could also have answers such as "guard the integrity of the currency" or "make sure that only genuine currency is used for commerce," and so on. One correct answer, when you have a blank space, is a tough order to fill.

properties of the normal curve

Bell shaped
Bilaterally symmetrical: its two halves are identical
Limits extend to +/- infinity: tails approach but never touch the baseline, a property that underscores the theoretical and mathematical nature of the curve
Unimodal: has a single point of maximum frequency or maximum height
Mean, median, and mode coincide at the center of the distribution, because the point where the curve is in perfect balance (the mean) is also the point that divides the curve into two equal halves (the median) and the most frequent value (the mode)

acceptable reliability coefficient

Between 0.7 and 0.8 is generally considered acceptable
The reliability coefficient is the ratio of true score variance to total test score variance; if all the test score variance were true variance, score reliability would be perfect (1.00)
Its complement (1.00 minus the reliability coefficient) may be viewed as a number that estimates the proportion of the variance in a group of test scores that is accounted for by error stemming from one or more sources
For reliability purposes, correlation coefficients tend to range between .00 and +1.00; the higher the (positive) number, the more reliable the test
Reliability coefficients are always positive

Know the first published useful instrument

Binet-Simon scale

Range of correlation

Correlational methods are the techniques used to obtain indexes of the extent to which two or more variables are related to each other, indexes that are called correlation coefficients
Correlational methods are the major tools we have to demonstrate linkages:
- Between scores on different tests
- Between test scores and non-test variables
- Between scores on parts of tests or test items and scores on whole tests
- Between part scores or item scores and non-test variables
- Between scores on different parts of a test or different items from a single test
In order to compute a correlation coefficient, all we need are the data (i.e., observations) on two variables
The relationship between two variables is said to be linear when the direction and rate of change in one variable are constant with regard to the changes in the other variable; when plotted on a graph, the data points for this type of relationship form an elliptical pattern that is straight or nearly so
If there is a correlation between two variables and the relationship between them is linear, there are only two possible outcomes: positive correlation or negative correlation
If there is no correlation, the data points do not align themselves into any definite pattern or trend, and we may assume that the two sets of data do not share a common source of variance
Did the correlation result from chance? The larger the coefficient, the less likely it is that it could be the result of chance. If the probability that the obtained coefficient results from chance is very small, we can be confident that the correlation between X and Y is greater than 0 and that the two variables share a certain amount of common variance
The larger and the more statistically significant a correlation coefficient is, the larger the amount of variance we can assume is shared by X and Y
CORRELATION ≠ CAUSATION
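The two possible linear outcomes can be made concrete with a minimal Python sketch of the Pearson correlation coefficient, the most common index of linear relationship (the paired data sets here are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of observations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear increasing relationship: r is (approximately) +1
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# A perfectly linear decreasing relationship: r is (approximately) -1
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

Note that the coefficient only describes shared variance; a large r says nothing about which variable causes the other.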

criterion validity

Def: assesses whether a test reflects a set of abilities in a current or future setting as measured by some other test or evaluation tool Two types: - Concurrent validity: if the criterion is taking place in the here-and-now (around the same time or simultaneously); compare with an established measure Ex. achievement tests and certifications or licensing - Predictive validity: if the criterion is taking place in the future Most often used in entrance exams (ex. GRE) and for employment tests Only need one or the other

interval

Equal intervals between units but no true zero, Identity + rank order + equality of units Ex. Fahrenheit and Celsius temperature scales; calendar time

Multiple intelligences

Howard Gardner's model of multiple intelligences, made up of 8 different types of intelligence; - Musical - Bodily-kinesthetic- control of one's bodily movements - Logical-mathematical - Linguistic - Spatial intelligence - Interpersonal - ability to understand others' behavior and interact with other people - Intrapersonal - ability to understand ourselves - Naturalist - identify and understand patterns in nature · Example: Michael Jordan excelled in kinesthetic intelligence, Yo-Yo Ma in musical intelligence · Gardner believed that traditional tests of intelligence do not provide the flexibility needed by the test taker to show skills in broad range

Practice effects

Improvements in performance resulting from opportunities to perform a behavior repeatedly so that baseline measures can be obtained.

Presentation Tests

Intelligence Tests
- Adam Moore, Molly West, Jane Peterson: Stanford-Binet Intelligence Scale, 5th Edition (S-B)
- Michael and Kirsten: Wechsler Adult Intelligence Scale-IV (WAIS-IV)
Achievement Tests
- Jazzy Romans: Peabody Individual Achievement Test (PIAT-R)
- Tori Connolly, Maddie Wood, Mary Andrews: Woodcock Johnson Tests of Achievement, 4th Edition (WJ-IV)
- Edvin Mujic: Wechsler Individual Achievement Test (WIAT-III)
Personality Tests
- Mackenzie Gilliland and Hannah Rogers: Myers-Briggs Type Indicator Instrument Self-Scorable Form M
- Brianna Burns, Allie Christian, Felicia Mobley: Personality Assessment Inventory (PAI)
- Amara Untalan: Minnesota Multiphasic Personality Inventory
Projective Tests
- Amani Miles and Miranda Lewis: Thematic Apperception Test
- Adrian Lorimor, Emma Brooks, Nellvetta Moore: Rorschach Test
- Erica Mullins and Christine Underwood: Kinetic Family Drawing
Vocabulary Test
- Marvea Johnson, Jenson Maydew, Alex Crosser: Peabody Picture Vocabulary Test, 4th Edition (PPVT-4)
Neuropsychological Tests
- Mikey, Alex, and Niki: Wechsler Memory Scale, 4th Edition (WMS-IV)
- Alyssa Collina, Kristina Brooks, Shelby Birchler: Delis-Kaplan Executive Function System (D-KEFS)

nominal

Numbers are used instead of words, Identity or equality, Ex. SSNs; football players' jersey numbers; numerical codes for non-quantitative variables, such as sex or psychiatric diagnoses

ordinal

Numbers are used to designate an orderly series, identity + rank order Ex. Ranking of athletes or teams; percentile scores

What is test bias, example?

Occurs when test scores vary across different groups because of factors that are unrelated to the purpose of the test If one predicts test score based on group membership, then the observed score will either over- or underestimate the test taker's true score If this happens consistently for members of a given group = evidence of test bias The result of an analysis that we can learn to apply Systematic difference in test scores as function of gender - Language use: "comparing apples-and-oranges" would probably be biased against those who have recently immigrated from a country whose native language is not American English

portfolio

Pros
- They are flexible
- They are highly personalized for both the student and the teacher
- They are an attractive alternative to traditional methods of assessment
- They are possibly a creative method of assessment when other tools are either too limiting or inappropriate
Cons
- They are time-consuming to evaluate
- They do not cover all subjects well, nor can they be used with all curriculum types
- They are relatively subjective

interview

Pros
1. They can be rich, detailed, and full of worthwhile information. The primary advantage of the interview is that it provides rich detail that is often impossible to obtain using any other assessment technique. How else can you find out what or who was the inspiration for a great novel or a revolutionary experiment?
2. Interviews are personal. Assessment procedures in general are pretty objective and far removed from personal interaction. Interviews provide the opportunity to "get the inside story" and get to know the people being interviewed as well—both rewarding takes.
Cons
1. Interviews are subjective. Well, sure they are, and that's the nature of the process. In all fairness, to call interviews subjective is to misunderstand the utility of the technique. They are not so much subjective as they are highly individualized and should be used as such.
2. Interviews are time-consuming. No doubt about that, and there's probably no way to get around the expense.
3. Interviews allow little generalizability. This may very well be true, but if one does enough interviews on a circumscribed topic (such as why one entered the social work field) and is careful when assembling and analyzing the data, some generalizability will be present.

essay

Pros
- Lets you know how well a test taker understands ideas and can relate ideas to one another
- Taps into how well test takers can organize and integrate information
- Provides opportunities to demonstrate creativity
- Flexible in form and purpose
- They increase security
- They provide increased flexibility in item design
- They are relatively easy to construct
Cons
- Not all test takers are strong writers; a weak essay doesn't mean they don't know the material
- It is hard to remain neutral while scoring, and essays are hard to score
- They are difficult to write
- They provide an inadequate sampling of subject matter
- They emphasize writing skills over content

Skewness (positive, negative)

Refers to the lack of symmetry; a skewed distribution is asymmetrical - Negative skew: if most of the values are at the top end of the scale and the longer tail extends toward the bottom (Sk<0) - Positive skew: if most of the values are at the bottom and the longer tail extends toward the top of the scale (Sk>0)
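The sign convention on this card (Sk > 0 when the long tail points toward the high end) can be checked with a small sketch of the population Fisher-Pearson skewness coefficient; the data sets below are made up to have a long tail at one end:

```python
def skewness(data):
    """Fisher-Pearson coefficient of skewness (population form):
    the mean of the cubed z-scores."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in data) / n

# Long tail toward the high end of the scale -> positive skew (Sk > 0)
sk_pos = skewness([1, 1, 2, 2, 3, 10])
# Long tail toward the low end of the scale -> negative skew (Sk < 0)
sk_neg = skewness([1, 8, 8, 9, 9, 10])
```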

sampling error

Refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another Whatever construct or behavior a test evaluates is liable to fluctuate in time Some of the constructs and behaviors assessed through tests are either less subject to change, or change at a much slower pace than others Expect less time sampling error in the scores of tests that assess relatively stable traits (vs. states)

What are two types of aptitude tests?

SAT Differential Aptitude Test (DAT)

Criterion referenced

Seek to evaluate the performance of individuals in relation to the actual construct itself Criterion-referenced test interpretation makes use of at least 2 underlying sets of standards - Those that are based on the amount of knowledge of a content domain as demonstrated in standardized objective tests - Those that are based on the level of competence in a skill area as displayed in the quality of the performance itself or of the product that results from exercising the skill

Know which career test is self -administered and scored (the word self is in the answer)

Self-Directed Search (SDS)

What is the Education for All Handicap Act of 1975?

Signed into law by President Gerald Ford in 1975
A statement of affirmation that special needs children have the right to a free and appropriate public education in the least restrictive environment (LRE)
4 purposes:
- Assure that children with disabilities have a free appropriate public education available to them that emphasizes special education to meet unique needs
- Assure that the rights of children with disabilities and their parents are protected
- Assist States and localities to provide for the education of all children with disabilities
- Assess and assure the effectiveness of efforts to educate all children with disabilities

Measures of variability

Stats that describe how much dispersion, or scatter, there is in a set of data Help us to place any given value within a distribution and enhance the description of a data set range, semi-interquartile range, variance, standard deviation
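The measures listed here can be computed with the standard library's statistics module; the score set below is hypothetical:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]

score_range = max(scores) - min(scores)        # range: 7
variance = statistics.pvariance(scores)        # population variance: 4
sd = statistics.pstdev(scores)                 # population standard deviation: 2.0
q1, _, q3 = statistics.quantiles(scores, n=4)  # first and third quartiles
siqr = (q3 - q1) / 2                           # semi-interquartile range: 1.25
```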

Measures of central tendency

Tell us where the bulk of the data can be located as well as the data's most representative or central value mean, median, mode
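A minimal sketch of the three measures on a hypothetical score set:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)      # arithmetic average: 5
median = statistics.median(scores)  # middle value of the sorted data: 4.5
mode = statistics.mode(scores)      # most frequent value: 4
```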

Kurtosis

The amount of dispersion in a distribution - Platykurtic: distributions that have the greatest amount of dispersion, manifested in tails that are more extended - Leptokurtic: distributions that have the least amount of dispersion - Mesokurtic: the normal curve; intermediate degree of dispersion

What is test fairness, example?

The degree to which a test fairly assesses an outcome independent of traits and characteristics of the test taker unrelated to the focus of the test Touches on the very sensitive issue of the use of tests and the social values that underlie such use A question of values and judgment College admissions counselor How much weight should SAT/ACT scores hold in the admissions process? For example, female students tend to score lower than males (possibly because of gender bias in test design), even though female students tend to earn higher grades in college on average (which possibly suggests evidence of predictive-validity bias) To cite another example, there is evidence of a consistent connection between family income and scores on college-admissions exams, with higher-income students, on average, outscoring lower-income students. The fact that students can boost their scores considerably with tutoring or test coaching adds to the perception of socioeconomic unfairness, given that test preparation classes and services may be prohibitively expensive for many students

Be able to apply the discrimination and difficulty index (find this in a video in a powerpoint)

The difficulty index tells us what proportion of test takers got the item correct; the discrimination index is a measure of how effectively an item discriminates between the high and low groups, computed from the difficulty indexes of those two groups.

error score

The hypothetical difference between an observed score and a true score

Premises

Traditionally, the premises in a matching item are designated using numbers, and the options or responses are designated using letters (either capital or lowercase); the premises appear on the left in matching items, the options on the right. Ex: 1. read 2. write 3. sit would be the premises

Know the kinds of variables (ex. Eye color)

Visible variables: sex, eye color Invisible variables: personality, intelligence Small set variables: number of children in a family Large set variables: average income of individuals in a country

computerized adaptive testing

When individuals take CATs, their ability levels can be estimated based on their responses to test items during the testing process; these estimates are used to select subsequent sets of test items that are appropriate to the test takers' ability levels
Shortens test length and testing time significantly
Presents problems related to test security, test costs, and the inability of examinees to review and amend their responses
Used by ETS: tests determine whether you get harder or easier questions based on correct/incorrect answers

Parallel forms reliability

When you want to know if several different forms of a test are reliable or equivalent

Internal consistency reliability

When you want to know if the items on a test assess one, and only one, dimension

test-retest reliability

When you want to know whether a test is reliable over time; AKA time sampling
A must when you are examining differences or changes over time
You must be very confident that what you are measuring is being measured in a reliable way, such that the results you are getting come as close as possible to the individual's true score each and every time
Find the correlation between the two administrations to find out if the test is reliable
The time interval between the two test administrations always has to be specified

interrater reliability

When you want to know whether there is consistency in the rating of some outcome Related to reliability of those administering the test (vs. test instrument) How much do two raters agree on their judgments of the same outcome? Interrater reliability = Number of agreements/ Number of possible agreements
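The agreement formula on this card can be sketched directly; the two rating vectors below are hypothetical:

```python
def interrater_agreement(rater_a, rater_b):
    """Interrater reliability as number of agreements / number of possible agreements."""
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# Two raters judge the same 10 outcomes (1 = behavior observed, 0 = not);
# they agree on 8 of the 10 judgments
rater_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]
agreement = interrater_agreement(rater_a, rater_b)   # 0.8
```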

ratio

Zero means none of whatever is measured; all arithmetical operations are possible and meaningful, Identity + rank order + equality of units + additivity Ex. Measures of length; periods of time; birth order

negative scatterplot

\ (the data points slope downward from upper left to lower right: as X increases, Y decreases)

discrimination index

a measure of how effectively an item discriminates between the high and low groups
Step 1: Grade tests and sort students from high score to low score
Step 2: Group the top and bottom 27% as the high and low performers
Step 3: Calculate the item difficulty for each group, using only students in the top and bottom 27% groups
Step 4: Subtract the bottom group's item difficulty from the top group's item difficulty to get the DISCRIMINATION INDEX
A good index of discrimination ranges from -1.0 to 1.0; anything lower than 0 should be rewritten or discarded. Anything between -1.0 and 0 discriminates negatively, meaning that the lower performers are doing better than the higher performers, so there must be something wrong with the item. 0 to .2 is not doing a good job; .4 to .6 are great discrimination indexes; 1 = PERFECT
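The four steps above can be sketched in Python (a minimal illustration; the 27% cutoff follows the card, and the sample of 10 students with their total scores and item responses is hypothetical):

```python
def difficulty_index(item_responses):
    """Proportion of test takers who answered the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(students, item):
    """students: list of (total_score, item_responses) pairs.
    Step 1: sort students from high to low total score.
    Step 2: take the top and bottom 27% as the high and low groups.
    Step 3: compute the item's difficulty index within each group.
    Step 4: subtract the low group's difficulty from the high group's."""
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    k = max(1, round(len(ranked) * 0.27))
    high = [s[1][item] for s in ranked[:k]]
    low = [s[1][item] for s in ranked[-k:]]
    return difficulty_index(high) - difficulty_index(low)

# 10 students: the top 27% (3 students) all got item 0 right,
# while the bottom 27% mostly got it wrong -> strong positive discrimination
students = [(95, [1]), (90, [1]), (88, [1]), (80, [1]), (75, [0]),
            (70, [1]), (60, [0]), (55, [0]), (50, [1]), (40, [0])]
d = discrimination_index(students, 0)   # approximately 0.67
```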

Standard error of measurement

a measure of how much inconsistency or error one might expect in an individuals observed score - The amount of variability one might expect in an individual's true score if that test is taken an infinite number of times
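The card gives the concept only; the conventional formula (not stated on the card) is SEM = SD × √(1 − r), where r is the test's reliability coefficient. A sketch under that assumption, with hypothetical values:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), where r is the reliability coefficient of the test."""
    return sd * math.sqrt(1 - reliability)

# A scale with SD = 15 and reliability .91 (both values hypothetical)
sem = standard_error_of_measurement(15, 0.91)   # approximately 4.5
```

Note the relationship to the reliability card above: as reliability approaches 1.00, the SEM shrinks toward zero.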

Personality trait

a pattern of thought, emotion, and behavior that is relatively consistent over time and across situations

reliability

a quality of test scores that suggests they are sufficiently consistent and free from measurement error to be useful, trustworthiness

t-score

a standard score based on the standard deviation, expressing how much a result varies from the average or mean; in testing, T-scores are conventionally scaled to a mean of 50 and a standard deviation of 10 (T = 50 + 10z)
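By the usual testing convention, a T-score rescales a z-score to a mean of 50 and a standard deviation of 10. A minimal sketch (the raw-score distribution parameters are hypothetical):

```python
def t_score(raw, mean, sd):
    """Convert a raw score to a T-score: compute z first, then T = 50 + 10z."""
    z = (raw - mean) / sd
    return 50 + 10 * z

t = t_score(115, 100, 15)   # one SD above the mean -> T = 60.0
```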

Personality type

a style of personality defined by a group of related traits

true-false

achievement tests
Pros
1. They're convenient to write and administer. A bunch can be administered in a short amount of time, because they are relatively short.
2. They are very easy to score. If well written, the answers are either correct or incorrect.
Cons
1. How true is true? You read above that one of the things a good true-false item should do is present one idea clearly and unequivocally. Yep, that's the case. But it is a double-edged sword. How true is true?
2. True-false items place a premium on memorization. It's tough to get beyond the most basic levels of knowledge with true-false items without violating some of the write 'em rules we mentioned earlier. Now in some ways, that's okay if your goal is to sample from a universe of items that is knowledge based, but you should recognize this limitation and not think you can easily tap into the higher levels of knowledge about neuroanatomy or higher mathematics using true-false items.
3. It's pretty easy to guess right. Now, this may seem a bit silly, because it's pretty easy (or rather, easier) to guess wrong on a true-false item as well, right? The probability of being right (or wrong) is 50%, and by chance alone, if the test taker selects T or F on a somewhat random basis, the final score will be about 50%. But the odds of guessing right are much higher than for any other type of item we have covered or will cover. As long as you realize this limitation, true-false items can still do a good job. So it would be wise for you to have a standard for passing way, way above 50%.

Item analysis

an in-depth analysis of whether the item did what it was supposed to-- that is, discriminate between those who know the material and those who do not Generate two different indexes for each item: - Difficulty index - Discrimination index Remember, must do item analysis, one item at a time-- not the entire test

cultural validity

an indication of how well the constructs of the theory function across cultures, whether the hypothesized relationships among constructs are similar across cultures, and whether the hypothesized relationships to external criteria are similar across cultures

Distractor

are a special type of alternative. These are the incorrect alternatives whose job it is to distract the test taker into thinking that the incorrect alternative is correct, but not to be so distracting that if the test taker knows the correct answer, he or she cannot identify it. The best alternative responses are plausible but never correct.

mean

arithmetic average - Its actual value may or may not be represented in the data set - The measure of central tendency most influenced by extreme scores

matching

assess a content area—be it history, biology, statistics, or the regulations governing NASCAR races. Like multiple-choice items, they are basically assessment tools that measure an individual's knowledge of a particular subject and the associations between ideas, and they involve selection.
Pros
1. Matching items are easy to score because the format of the premises and responses is straightforward and very clear. The response on a matching item is absolutely unambiguous. Most of the time, test takers enter the letter of a response that matches a premise, but you can also have them draw a line connecting the two items that match. And because these items are so straightforward in their presentation, the test format tends to be reliable.
2. Matching items are easily administered to a large number of test takers. The directions can be very straightforward, and if the items are written following the guidelines we identified above, the testing sessions should go smoothly.
3. Matching items allow for the comparison of ideas or facts. The responses can consist of several different ideas, facts, or observations—all of which need to be compared with one another.
4. As we mentioned earlier, when multiple-choice items need to have the same response, matching items are perfect.
5. Responses are short and easy to read (or at least they should be).
6. Matching items de-emphasize writing ability. This is true, of course, as with all objective (multiple-choice, true-false, etc.) types of items. You just don't have to write well to be able to demonstrate knowledge of the material. And if you are interested in whether a medical student knows the name of the cervical nerves in the spinal column, whether he or she can also write poetry becomes irrelevant.
7. Finally, the matching item format actually decreases the value of guessing. With most multiple-choice items, there's a 25% (1 in 4) or a 20% (1 in 5) chance of getting it right by guessing. When you are matching, there are probably 6 to 10 options for each premise. The odds are then reduced to about 17% (1 in 6) or even 10% (1 in 10).
Cons
1. When matching items are used, the level of knowledge tested is limited. We're dealing mostly with factual information here, so it's pretty easy to test the year in which Gone With the Wind was published (1936), who invented the lightbulb (Thomas Edison), or how much 10^4.5 equals (31,622.78). But if you want a comparison between different theories of development or a discussion of the advantages and disadvantages of socialism, go elsewhere, such as short-answer or essay items.
2. Matching items are useful only when you can generate a sufficient number of options. You may be testing a topic that is so finite and narrow there are not very many attractive options that are incorrect, and it would be difficult to write a good item that fairly assesses the test taker's knowledge of the topic at hand. In other words, you end up short on options.
3. Scoring can be a problem. Although matching terms are presented in an objective framework, it is more difficult to machine score them, because the test taker usually writes directly on the test itself and not on a scoring sheet, such as those used with multiple-choice or true-false tests.
4. If your memory is good, then matching items are for you. One of the criticisms of matching items is that they emphasize memorization of facts rather than higher-order thinking skills. Okay, that's a legitimate concern, but that's also usually why people create such items—to test basic knowledge, with memorization being an important and basic component of much of higher-order learning and understanding.

face validity

claimed to be present if the items on a test appear to adequately cover the content or if a test expert thinks they do Face validity is more like "approval" validity It's the general impression that the test does what one thinks it should Face validity is more or less a social judgment rendered by some outside person without the application of any external criterion (like already discussed) MMPI-2 has strong face validity and the Rorschach has low face validity

Regression

discovered by Galton; a measure of the relation between the mean value of one variable (e.g. output) and corresponding values of other variables (e.g. time and cost). Regression analyses have given us a basis for making predictions about the value of Variable Y, based on knowledge of the corresponding value of Variable X, with which Variable X has a known and significant degree of correlation

Percentage

Reflect the number of correct responses that an individual obtains out of the total possible number of correct responses on a test; the frame of reference is the content of the entire test (%)

construct validity

examines how well a test score reflects an underlying construct The most interesting, ambitious, and difficult of all the validities to establish Constructs are not easily defined concepts so the tests that measure them are difficult to construct and validate

Sampling distributions

hypothetical distributions of values predicated on the assumption that an infinite number of samples of a given size could be drawn from a population

test floor

if a person fails all the items presented in a test or scores lower than any of the people in the normative sample, the problem is one of insufficient test floor

Test ceiling

if a test taker reaches the highest score attainable on an already standardized test, the maximum difficulty level of the test (its ceiling) is insufficient

Percentile

indicates the relative position of an individual test taker compared to a reference group, such as the standardization sample; specifically, it represents the percentage of persons in the reference group who scored at or below a given raw score
- Percentile rank scores are the most direct and prevalent method used to convey norm-referenced test results; readily understood
- Higher percentile scores = higher raw scores in whatever the test measures
- 50th percentile = the median and the group's mean level of performance
Scores reflect the rank or position of an individual's performance on a test in comparison to a reference group; the frame of reference is other people
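The "percentage of persons who scored at or below a given raw score" definition can be computed directly. The reference-group scores here are hypothetical:

```python
# Percentile rank: percentage of reference-group scores
# at or below a given raw score (hypothetical reference group).
def percentile_rank(score, reference_scores):
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100 * at_or_below / len(reference_scores)

group = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
pr = percentile_rank(60, group)  # 5 of the 10 scores are at or below 60
```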

Normal Curve aka Bell curve

its baseline (x axis) shows the standard deviation units; its ordinate (y axis) is not shown because it is not a frequency distribution but a mathematical model of an ideal or theoretical distribution Many chance events, if repeated a sufficiently large number of times, generate distributions that approximate the normal curve

Item response theory

latent trait models; they apply mathematical models to test item data from large and diverse samples and place both persons and test items on a common scale. "Latent trait" reflects the fact that these models seek to estimate the levels of various unobservable abilities, traits, or psychological constructs that underlie the observable behavior of individuals, as demonstrated by their responses to test items

basal age

lowest point on test where test taker can pass two consecutive items of equal difficulty

achievement tests

measure how much someone knows or has learned, how much knowledge he/she has in a particular subject area; the most common tests
· Norm-referenced tests: allow you to compare one individual's test performance with the performance of other individuals; standardized tests accompanied by norms; typically not used for teacher-made tests
· Criterion-referenced tests: refer to a specific level of an individual's performance (a function of mastery in some content domain); focus on mastery of content at a specific level; also called content-referenced or standard-referenced tests. Measure SUCCESS

personality tests

measure various aspects of personality, including motives, interests, values, and attitudes

mode

most frequently occurring value in a distribution Useful primarily when dealing with qualitative or categorical variables

selection items

multiple choice, true/false, matching

What's required for a normative sample?

often used as synonymous with the standardization sample, but can refer to any group from which norms are gathered. Additional norms collected on a test after it is published, for use with a distinct subgroup, may appear in the periodical literature or in technical manuals published at a later date

semi-interquartile range

one half of the interquartile range (IQR), which is the distance between the points that demarcate the tops of the first and third quarters of a distribution IQR: the range between Q1 and Q3; encompasses middle 50% of the distribution

standard deviation

positioned at equal distances along the X-axis, at points that mark the inflections of the curve itself. If you add all the percentages in the areas above the mean, the result equals 50% (same below the mean). The area between +1 and -1 standard deviations is 68.26% (about two thirds of the curve), and the area between +2 and -2 standard deviations is 95.44% (almost the entire curve)

Norm referenced

seek to locate the performance of one or more individuals, with regard to the construct the tests assess, on a continuum created by the performance of a reference group; the frame of reference is always other people

Stem

sets the premise for the question and comes before any of the alternatives appear (for example, in the item "If the hanging wall has moved down, the fault is ___," the stem is everything before the answer choices)

supply items

short answer, essay

What does SAT stand for?

the letters officially stand for nothing today; the test was originally the Scholastic Aptitude Test (later the Scholastic Assessment Test)

Neuropsychology

study of the psychological effects of brain damage in human patients. When soldiers showed patterns of deficits involving problems with abstract thinking and memory, as well as planning and execution of relatively simple tasks, those patterns were known under the rubric of organicity, a synonym for brain damage

Psychological test

systematic procedure for obtaining samples of behavior, relevant to cognitive, affective, or interpersonal functioning, and for scoring and evaluating those samples according to standards
Systematic procedures: planning, uniformity, thoroughness; objective and fair
Samples of behavior: small subsets of a much larger whole; efficient because time-limited
Relevant to cognitive, affective, or interpersonal functioning: samples are selected for their empirical or practical psychological significance; tools
Evaluated and scored: some numerical or category system is applied to test results, according to pre-established rules
Standards: based on empirical data, to apply a common criterion to test results

difficulty index

tells us how many people got an item correct
FORMULA: item difficulty = number of people who answered correctly / number of people who answered
EX. .78 = 78/100; .35 = 35/100
Want the difficulty to fall somewhere between .4 and .6
Difficulty levels below .3 mean that the question is TOO HARD
Difficulty levels above .8 mean that the question is TOO EASY
If the test is taken multiple times, only analyze the first response
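The formula and cutoffs above can be sketched in a few lines; the function names are illustrative only:

```python
# Item difficulty index: proportion of test takers who answered correctly.
def item_difficulty(num_correct, num_answered):
    return num_correct / num_answered

def classify(p):
    # Guideline cutoffs from the text: below .3 is too hard, above .8 too easy.
    if p < 0.3:
        return "TOO HARD"
    if p > 0.8:
        return "TOO EASY"
    return "acceptable"

p1 = item_difficulty(78, 100)  # 0.78
p2 = item_difficulty(20, 100)  # 0.20
```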

projective tests

tests designed to reveal inner aspects of individuals' personalities by analysis of their responses to a standard series of ambiguous stimuli

scorer reliability

the correlation between the sets of scores generated when two judges independently score the same set of tests. Want very high and positive correlations (at least .90)
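Scorer reliability is just a correlation between two judges' score sets. A minimal Pearson correlation sketch, using hypothetical ratings:

```python
import math

# Pearson correlation between two judges' ratings of the same tests
# (ratings here are hypothetical).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

judge_a = [85, 90, 78, 92, 70]
judge_b = [83, 91, 80, 94, 68]
r = pearson_r(judge_a, judge_b)  # want at least .90
```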

range

the distance between two extreme points--the highest and lowest values-- in a distribution Easily computed but very unstable because it can change drastically due to presence of one or more extreme scores

Standardization sample

the group of individuals on whom the test is originally standardized in terms of administration and scoring procedures, as well as in developing the test norms. Data for this group are usually presented in the manual that accompanies a test upon publication

mean

the arithmetic average of a distribution; graphically, the point where the curve is in perfect balance

Content Validity

the property of a test such that the test items sample the universe of items for which the test is designed The extent to which a measure represents all facets of a given social construct Most often used for achievement tests (e.g., SAT, WJ-Achievement, quizzes for this class)

true score

the hypothetical error-free score on the variable; the score an individual would obtain if measurement involved no error

standard deviation

the square root of the variance - Along with the variance, it provides a single value that is representative of the individual differences or deviations in a data set--computed from a common reference point, namely the mean - A gauge of the average variability in a set of scores, expressed in the same units as the score - Quintessential measure of variability for testing as well as many other purposes and is useful in a variety of statistical manipulations

Standard error

the standard deviation of the sampling distribution that would result if we obtained the same statistic from a large number of randomly drawn samples of equal size

Variance

the sum of the squared differences or deviations between each value in a distribution and the mean of that distribution, divided by n. The numerator, the sum of the squared deviations, is abbreviated as the sum of squares (SS). The sum of squares represents the total amount of variability in a score distribution, and the variance represents its average variability
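The definition above translates directly into code; the score set is hypothetical:

```python
import math

# Variance: sum of squared deviations from the mean, divided by n.
def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores; mean = 5
var = variance(scores)             # the average variability
sd = math.sqrt(var)                # the standard deviation
```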

Correlational methods

the techniques used to obtain indexes of the extent to which two or more variables are related to each other; these indexes are called correlation coefficients. Correlational methods are the major tools we have to demonstrate linkages:
- Between scores on different tests
- Between test scores and non-test variables
- Between scores on parts of tests or test items and scores on whole tests
- Between part scores or item scores and non-test variables
- Between scores on different parts of a test or different items from a single test
In order to compute a correlation coefficient, all we need are data (i.e., observations) on two variables

median

the value that divides a distribution that has been arranged in order of magnitude into two halves - If the number of values is odd, the median is the middle value; if n is even, median is the midpoint between the two middle values

multiple choice

used to assess some area of knowledge such as introductory chemistry, advanced biology, the written part of the Red Cross CPR test, the national boards in nursing, automotive mechanics, internal medicine, and so on; objective and flexible
Pros
1. Multiple-choice items can be used to measure learning outcomes at almost any level of instruction. This is the big one, and we have mentioned it before. This allows multiple-choice items to be very flexible and useful anytime you are sure test takers can adequately read and understand the content of the question.
2. They are clear and straightforward. Well-written multiple-choice items are very clear, and what is expected of the test taker is clear as well. There's usually no ambiguity ("How many pages should I write?" "Can I use personal experiences?" etc.) about answering the test questions.
3. No writing needed. Well, not very much anyway, and that has two distinct advantages. First, it eliminates any differences between test takers based on their writing skills. And second, it allows for responses to be completed fairly quickly, leaving more time for more questions. You should allot about 60 seconds per multiple-choice question when designing your test.
4. The effects of guessing are minimized, especially when compared with true-false items. With four or five options, the likelihood of getting a well-written item correct by chance alone (and that's exactly what guessing is) is anywhere between 20% and 25%.
5. Multiple-choice items are easy to score, and the scoring is reliable as well. If this is the case and you have a choice of what kind of items to use, why not use these? Being able to bring 200 bubble scoring sheets to your office's scoring machine and having the results back in 5 minutes sure makes life a lot easier. And when the scoring system is more reliable and more accurate, the reliability of the entire test increases.
6. Multiple-choice items lend themselves to item analysis. We'll talk shortly about item analysis, including how to do it and what it achieves. For now, it's enough to understand that this technique allows you to further refine multiple-choice items so they perform better and give you a clearer picture of how this or that item performed and if it did what it was supposed to do. For this reason, multiple-choice items can be diagnostic tools to tell you what test takers understand and what they do not.
Cons
1. Multiple-choice items take a long time to write. You can figure on anywhere between 10 and 20 minutes to write a decent first draft of a multiple-choice item. Now, you may be able to use this same item in many different settings, and perhaps for many different years, but it's a lot of work nonetheless. And once these new items are administered and after their performance is analyzed, count on a few more minutes for revision.
2. Good multiple-choice items are not easy to write. Not only do they take a long time, but unless you have very good distracters (written well, focused, etc.) and you include only one correct answer, you will get test takers who can argue for any of the alternatives as being correct (even though you think they are not), and they can sometimes do this pretty persuasively.
3. Multiple-choice items do not allow for creative or unique responses. Test takers have no choice in how to respond (A or B or C or D). So if they would like to add anything more or to show what they know beyond what is present in the individual item, they are out of luck!
4. The best test takers may know more than you! Multiple-choice items operate on the assumption that there is only one correct alternative. Although the person who designs the test might believe this is true, the brightest (student and) test taker may indeed find something about every alternative, including the correct one, that is flawed.

criterion contamination

occurs when a criterion measure is influenced by variables other than the one being measured, such as knowledge of the predictor. Examples:
- If you are looking for the predictive validity of real-estate selling, don't use as a criterion sales figures in another area, such as cars or appliances.
- If you have judges rating students on some external criterion, be sure the judges know nothing about the students' previous performance, study habits, attitudes, etc.

Normal curve/standard normal distribution

when the normal curve has a mean of 0 and a standard deviation of 1. In a normal curve, the standard deviation is positioned at equal distances along the X-axis, at points that mark the inflections of the curve itself. If you add all the percentages in the areas above the mean, the result equals 50% (same below the mean). The area between +1 and -1 standard deviations is 68.26% (about two thirds of the curve), and the area between +2 and -2 standard deviations is 95.44% (almost the entire curve)
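The 68.26% and 95.44% figures can be recovered from the standard normal distribution using the error function; this is a general normal-curve identity, not a formula from the text:

```python
import math

# Area under the standard normal curve between -z and +z:
# P(-z < Z < +z) = erf(z / sqrt(2)).
def area_within(z):
    return math.erf(z / math.sqrt(2))

pct_1sd = 100 * area_within(1)  # about 68.3%
pct_2sd = 100 * area_within(2)  # about 95.4%
```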

psychological tests

· A psychological test is a systematic procedure for obtaining samples of behavior; the behaviors sampled by tests are relevant to cognitive, affective, or interpersonal functioning, and the samples are scored and evaluated according to standards

table of specifications

· Grid with either one or two dimensions; a guide to the construction of an achievement test
· Least sophisticated to most sophisticated specifications: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation
· Knowledge level: recall of information, dates, events, and places (e.g., list all the reasons why reliability is important)
· Comprehension: understanding; interpret facts; compare and contrast
· Application: use of information, methods, and concepts; problem-solving
· Analysis level: patterns, hidden meanings
· Synthesis level: use old ideas to make new ones; generalize
· Evaluation level: compare and discriminate between ideas and make choices; assess

intelligence tests

· Intelligence is inferred from the way people behave, and the behavior samples used to make that inference are tests
· IQ = intelligence quotient
· Intelligence: a construct or group of related variables, such as verbal skills, memory, mechanical skills, comprehension, and more, with some theoretical basis

Variability

· The range is the difference between two extreme points, the highest and lowest values in a distribution; the larger the difference, the more spread out the data
· Semi-interquartile range: one half of the interquartile range (IQR), the distance between the points that demarcate the tops of the first and third quarters of a distribution. The 25th percentile is the first quartile (Q1) of the distribution; the 75th percentile is the third quartile (Q3), at the top of the third quarter, marking the beginning of the top quartile. So the interquartile range is the distance between Q1 and Q3 and encompasses the middle 50% of the distribution. Example: if the 25th percentile is a score of 37 and the 75th percentile is 44, the interquartile range is 44 - 37 = 7, and the semi-interquartile range is 7 / 2 = 3.5.
· The variance is the sum of the squared differences or deviations between each value (X) in a distribution and the mean of that distribution (M), divided by n
· The standard deviation is the square root of the variance
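The worked example above (Q1 = 37, Q3 = 44) can be checked in a couple of lines:

```python
# Semi-interquartile range from the example quartiles above.
q1, q3 = 37, 44
iqr = q3 - q1        # interquartile range: 44 - 37 = 7
semi_iqr = iqr / 2   # semi-interquartile range: 7 / 2 = 3.5
```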

What is the Flynn effect?

· Scores on IQ tests have risen from one generation to the next
· A 1994 study found that scores on IQ tests over the past 60 years increased from one generation to the next, by between 5 and 25 points
· Possible reasons for this upward trend in test scores (i.e., the increase in IQ): people may be becoming smarter, increased education and exposure to new ideas, better nutrition
· If scores continue to increase, the tests will have to be re-normed or re-standardized every few years so that standards remain consistent and accurate across all test takers, which is expensive, time-consuming, and controversial

