PSYCH 469 TERMINOLOGY

MASTERY OF OBJECTIVES

CH. 10

WECHSLER SCALES

CH. 11 PG. 390

WOODCOCK-JOHNSON PSYCH. ED. BATTERY 3RD EDITION

CH. 11 PG. 393

GENERAL APTITUDE TEST BATTERY [GATB]

CH. 11 PG. 411

BELL CURVE

CH. 11 PG. 414 NAME OF A BOOK

META-ANALYSES

CH. 11 PG. 415

PREDICTIVE

CH. 11 PG. 416

TAXONOMIC

CH. 11 PG. 416

VALIDITY GENERALIZATIONS

CH. 11 PG. 416

FORM ALPHA [VERBAL]

CH. 11 PG. ????

FORM BETA [PERFORMANCE]

CH. 11 PG. ????

IMPLICIT ATTITUDES

CH. 11 p. 369-371 A limitation of attitude assessment: attitudes that influence a rater's feelings and behaviours at an unconscious level

DISCRIMINATION FACTOR

CH. 11 p. 340 Deals with letters of recommendation. The ability of the reviewer [i.e. potential employer] to discriminate between the characteristics described and those relevant to the job.

RATING SCALE

CH. 11 p. 341 Developed to obtain and quantify information on individuals based on a common set of traits expected of all individuals.

GENEROSITY ERROR

CH. 11 p. 348 Unwillingness to give low rating

VALIDITY

One of the three major categories of validity in which the focus is on whether the instrument's content adequately represents the domain being assessed.

PRODUCE-RESPONSE ITEMS [Content Related Evidence of Validity]

One of the two categories [the other is SELECT-RESPONSE ITEMS] into which test items can be classified. -Examinees produce their own answers; sometimes called SUPPLY-RESPONSE or CONSTRUCTED-RESPONSE ITEMS

face validity

the degree to which test items appear to be directly related to the attribute the researcher wishes to measure

chronological age

the number of months or years since an individual's birth

SCALES OF MEASUREMENT [for scores]

-Nominal Scale: verbal label -Ordinal Scale: ranking -Interval Scale: equal distances between scale points -Ratio Scale: a true zero point, so scores can be compared as ratios

Reliability expressed in two ways:

1.Standard error of measurement 2.Reliability co-efficient
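The two are directly related: the standard error of measurement can be obtained from the reliability coefficient and the standard deviation of the observed scores. A minimal sketch in Python, with invented scores and an assumed reliability value for illustration only:

```python
import numpy as np

# Hypothetical observed scores for a group of examinees (illustrative only).
scores = np.array([85, 92, 78, 88, 95, 70, 82, 90, 76, 84], dtype=float)

reliability = 0.90                 # assumed reliability coefficient for this test
sd = scores.std(ddof=1)            # standard deviation of observed scores

# Standard error of measurement: SEM = SD * sqrt(1 - r_xx)
sem = sd * np.sqrt(1 - reliability)

print(f"SD = {sd:.2f}, SEM = {sem:.2f}")
# A roughly 68% confidence band around an observed score X is X +/- 1 SEM.
```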

mental retardation

A condition of limited mental ability, indicated by an intelligence score below 70 and difficulty in adapting to the demands of life

Split-half reliability

A measure of the correlation between test taker's performance on different halves (e.g., odd and even numbered items) of a test

CH. 12

APTITUDE TESTS

INFERENCE

An INFERENCE is judged to be valid when there is sufficient rational and empirical evidence supporting it and evidence supporting conflicting inferences is lacking -An inference is considered biased when it is 'differentially valid' for different groups.

CRITERION

A specified measure of success for given constructs/traits
CRITERION PROBLEM:
-Which criterion measure is best/most valid as a predictor of future performance?
-It is the part of the definition of empirical validity that refers to a 'suitable criterion measure', which is the most difficult thing to determine:
-Some criteria are difficult to measure/quantify i.e. relationship with parents, clients
-Influenced by factors outside the control of the individual; student home-life, health, nutrition
-Effectiveness of equipment, technology or lack thereof
-All criterion measures are partial in the sense that they measure only a part of success on the job or in academics
SOLUTIONS:
-Subjective: rating scales [can be erratic; dependent on the person giving the rating]
-Tests of proficiency i.e. a university entrance english/math test used to validate/predict performance on a comprehensive university english/math exam
-Using the average of grades in some educational or training program i.e. tests for selection of engineers, lawyers, welders based on educational or training program scores
QUALITIES IN A CRITERION MEASURE [in order of importance]
1. RELEVANCE -Corresponds to the content validity of the criterion measure and the behaviours/trait it is to measure -Rely on professional judgment to determine whether the criterion is relevant
2. BIAS FREE -Each capable participant has the same opportunity to succeed or achieve the same score without regard to wealth, socio-economic status, gender, sex, race, religion etc. -A criterion measure that contains substantial bias cannot at the same time reveal relevant differences among people on the trait being measured
3. RELIABILITY as it applies to the criterion score -Must be reproducible/stable if it is to be used as a predictive tool/measurement. The tool must give the same results regardless of who administers it and when i.e. Teacher A should get the same evaluation from Supervisor A as from Supervisor B
4. AVAILABILITY/CONVENIENCE -How long to get the criterion score -How long to get the criterion tool -$$$ cost in dollars or disruption of routine

PUBLISHED CURRICULAR TESTS

CH. 7

TESTING ISSUES [Cost/Time/Ease of Scoring & Using]

ECONOMY
1. Cost: -Economy depends on the cost of test materials and scoring services per examinee and on reuse of test materials i.e. reusable test booklets, or use of scheduling to allow maximum users in the least amount of time
2. Time/Savings: -Can be a false economy, as the reliability of a test depends largely on its length -Reduction in testing time often comes at the cost of precision and breadth of appraisal unless a computer adaptive test is used, BUT you have to have enough computers i.e. provincial exams -Time needed to score/evaluate tests
3. Ease of Scoring: -Scoring by hand is burdensome/time-consuming [Stanford-Binet & Wechsler are scored only by hand] -Scoring by a psychologist is expensive [teacher does WIAT; psychologist does WISC] -Scoring by publishers i.e. FSAs [Foundation Skills Assessment gr. 3, 7] -Option of scoring by hand or sending to test-scoring services
COMPUTER SCORING
-Provides various stats [means, modes, medians SEE CH. 3] for individual schools and the overall district
-Scoring rubric
-Potential test users should check what type of scoring services are provided i.e. software
-Some test authors provide computerized test interpretations of scores as well as a narrative. Protects the examinee from bias or human error caused by lack of tester experience. Ensures all aspects are covered, uniformity, consistency. Solves the time factor. MUST be reviewed by a counsellor or clinician who knows other pertinent data about the client
FEATURES FACILITATING TEST ADMINISTRATION / EASE OF TESTING
1. Clear, full instructions, with plenty of practice items to reduce bias against those who are unfamiliar with the testing situation [task unfamiliarity]
2. Few separately timed units, and closing/ending time is not crucial. Errors in timing can occur, skewing results/interpretations and even recommendations
3. Page layout of test items [visually crowded/confusion about which diagram etc. goes with which question] can adversely affect scores/skew results
FEATURES FACILITATING INTERPRETATION AND USE OF SCORES [SEE CH. 6]
1. Statement of the functions the test is designed to measure and the general procedures by which it was developed -Rationale: an achievement test's primary concern is measuring specific content areas and cognitive processes. The manual should tell how the content and the analysis of the functions being measured were determined
2. Detailed instructions for test administration -Poor instructions may impact validity, scores, interpretation, and use of scores
3. Scoring keys/instructions -If the test must be scored locally/manually, clear and detailed instructions should be provided re: errors, scores to be computed, scoring rubric -Electronic scoring [on-line too] needs clear instructions
4. Norms for appropriate reference groups: responsibility of the test producer to develop suitable norms following accepted validation/reliability methods
5. Evidence of test reliability. Include: a) reliability statistics b) operations used to obtain reliability estimates c) descriptive/statistical characteristics of each group on which reliability data are based d) if the test is available in more than one form, the correlations between forms and any data derived from a single testing e) if the test yields part scores, reliability data for those scores f) standard error of measurement -If given at each of a number of score levels, this is good as it shows the range of scores over which the test retains its accuracy g) reliability coefficients h) intercorrelations of subscores

CONTENT-RELATED VALIDITY: CRITERION-REFERENCED TESTS

Primarily a matter of professional judgment and subject experts. Criterion-referenced tests are concerned with the measurement of particular, narrowly defined instructional objectives. Because of this, the critical concern in assessing validity is CONTENT-RELATED evidence. Includes 2 elements:
1. Information content of the domain
2. What the individual should be able to do with it
-Relates to the test score and all factors that affect it, including: -Clarity of directions -Adequacy of scoring procedures
TESTING CONTENT VALIDITY OF CRITERION-REFERENCED TESTS
-To appraise validity of content, use an empirical or statistical procedure that maximizes inter-group differences, separating those who have achieved mastery from those who have not, and minimizes intra-group differences. WHY? Because the ideal mastery-oriented test should divide the group into two homogeneous groups of individuals: those who have and those who have not achieved mastery
APPROACHES TO DEFINING MASTERY
1. A common approach is for the teacher to identify those students he/she is certain have achieved mastery and those he/she is certain have not -This method yields two clear groups BUT the middle pupils must be omitted
2. Compare those who have received instruction on the concept to those who have not
3. Compare the performance of students pre/post instruction
NOTE: -Each method has limitations affected by the unreliability of difference or change scores -Any one of these comparisons provides some evidence of the joint validity of the instructional content being measured and its mastery
VALIDATION PROCESS
1. Examinees divided into two groups: masters vs non-masters OR instructed vs non-instructed
2. Proportion in each group correctly answering each item is determined
3. The valid items are those on which the master group scored higher than the non-master group. In the extreme case the masters should get 100% while the non-masters should experience a chance rate of success
-The total test should divide the examinees into masters [all who achieve perfect/near-perfect scores] and non-masters [all who score at about the chance level]
-In the ideal test, there should be no misclassification
-Mastery is seldom all or nothing
-In any class there will be varying degrees of mastery
PRO OF CRITERION MASTERY TESTING -We know in theory what students who have mastered can do -Knowing this is an essential part of formative assessment
CON OF CRITERION MASTERY TESTING -Deciding on an acceptable definition of mastery
But there are many situations, both in school and elsewhere, where the question asked is one of relative position [ranking]. Summative assessments often call for norm-referenced tests that focus explicitly on relative performance
*******************************************************
-One of the three major categories of validity in which the focus is on whether the instrument's content adequately represents the domain being assessed. Evidence of content-related validity is particularly important in achievement tests.
-Judgment by experts of the degree to which items, tasks, or questions on a test adequately represent the construct

True score

The score an individual would obtain if there were no errors of measurement; the observed score minus its error component

analytical intelligence

assessed by intelligence tests, which present well-defined problems having a single right answer

fluid intelligence

the aspect of intelligence that involves the ability to see complex relationships and solve problems

formal assessment

the systematic procedures and measurement instruments used by trained professionals to assess an individual's functioning, aptitudes, abilities, or mental states

HETEROTRAIT MONOMETHOD CORRELATIONS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

HETEROTRAIT-MONOMETHOD CORRELATIONS: the values in the cells labeled "b" are the correlations between different traits assessed by the same measurement method. These correlations reflect both the natural relationships between the constructs and the fact that they have been measured by the same method, and they should be substantially smaller than the reliabilities.
*********************************************
THE VALIDITY DIAGONALS (monotrait-heteromethod): correlations between measures of the same trait measured using different methods. Since the MTMM is organized into method blocks, there is one validity diagonal in each method block. For example, look at the A1-A2 correlation of .57. This is the correlation between two measures of the same trait (A) measured with two different methods (1 and 2). Because the two measures are of the same trait or concept, we would expect them to be strongly correlated.

Constant error

- The amount of error is independent of the size of the measurement e.g. measuring distance with a tape measure for which the hook at the end of the tape is not truly at zero. If you use this feature then the misplaced hook will cause the same amount of error regardless of the distance measured. - Correct constant error by subtracting the error (observed value - constant error)

Test-Retest Reliability

-A measure of the correlation between the scores of the same people on the same test given on two different occasions
-Used to evaluate the error associated with administering a test at two different times
-Of value only when we measure "traits" or characteristics that do not change over time
PROS: -Easy to evaluate: administer the same test on two well-specified occasions and then find the correlation between scores from the two administrations
CONS: -Beware of a carryover effect: the first test session influences scores on the second session -Practice effect: improving with practice -Neglects variation arising out of the sampling of items -Whenever all testing is done at one time, day-to-day variation in the individual is omitted
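A test-retest coefficient is simply the Pearson correlation between the two administrations. A minimal sketch with made-up scores for the same ten examinees:

```python
import numpy as np

# Hypothetical scores for the same ten examinees on two occasions (illustrative only).
time1 = np.array([55, 62, 47, 70, 66, 51, 59, 73, 44, 68], dtype=float)
time2 = np.array([57, 60, 50, 72, 64, 49, 61, 75, 46, 65], dtype=float)

# Test-retest reliability = Pearson correlation between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_test_retest:.2f}")
```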

META-ANALYSIS

-A statistical technique for combining the results of many studies on a particular topic, even when the studies used different data collection methods.
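One common way to combine studies is an inverse-variance weighted average of their effect sizes (a fixed-effect model); this is only one of several meta-analytic approaches, and the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical effect sizes (already converted to a common metric)
# and their sampling variances from five studies (illustrative only).
effects = np.array([0.30, 0.45, 0.25, 0.38, 0.50])
variances = np.array([0.010, 0.020, 0.015, 0.008, 0.025])

# Fixed-effect meta-analysis: weight each study by the inverse of its variance.
weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled effect = {pooled:.3f} (SE = {pooled_se:.3f})")
```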

REGRESSION LINE

-A straight line that describes how a response variable y changes as an explanatory variable x changes

Coefficient Alpha [reliability coefficient]

-Coefficient Alpha is the formula used to estimate internal consistency reliability -Most commonly reported due to its ease of use -Purpose is to account for, and thus minimize, the variation among the possible split-halves in the Spearman-Brown coefficient. -The value given by "alpha" is the average of all the possible Spearman-Brown corrected split-half correlations, so it evens out the possible effects of an uneven split of the test into its two halves. -This form of reliability is used to judge the consistency of results across items on the same test. Essentially, you are comparing test items that measure the same construct to determine the test's internal consistency. When you see a question that seems very similar to another test question, it may indicate that the two questions are being used to gauge reliability. Because the two questions are similar and designed to measure the same thing, the test taker should answer both questions the same, which would indicate that the test has internal consistency.

Internal Consistency Reliability using Spearman-Brown to find the Coefficient Alpha [reliability coefficient]

-Coefficient Alpha is the formula used to estimate internal consistency reliability -Purpose is to account for, and thus minimize, the variation among the possible split-halves in the Spearman-Brown coefficient. -The value given by "alpha" is the average of all the possible Spearman-Brown corrected split-half correlations, so it evens out the possible effects of an uneven split of the test into its two halves. -This form of reliability is used to judge the consistency of results across items on the same test. Essentially, you are comparing test items that measure the same construct to determine the test's internal consistency. When you see a question that seems very similar to another test question, it may indicate that the two questions are being used to gauge reliability. Because the two questions are similar and designed to measure the same thing, the test taker should answer both questions the same, which would indicate that the test has internal consistency.
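A minimal sketch of the computation from an item-score matrix (people x items), using the standard alpha formula k/(k-1) * (1 - sum of item variances / variance of total scores); the item scores are invented for illustration:

```python
import numpy as np

# Hypothetical item scores: rows = examinees, columns = items (illustrative only).
items = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
], dtype=float)

k = items.shape[1]                          # number of items
item_variances = items.var(axis=0, ddof=1)  # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores

# Coefficient (Cronbach's) alpha.
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Coefficient alpha = {alpha:.2f}")
```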

MINIMUM LEVEL OF RELIABILITY

-Generally, .80 reliability is the minimum that test publishers adhere to -Based on comparison to similar tests and their reliability coefficients -Can't state an absolute minimum level, but we CAN state the minimum level needed to achieve specified levels of accuracy in describing a group or individual -At low levels of reliability there is a higher chance that the order of two individuals will be reversed, because both measurements largely reflect random error -.50 reliability = 1 in 3 chance of order reversal -.90 reliability = 1 in 12 chance of order reversal -A test with low reliability can be used with large groups to make useful studies/generalizations about the group, but CANNOT be used to speak with confidence about the individuals in that group

Reliability coefficient

-Indicates how consistently a test places the individual's score in relationship to others in the group -Variability of observed scores that is explained by variability of true scores: the ratio of true score variance to observed score variance. The closer to 1.00, the more reliable; the closer to 0, the more the variation is due to random error rather than the test taker's performance (inaccurate). Can be read as a percentage: .99 is a 99% reliability coefficient, meaning little is left for error.

CRITERION REFERENCED TESTS: EXAM CONDITIONS

-If there is not enough testing time [then more incomplete items and increased guessing], speed of performance becomes part of the test content -If administration conditions are poor, they become part of the test content for the examinees -These factors vary from one group to another; thus the content also varies

Split-Half Test/ Subdivided Test

-Parallel type of test using only one test that has been divided in half with an equal distribution of categories and difficulty levels -Often odd-numbered items form one half and even-numbered items the other; the premise is that similar items/difficulty levels are usually grouped together -Best used with tests of 60 items or more, as this tends to even out variations in form, content covered, and difficulty level -Remember, this method divides the test in half only for scoring, not for administering; you will still have to administer the two halves for equal lengths of time

SELECTION RATIO

-Ratio of [job] openings to [job] applicants
-Index ranging from 0 to 1 that reflects the ratio of positions to applicants; calculated by dividing the number of positions available by the number of applicants
-A low selection ratio is preferred, as it allows an employer to be more selective/raise the predictor cutoff score
-A selection procedure is most beneficial when there are far fewer positions than applicants i.e. 1 posting / 10 applicants rather than 9 postings / 10 applicants
-The higher the validity correlation/coefficient, the higher the accuracy of the selection predictor
OVERALL VALUE OF A SELECTION PROCEDURE DEPENDS ON
1. How success is defined
2. Selection rules used
3. Value given/assigned to what has been defined as 'success'
4. Cost of selecting and misidentifying/accepting someone who subsequently fails
5. Cost of missing a candidate who could have succeeded
The greatest total gains from using a selection procedure occur in those situations where the rate of successful criterion performance in the population is close to 50% [.50]

ITEM RESPONSE THEORY [IRT]

-Tells us that we only get information about an examinee's position on the trait when there is a level of uncertainty about whether he or she will pass the item -Another way of looking at reliability: examining each item for its ability to discriminate as a function of the construct being measured. An extension of classical test theory, which looks at the amount of error in the total test; IRT looks at the probability that individuals will answer each item correctly (or match the quality being assessed). That is, each item is assessed for its ability to measure the trait being examined
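The core idea can be sketched with one simple IRT model, a one-parameter (Rasch-type) item response function, where the probability of a correct answer depends on the gap between a person's trait level theta and the item's difficulty b; the values below are made up for illustration:

```python
import numpy as np

def p_correct(theta, difficulty):
    """Rasch-type item response function: P(correct | theta, difficulty)."""
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))

item_difficulty = 0.0                       # an item of average difficulty (assumed)
for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]:   # examinees of increasing ability
    print(f"theta = {theta:+.1f}  P(correct) = {p_correct(theta, item_difficulty):.2f}")

# The item is most informative near theta = difficulty, where P(correct) is about .50,
# i.e. where there is the greatest uncertainty about whether the examinee will pass the item.
```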

RANK SETS [ORDINAL SCALE]

-Tells us the order in which people stand, but not how much of the trait they have in relation to others -The differences between the numbers do not have an equal meaning, even when the measured distance is equal i.e. the difference between ranks 1 and 5 has quite a different meaning than the difference between ranks 50 and 55. Both span only a few units of rank, but normally a rank difference at the extreme of the group corresponds to a larger difference in the trait than the same rank difference in the middle of the group.

Split-Half Reliability Coefficient [AKA Odd-Even Reliability Coefficient]

-The correlation between the scores derived from the two halves of the split-half test measures how consistently the test is measuring what it is meant to measure at that particular point in time -Purpose is to estimate, from a single administration, the correlation that would be obtained between two equivalent forms of the test given at the same time PROS: Uses only one test; fewer resources and less time CONS: The half-length test coefficient does not equal the full-length test coefficient; it is generally not as accurate a measurement [the larger the sample of behaviours, the greater the reliability of the measure]
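A minimal sketch of an odd-even split followed by the Spearman-Brown correction, 2r/(1+r), to estimate full-length reliability; the item scores are invented for illustration:

```python
import numpy as np

# Hypothetical item scores: rows = examinees, columns = items (illustrative only).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
], dtype=float)

odd_half = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # half-length correlation
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown corrected, full length

print(f"Half-test correlation = {r_half:.2f}, full-test estimate = {r_full:.2f}")
```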

REGRESSION EQUATION

-The equation representing the relation between selected values of one variable (x) and observed values of the other (y) -An equation from which one can predict scores on one variable from one or more other variables
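A minimal sketch of fitting a least-squares regression line and using it to predict a criterion score from a predictor score; the x and y values are invented for illustration:

```python
import numpy as np

# Hypothetical predictor (x, e.g. an aptitude test score) and criterion (y, e.g. a job rating).
x = np.array([50, 55, 60, 65, 70, 75, 80], dtype=float)
y = np.array([2.1, 2.4, 2.3, 3.0, 3.2, 3.6, 3.9])

# Least-squares fit of y_hat = intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)

new_score = 68.0
predicted = intercept + slope * new_score
print(f"y_hat = {intercept:.2f} + {slope:.3f}*x; prediction for x = 68: {predicted:.2f}")
```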

SELECT-RESPONSE ITEMS

-Those items for which students select their answers from several choices -Test items in which the respondent can select from one or more possible answers, without requiring the scorer to interpret their response

FACE VALIDITY

-Used to say that a test is acceptable to a learner, in that it meets the learner's expectations of what a test should be like. -An unscientific form of validity demonstrated when a measurement procedure superficially appears to measure what it claims to measure -That quality of an indicator that makes it seem a reasonable measure of some variable. That the frequency of attendance at religious services is some indication of a person's religiosity seems to make sense without a lot of explanation.

VALIDITY / GENERALIZATIONS

-Validity is the most important characteristic of a test
-Test constructor and user have a responsibility to push for validity
THREE TYPES OF VALIDITY
1. Content-related validity reveals the relationship between the test content and the proposed interpretation -If the correspondence, which is largely a matter of expert judgment, is good, then the proposed use is supported
2. Empirical/criterion-related validity is obtained from the relationships of test scores with other variables of interest, particularly occupational and educational outcomes -Substantial correlations between test scores and measures of performance in the outcome task support the use of the test for prediction, selection, placement
3. Construct-related validity comes from the correspondence of test scores to deductions from theory. Content and criterion-related evidence help determine fit to predictions from theory, BUT responses to experimental interventions and agreement with predicted patterns of relationships with other variables are also important
*********************************************************
Measures of validity include: ____ (does it cover a representative sample?), _____ (degree to which the test measures a theoretical trait), and ____-_____ (how well does it predict individual performance?)
-The ability of a screening instrument to predict performance in a job or setting different from the one in which the test was validated
-The SAT has 3 [validity generalizations]: 1. SAT and HSGPA correlate about equally with FYGPA 2. The combination of SAT and HSGPA is better than either alone in predicting FYGPA 3. Examination of the original data indicates the correlations are about the same for different racial/ethnic groups
-The higher the validity correlation/coefficient, the higher the accuracy of the selection predictor

SCALED SCORE

-Weighted score used when sub scores hold different weights when making up total score. Ex: Individual assignments have different scaled scores to make final grade -A common numerical range to which all candidates' raw scores map for comparison. Scaled scores are ordinal indexes that allow for different candidates to be compared; thus two candidates who receive a scaled score of 186 both have the same amount of knowledge, though they may have answered different test questions. A well-known example is the 1600 (now 2400) point scale used by the SAT exam.

RELIABILITY FACTORS

1. Variability of the group: the less shifting of positions within the group from Form A to Form B or from test to retest, the higher the reliability coefficient
2. The ability level of the group on the trait being tested: the greater the variability in ability levels within the sample group [larger sample & wider range of ability levels], the higher the reliability. A more heterogeneous group, rather than a homogeneous one, increases reliability
3. Length of the test: as with the split-half reliability coefficient, test reliability increases with the length of the test (see the sketch after this list) -Factors when lengthening a test: time, resources, testee fatigue, ability to create comparable/good test items -Law of diminishing returns: when a test is already lengthy, it takes considerably more items to produce a noticeable gain -Increasing the number of raters is one special type of lengthening
4. Operations used for estimating reliability [different operations yield different coefficients]
5. Student opinion: if students don't feel the test is reliable i.e. measuring what it is supposed to measure, then some students will deliberately perform badly or not take the test seriously
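The effect of lengthening a test (point 3, including the law of diminishing returns) can be sketched with the general Spearman-Brown prophecy formula, r_k = k*r / (1 + (k-1)*r), where k is the factor by which the test is lengthened; the starting reliability of .70 is an assumed illustrative value:

```python
# Spearman-Brown prophecy: estimated reliability after lengthening a test by factor k.
def prophecy(r, k):
    return (k * r) / (1 + (k - 1) * r)

r_original = 0.70                     # assumed reliability of the original test
for k in [1, 2, 3, 4, 6, 8]:          # lengthening factors
    print(f"k = {k}: estimated reliability = {prophecy(r_original, k):.3f}")

# Gains shrink as the test gets longer: 0.70 -> 0.82 (k=2) -> 0.88 (k=3) -> 0.90 (k=4),
# illustrating the law of diminishing returns.
```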

INTERPRETIVE INFERENCE [Unified View of Validity]

1. Interpretive Inference: a statement of what test scores mean/our interpretation of the scores -Validation involves bringing together multiple lines of evidence to support certain interpretations of test scores, while showing that other interpretations are less plausible -Assigning meaning to test scores carries certain value implications i.e. an IQ test and its number

Inconsistency/ Variation of sources

1.Person may have changed from first to second testing due to environmental/ personal/physical reasons 2.Task could have been different for the two measurements i.e. although throwing ball the examiner could have given practice throw one time and not second time; ball could have been more inflated 3.Limited sample of behaviour results in unstable and/or unreliable results 4.Changes in the individual's speed of work

Variation in reliability caused by:

1.Trial to trial error in individual's response to a task at a particular point in time i.e. throwing ball 2.The individual from one time to next i.e. environment; too hot/cold, noisy, migraine,stress 3. The task itself i.e. weight, feel, brand of given ball

Parallel Forms of Testing

1. Used to determine variation within the task, when there is concern with the task itself [see Variation in Reliability #3]
2. Different versions of a test used to assess test reliability. An alternate form of the test measuring the same behaviours [domain], with the same difficulty level, the same question types [similar questions are asked] and the same number of each question type
3. Given successively if unconcerned about stability over time; given at an interval if stability is a concern
4. Most rigorous/reliable is the test given with an interval, because all three sources of score variation [trial to trial; individual; task] can affect scores
PROS: -Most rigorous, reliable. Test results are a reliable indicator for generalizing about what the student will do on similar tasks in the future -Parallel forms with a time interval between tests permit all sources of instability to have their effect; in essence they are presumed to balance each other out
CONS: -Have to have an alternate test readily on hand; time-consuming; administration of the test can place a burden on available resources

ACTION INFERENCE [Unified View of Validity]

2.Action Inference: The appropriateness and utility of test scores as the basis for some specific actions such as applied decision making -Validation requires both evidence of score meaning and evidence of the appropriateness and usefulness of test scores for particular applied purposes.

SUBDIVIDED TESTs [Split-Half Tests]

A procedure used to obtain a reliability estimate from a single administration of a test: the whole test is administered to the entire group and then divided into two presumably equivalent (parallel) half-tests for comparison -The correlation between the two separate half-length tests is used to estimate the reliability of the whole test USE WHEN: -Tests that measure more than one trait are being developed, because items measuring each trait must be equally present in each half-test.

CONTENT-RELATED VALIDITY [SEE TEST BLUE PRINT FOR MORE DETAILS]

An assessment of whether a test contains appropriate content and requires that appropriate processes be applied to that content. -To specify the contents and processes to be explicitly measured requires a TEST BLUEPRINT ********************************************* -Judgment by experts of the degree to which items, tasks, or questions on a test adequately represent the construct

MULTI-TRAIT MULTI-METHOD ANALYSIS [MTMM] [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

An explicit method for studying the patterns of high and low correlations among a set of measures, suggested by Campbell and Fiske [1959]. -Requires several different methods to measure each of several different traits
*******************************************
PRINCIPLES OF INTERPRETATION
Now that you can identify the different parts of the MTMM, you can begin to understand the rules for interpreting it. You should realize that MTMM interpretation requires the researcher to use judgment. Even though some of the principles may be violated in an MTMM, you may still wind up concluding that you have fairly strong construct validity. In other words, you won't necessarily get perfect adherence to these principles in applied research settings, even when you do have evidence to support construct validity. Interpreting an MTMM is a lot like a physician's reading of an x-ray: a practiced eye can often spot things that the neophyte misses. A researcher who is experienced with MTMM can use it to identify weaknesses in measurement as well as to assess construct validity.
To make the principles more concrete, imagine a study of sixth-grade students measuring three traits or concepts: Self Esteem (SE), Self Disclosure (SD) and Locus of Control (LC), each measured three different ways: a Paper-and-Pencil (P&P) measure, a Teacher rating, and a Parent rating. The results are arrayed in the MTMM. As the principles are presented, try to identify the appropriate coefficients in the MTMM and make a judgment yourself about the strength of the construct validity claims.
BASIC PRINCIPLES/RULES FOR MTMM
1. Coefficients in the reliability diagonal should consistently be the highest in the matrix. That is, a trait should be more highly correlated with itself than with anything else. This is uniformly true in our example.
2. Coefficients in the validity diagonals should be significantly different from zero and high enough to warrant further investigation. This is essentially evidence of convergent validity. All of the correlations in our example meet this criterion.
3. A validity coefficient should be higher than the values lying in its column and row in the same heteromethod block. In other words, (SE P&P)-(SE Teacher) should be greater than (SE P&P)-(SD Teacher), (SE P&P)-(LC Teacher), (SE Teacher)-(SD P&P) and (SE Teacher)-(LC P&P). This is true in all cases in our example.
4. A validity coefficient should be higher than all coefficients in the heterotrait-monomethod triangles. This essentially emphasizes that trait factors should be stronger than method factors. Note that this is not true in all cases in our example. For instance, the (LC P&P)-(LC Teacher) correlation of .46 is less than (SE Teacher)-(SD Teacher), (SE Teacher)-(LC Teacher), and (SD Teacher)-(LC Teacher) -- evidence that there might be a methods factor, especially on the Teacher observation method.
5. The same pattern of trait interrelationships should be seen in all triangles. The example clearly meets this criterion: notice that in all triangles the SE-SD relationship is approximately twice as large as the relationships that involve LC.
ADVANTAGES AND DISADVANTAGES OF MTMM
ADVANTAGES: The MTMM idea provided an operational methodology for assessing construct validity. In the one matrix it was possible to examine both convergent and discriminant validity simultaneously. By its inclusion of methods on an equal footing with traits, Campbell and Fiske stressed the importance of looking for the effects of how we measure in addition to what we measure. And MTMM provided a rigorous framework for assessing construct validity.
DISADVANTAGES: Despite these advantages, MTMM has received little use since its introduction in 1959. There are several reasons.
-First, in its purest form, MTMM requires that you have a fully-crossed measurement design -- each of several traits is measured by each of several methods. While Campbell and Fiske explicitly recognized that one could have an incomplete design, they stressed the importance of multiple replication of the same trait across methods. In some applied research contexts, it just isn't possible to measure all traits with all desired methods (would you use an "observation" of weight?). In most applied social research, it just wasn't feasible to make methods an explicit part of the research design.
-Second, the judgmental nature of the MTMM may have worked against its wider adoption (although it should actually be perceived as a strength). Many researchers wanted a test for construct validity that would result in a single statistical coefficient that could be tested -- the equivalent of a reliability coefficient. It was impossible with MTMM to quantify the degree of construct validity in a study.
-Finally, the judgmental nature of MTMM meant that different researchers could legitimately arrive at different conclusions.

TRACKING

Another name for streaming, or grouping students homogeneously based on achievement or some other criteria

APPLIED SCIENCE [Unified View of Validity]

Any scientific enterprise that occurs in a political setting is by definition, "APPLIED SCIENCE"

SCALE

Assignment of numbers according to a set of rules i.e. G=Girls/B=Boys

BIAS VS. FAIRNESS

BIAS: typically demonstrated through empirical evidence, such as when test scores over-predict criterion performance for one group but not another, and can be revealed using scientific methods -At the item level, bias can be detected by examining item responses for group X through differential performance on items, using regression equations FAIRNESS: a value judgment about the appropriateness of decisions or actions based on test scores -A philosophical/political concept, NOT empirical or scientific; there are no stats that reveal the presence or absence of fairness -Individuals/groups often consider decisions fair when they are consistent with their belief systems/values

PERFORMANCE ASSESSMENT

CH. 10

ASSESSING PROCESS

CH. 10 A method of assessing the steps, or sequence of events in a performance or product. Checklists and rating scales are frequently used process assessing tools

PERFORMANCE TEST/ ASSESSMENT OF COGNITIVE TASKS [CONSTRUCT-RESPONSE]

CH. 10 P. 322 CONSTRUCT-RESPONSE TO ASSESS COGNITIVE TASKS Evaluation of cognitive objectives by means of assessing the presentation or performance of the individual on the behaviour to be measured - Might be asked to write essay, develop a plan, create a solution or in some other manner demonstrate a skill.

VARIABLE RESPONSE

CH. 10 P. 322 The opposite of KEYED-RESPONSE items, which have one or more keyed answers. Variable response is more complex, as there is no single correct answer, just varying degrees of completeness or correctness.

ASSESSING PRODUCTS/PERFORMANCES

CH. 10 P. 328 Method used to evaluate a student's work; most commonly done by comparison to a pre-determined performance standard

HALO EFFECT

CH. 10 P. 331 -The tendency of an observer to make subjective judgments based on general attributes -To generalize and perceive that a person has a whole set of characteristics when you have actually observed only one characteristic, trait or behaviour

CH. 10 PERFORMANCE AND PRODUCT EVALUATION

CH. 10 PERFORMANCE AND PRODUCT EVALUATION

CONVENTIONAL TESTS

CH. 10 PG. 319

PRODUCT/ PROCESS

CH. 10 p. 321 Product: usually refers to a concrete object on which assessment is based. Process: the steps, procedures, or parts, transitory in nature, used to create the product. SAFETY assessment is a process, which is transitory but takes precedence over product assessment

TRANSITORY PRODUCTS

CH. 10 p. 321 Where the product or performance is less important than the process to achieve the product or overall performance. The product is just one of many transitional steps to achieve mastery i.e. with each subsequent making of bagels or cheese cake the better I get

RUBRICS

CH. 10 p. 322 -Often hierarchical criteria (standards) used on checklists, rating sheets, or task analyses to identify the qualities of knowledge or skill or specify the elements of performance and/or interactions between individual and environment that will be used to indicate proficiency. -Often samples different levels of skills/rating are given to the students and used by the scorers

CHECKLISTS

CH. 10 p. 323 Method to assess the parts/steps of a process and used to evaluate the appropriateness of a behaviour

MULTIPLE OBSERVERS

CH. 10 p. 328 More than one observer who evaluates a product or performance

DICHOTOMOUS JUDGMENT

CH. 10 p. 329 Evaluation, or assessing the reliability of our judgments, based on rightness or wrongness or on whether the behaviour is present or absent; hence "di-" (two).

SYSTEMATIC OBSERVATION

CH. 10 p. 330 -Method to observe, record, evaluate changes in behaviour rather than cognitive ability or evaluating a product. -Observer functions as an objective mechanical recording device of observations

DIFFICULT-TO-EVALUATE PRODUCTS

CH. 10 p. 321 These are products that are subjective, and the evaluative process fails to have consistent/reliable methods to assess/measure the product. It is easier to evaluate the process of teaching than the product of teaching, when the products are students and their education and the evaluators are various stakeholders: counselling, administration, politicians, citizenry

RATING SCALES

CH. 10 The degrees to which a task has been achieved, rather than just providing a check. Capacity both to determine the degree to which behaviour has been achieved and to assess the affective dimension.

BEHAVIOURAL STATEMENTS

CH. 11 Statements or descriptors attributed to behavioural traits.

CH.11 ATTITUDES AND RATING SCALES

CH. 11 ATTITUDE AND RATING SCALES

APTITUDE TEST

CH. 11 PG. 374

BINET'S THEORY

CH. 11 PG. 375

INTELLIGENCE

CH. 11 PG. 375

COMMON FACTOR ANALYSIS

CH. 11 PG. 376

INTELLIGENCE QUOTIENT

CH. 11 PG. 376

PRIMARY MENTAL ABILITIES

CH. 11 PG. 376

SPEARMAN'S G

CH. 11 PG. 376

THURSTONE'S PRIMARY MENTAL ABILITIES

CH. 11 PG. 376

PSYCHOMETRIC OR STRUCTURAL THEORIES

CH. 11 PG. 377

WECHSLER'S THEORY

CH. 11 PG. 378

CATTELL-HORN THEORY

CH. 11 PG. 379

CRYSTALLIZED INTELLIGENCE

CH. 11 PG. 379

FLUID INTELLIGENCE

CH. 11 PG. 379

JENSEN'S THEORY

CH. 11 PG. 379

STERNBERG'S TRIARCHIC THEORY

CH. 11 PG. 380

DAS-NAGLIERI COGNITIVE ASSESSMENT SYSTEM [CAS]

CH. 11 PG. 381

THEORY OF MULTIPLE INTELLIGENCES [GARDNER]

CH. 11 PG. 382

RATIO IQ

CH. 11 PG. 384

ROUTING TEST

CH. 11 PG. 385

STANFORD-BINET INTELLIGENCE SCALES [SB-IV AND SB5]

CH. 11 PG. 385

NON-VERBAL MEASURES OF COGNITIVE ABILITY [RAVEN'S & UNIT]

CH. 11 PG. 397

DIFFERENTIAL APTITUDE TEST BATTERY [DAT]

CH. 11 PG. 408

COVERT CHARACTERISTICS

CH. 11 p. 345 Private characteristics of personality that can only be sketchily deduced by an individual's actions

HALO EFFECT

CH. 11 p. 348 Propensity for raters to base evaluations on an overall general impression, rather than on a specific attribute

STIMULUS VARIABLES

CH. 11 p. 350 The qualities to be rated which are usually given trait names

RESPONSE OPTIONS

CH. 11 p. 350 The various ratings that can be given; i.e. numerical or adjectival categories

ALTERNATION RANKING

CH. 11 p. 360 Rater ranking method of picking the individual highest on a given trait, and alternating with picking the individual lowest on the trait. Those left, automatically fall into the middle. This is done when rater knows many ratees.

SUMMATIVE RATING

CH. 11 p. 363 A person's attitude is reflected by the sum of the responses to the given statements.

MEDIAN RATING

CH. 11 p. 366 As it relates to item selection for rating scales using judges' "for to against" scale: the median position of any item on the "favourableness-unfavourableness" scale as rated by the judges; used together with the spread of ratings [INTERQUARTILE RANGE], which indicates the item's level of ambiguity.

INTERQUARTILE RANGE

CH. 11 p. 366 As it relates to item selection for rating scales using judges' "for to against" scale: the spread of ratings, which indicates the level of ambiguity of an item, considered in relation to the item's median position on the "favourableness-unfavourableness" scale as rated by the judges.

ALTERNATIVE FORMATS

CH. 11 p. 368 Formats other than summative to measure attitudes: 1. Thurstone Scaling 2. Guttman Scales 3. Semantic Differential

THURSTONE SCALING

CH. 11 p. 368 Named after its author -Refers to "a way of measuring people's attitudes along a single dimension by asking them to indicate that they agree or disagree with each of a large set of statements (ex. 100) that are written about that attitude." Choice of two possible responses [Ex. people should exercise if they want to be healthy: Agree___ Disagree____ OR True____ False___] -A technique for measuring an attitude. It is made up of statements about a particular issue, and each statement has a numerical value indicating how favorable or unfavorable it is judged to be. Also uses a 1-to-11 scale.

RELIABILITY

CH. 11 p. 368 Result of item analysis [SEE ITEM ANALYSIS] that determines the internal consistency reliability of attitude rating scales

SEMANTIC DIFFERENTIAL

CH. 11 p. 369 Grew out of the work of Osgood; applies to the domain of meaning given to adjectives used to describe human traits, which can be represented by three main categories: 1. Evaluative [good, bad, mean, kind] 2. Potency [strong, weak, able, helpless] 3. Activity [energetic, lazy, busy, idle]

GUTTMAN SCALES

CH. 11 p. 369 Named after author Person's position on the continuum is defined by the highest statement accepted -Items are arranged in an order so that an individual who agrees with a particular item also agrees with items of lower rank-order. For example, a series of items could be (1) "I am willing to be near ice cream"; (2) "I am willing to smell ice cream"; (3) "I am willing to eat ice cream"; and (4) "I love to eat ice cream". Agreement with any one item implies agreement with the lower-order items. -threshold would be the place where participant answers no, thereby reaching their limit

LIKERT SCALE

CH. 11 pg. 363 A person's attitude is reflected by the sum of the responses to the given statements.

ITEM ANALYSIS

CH. 11 pg. 366 Is the administration of a set of survey/opinion poll statements, derived from the item selection process [p. 366] that is given to a pilot group and then analyzed to find the coefficient of reliability. The higher the correlation the better the item.

FORCED-CHOICE FORMAT

CH. 11 -Requires the rater to choose two statements out of four that could describe the ratee. EX: choose the two items that best describe your instructor. -The test taker is forced to choose one of the answers, even if none of them seems to fit their interests precisely -A question type in which respondents give their opinion by picking the best of two or more options

Q-SORT

CH. 11 -Self-report assessment procedure designed to measure the discrepancy between a person's actual and ideal selves. -Participants are presented with a set of cards on which words or statements are written and are asked to sort the cards along a specified bipolar dimension, such as agree/disagree; typically between 50 and 100 cards are sorted into 9 or 11 piles

GRAPHIC SCALES

CH. 11, - a performance evaluation method that identifies various job dimensions and contains scales that are used to rate each employee on each dimension. -Measurement scales that include a graphic continuum, anchored by two extremes -Scale that uses adjectives or numbers as anchors but the descriptive detail of the anchors differs widely.,

RATER BIASES

CH. 11 -Halo effect, central tendency, leniency and strictness biases, contrast effect -Leniency/Strictness = avoiding the middle range and rating all employees as high (L) or low (S) on all dimensions -Central tendency bias = only using the middle range -Halo effect = how a rater rates on one dimension affects ratings on unrelated dimensions; a general impression of an employee drives how he rates them on all dimensions -Contaminating factors in the rating process related to the way that the rater makes ratings.

EDUCATIONAL PLACEMENT DECISIONS

CH. 7 What are the important considerations for how a placement decision should be determined for a student? a. It should be based on finding the best match between the instructional requirements of the student and the best placement to meet those instructional needs of the student. b. Student should come first in placement decisions and the potential benefits that the student would gain because of that placement decision must be at the forefront.

PAPER-AND-PENCIL TESTS

CH. 7 Teacher-Made Assessments: Paper and Pencil
1. Purposes/Uses
a. Usually used for learning outcomes that are categorized as knowledge-based and/or require said knowledge to be applied i.e. solving math problems, assessing reading and writing skills
2. Limitations
a. Usually poorly made because of 2 things:
i. Lacks validity; test items fail to support the objective[s] of the assessment
ii. Test lacks psychometric qualities i.e. test blueprint, internal validity, test-retest
b. Therefore the results and the use of the results are skewed and often fail to have the desired results

ORAL TESTS

CH. 7 Oral Tests
1. Purposes/Uses
a. Require more time of the teacher, but are similar to the objectives of paper-and-pencil tests as they can assess the same types of learning outcomes
b. Can be more advantageous to both students and teachers when used to assess attributes that are more qualitative in nature: i. Debating skills ii. Expressive language iii. Synthesis of ideas iv. Higher-level thinking skills
c. Used for students who have limited written skills, a learning disability, or paper-and-pencil test anxiety
d. Quickly used to identify potential learning difficulties i.e. "Explain how you would do such and such?" or "Tell me how you got that answer?" The responses can help to identify the source of difficulty and areas of strength or "right thinking."
e. Students with sensory or motor deficits may be more accurately assessed by this form of assessment
2. Limitations/Restrictions
a. Evaluation tends to be more subjective
b. Oral tests are less accurate in identifying fundamental/causal reading difficulties, although they may help to isolate absent sub-skills

PRODUCT EVALUATION

CH. 7 Product Evaluations [tangible item created by the student, on which he/she is to be assessed]
1. Purposes/Uses
a. Woodwork, book report, video, penmanship, portfolio
b. Used to assess the product rather than using other teacher-made assessment tools i.e. it is better to test shop safety with paper-and-pencil, and use of woodworking equipment by evaluating the finished product
2. Limitations/Restrictions
a. Most teachers lack the skill to isolate and define the given attribute and, hence, to create a valid and reliable measuring tool to assess the attribute or aspect

PERFORMANCE ASSESSMENT

CH. 7 Performance Assessment [test]
1. Purposes/Uses
a. Often overlaps with oral tests
b. Best used in assessments that involve a sequence of skills, procedures, endurance, timed events i.e. singing, playing an instrument, running a marathon, # of push-ups in a minute
c. Sometimes is the only form of assessment that can be used i.e. how fast can you run 100 m; can you tie your shoe; can you speak a foreign language
d. Rarely leaves a tangible product [however it can i.e. video, recording]
2. Limitations/Restrictions
a. Similar to product and oral assessment: difficult to isolate and measure the given attribute or construct
b. Difficult to measure degree of mastery or performance due to less than clearly defined and agreed-to criteria. Results in problems of reliability

AFFECTIVE ASSESSMENT MEASURES

CH. 7 Affective Assessment Measures - Support the cognitive instructional objective. Personal and social attributes that teachers and/or society want/expect students to have. Very qualitative in nature.
1. Purposes/Uses
a. Instill and encourage values such as citizenry, tolerance, fairness, appreciation
b. Use observations to infer and hence rate the attribute
c. Also use self-rating scales
d. Needs to be ongoing to allow the teacher to adjust teaching strategies, environment, and curriculum as reflected by students
2. Limitations/Restrictions
a. Assessing only the by-product of a feeling or belief system and not the motivation or ethics underlying the by-product. For example, we say that Student A is honest because he turned in money that his friend, Student B, lost. Perhaps Student A was the one who took it and wanted to be in the good graces of Student B and appear to be known as an honest person.
b. Self-rating is only as good as the level of honesty and self-awareness of the rater/student.

MAINSTREAMING

CH. 7 -Requires that students be educated in the "least restrictive" environment that meets their individual learning needs -Called MAINSTREAMING [SEE PG. 228 PL 94-142]

PUBLIC LAW 94-142

CH. 7 1975 Education For All Handicapped Children Act, also known as Public Law [PL] 94-142; amended in 2004 to IDEA [Individuals with Disabilities Education Improvement Act] -Gave disabled students the same right to an inclusive education -Requires that students be educated in the "least restrictive" environment that meets their individual learning needs -Called MAINSTREAMING

HIGH-STAKES TEST

CH. 7 A test whose results have a direct impact on the examinee and whose sole purpose is to be used by organizations and institutions to inform decisions regarding placement, programs, and resources.

EMOTIONAL DISTURBANCE

CH. 7 Children whose behaviour is so negative, severe, and persistent that it impairs their ability to achieve and thrive in an educational setting and negatively impacts and interferes with the learning environment of others. i. Assessed through behavioural testing, e.g. a behaviour rating scale such as the BASC 1. Would have to score in the 1st or 2nd percentile compared to peers their age to be considered emotionally disturbed, because there are such extreme variations amongst children ii. Student record of discipline infractions iii. Social Services involvement iv. Outside agency involvement i.e. corrections, mental health

LEARNING DISABILITY

CH. 7 Different from a general intellectual disability, as it must be proven that the student's intellectual functioning is normal and NOT low!!! Determined via intelligence/achievement tests: -Needs to show a discrepancy between intellectual capacity/ability [Wechsler] and academic performance [WIAT] and report card marks.

MINIMUM COMPETENCY LEVELS

CH. 7 Minimum competency levels come with their own issues and have been challenged in the legal system:
1. Who determines who gets a diploma?
2. How much time is actually required for remediation?
3. Racial minorities comprise a larger percentage of the population negatively impacted by the use of high-stakes testing; more of them fail compared to their white counterparts. The concern is that these failure rates may be used to perpetuate the historical bias towards separate and unequal.
4. Concerns arise when results of testing are used as primary data, to the exclusion of other formative assessment tools, in decision-making.
5. The impact of decisions requires ongoing monitoring to minimize individual risk or potential risk.
6. ELL/ESL students may require accommodations when the skill being tested is English.
7. The same applies for students with disabilities, to increase score validity.
8. To counteract the above concerns with minimum competency testing, three things have been promoted:
a. Minimum competency must not promote historical academic segregation
b. Content validity must comprise both curricular validity, which means the test must match the prescribed curriculum, and instructional validity, which means the test must match the curriculum actually taught.

SUMMATIVE EVALUATION

CH. 7 Often formal, norm- or criterion-referenced "high-stakes" testing administered at the completion of a prescribed set of learning outcomes. -Usually teacher-constructed and teacher-administered -Results may warrant referral to a school psychologist or counsellor for further assessment, access to specific emotional or academic supports, or placement in a special ed. programme -Best practice dictates that both types of assessment tools [formative/summative] be used in conjunction with each other, that they are valid and being used to measure the given attribute[s], and that they are reliable and can consistently measure the attribute[s] no matter what class they are administered to.

FORMATIVE ASSESSMENT

CH. 7 Ongoing assessment tools and methods that direct the instruction for the class and individual and guide the direction that the teacher will go i.e. more emphasis placed on a concept, or the students have mastered it so the teaching of a learning outcome can be omitted. -Best practice dictates that both types of assessment tools [formative/summative] be used in conjunction with each other, that they are valid and being used to measure the given attribute[s], and that they are reliable and can consistently measure the attribute[s] no matter what class they are administered to.

STANDARDIZED ACHIEVEMENT TESTS

CH. 7 Standardized Achievement Test
1. Purposes/Uses
a. Usually given at the beginning of a new year to help determine instructional decisions for the class and for the individuals in the class. I have used it to determine whether as a class I will use an adapted math curriculum [which meets the learning outcomes but is more visual, has more repetition, and contains ongoing reminders of terms, concepts and steps], the regular textbook version [which is less visual and is more condensed], or whether I should divide the class in two and use both resources.
b. Can be administered at year's end, especially when used with one done in the fall to compare achievement and growth. When I taught grade 2 at an elementary school, my administrator required us to show one year's growth in all of the students. We used the CAT for this purpose. Of my 32 students, only 2 did not achieve this. It allowed us to prove to parents that further testing was needed. Both were sent for further psych. ed. testing, which resulted in one student being categorized as LD and one having moderate to severe intellectual functioning.
2. Restrictions/Limitations
a. Only good for initial placement and not ongoing or formative assessing that focuses on the objectives and content that are unique to each classroom teacher. Standardized tests generally fail to provide answers to those types of questions, because to test nation- or province-wide, the questions have to be more generic in nature.
b. Information is often needed promptly, and usually test results are returned after the need occurs.
c. Only measures cognitive attributes that can be assessed via paper-and-pencil tests. Very limited in its use to measure oral and motor skills and affective domains.
d. Standardized tests tend to be cost-prohibitive, and their widespread use is usually limited by time to administer, time to mark, and accessibility.

ITEM SAMPLING

CH. 7 [NOTE: NOT TAKEN FROM TEXT AS I COULDN'T FIND] Also referred to as content sampling: the variety of the subject matter contained in the items; frequently referred to in the context of the variation between individual test items in a test or between test items in two or more tests. Method: Alternate forms or parallel forms - the correlation between equivalent forms of the test that have different items

TEACHER-MADE TESTS

CH. 7 [See Individual terms] Five types of tests: 1. Paper and Pencil 2. Oral 3. Product Evaluation 4. Performance 5. Affective Measures

DEBRIEFING

CH. 8 -Giving participants in a research study a prompt and complete explanation of the study after the study is completed -The post-experimental explanation of a study, including its purpose and any deceptions, to its participants -Take responsible steps to correct any misconceptions/deceptions -Reduce risk of harm due to justified delays/withholding of information -Make immediate corrections when made aware of post-research harm

INTERNET-BASED PSYCHOLOGICAL TESTING

CH. 8 -Online assessment tools administered via the Internet

PRIVACY

CH. 8 1. Deals with the degree of access to an individual's body or behaviour. NOT TO BE CONFUSED WITH CONFIDENTIALITY, which focuses on an individual's personal information and the degree to which others have access to information that is voluntarily given 2. Divulge [written/oral] only information pertinent to the context of the communication 3. Divulge [written/oral] only with and for appropriate scientific/professional people/purposes concerned with such matters

INFORMED CONSENT

CH. 8 1. When an individual or his/her legal representative gives written consent to participate in an activity [client-professional relationship; test, survey, research, medical procedure], and is made aware of all the known and potential risks and benefits, prior to engaging in said activity. 2. Language is such that it is reasonable for the client to understand 3. Not necessary where the activity is mandated by law 4. If the individual is legally incapable of giving informed consent, then give an appropriate explanation, seek the individual's assent, consider the individual's best interests/preferences, and get consent from the legal guardian 5. If services are court ordered and/or mandated, explain the process and nature of services and disclose that they are court ordered or mandated 6. Consent is documented 7. Client can choose at any time whether to enter into or terminate the counselling relationship

TEST BIAS

CH. 8 An undesirable characteristic of tests in which item content discriminates against certain students on the basis of socioeconomic status, race, ethnicity, or gender. -Tendency of a test to predict outcomes better in one group than another

SELF-DETERMINATION

CH. 8 Concept that all individuals, including those with disabilities have the right to govern themselves

FALSE POSITIVES

CH. 8 The individual has good test scores contrary to their actual lower ability or skill levels. The result is that the individual is incorrectly labeled as meeting a given criterion, when in actuality they do not, due to an error in assessment, which is then transferred and applied to making high-stakes classifying, diagnosing, and/or selecting decisions.

FALSE NEGATIVES

CH. 8 The individual has poor test scores contrary to their actual higher ability or skill levels. The result is that the individual is incorrectly labeled as not meeting a given criterion, when in actuality they do, due to an error in assessment, which is then transferred and applied to making high-stakes classifying, diagnosing, and/or selecting decisions.

COMPETENCY

CH. 8 The knowledge, skills, and/or attitudes necessary for performing certain tasks.

AUTONOMY

CH. 8 To rule oneself; that is, to do things by oneself in the manner one chooses--NOT to be ruled or governed by others

ADVERSE IMPACT

CH. 8 Unintentional discrimination that occurs when members of a particular race, sex, or ethnic group are unintentionally harmed or disadvantaged because they are hired, promoted, or trained (or any other employment decision) at substantially lower rates than others -Occurs when the selection rate for a protected class is less than 80% of the rate for the class with the highest selection rate; also known as disparate impact.

CH. 8 ETHICS AND ISSUES IN ASSESSMENT

CH. 8 ETHICS AND ISSUES IN ASSESSMENT

UNIFIED VALIDITY

CH. 8 PG. 270 Messick's rubric of Unified Validity incorporates the idea that the worthiness of a test cannot be isolated or considered in isolation from the social benefit derived from the test. In other words, test-worthiness is based on social benefit derived from the results of testing.

HIGH-STAKES TESTING

CH. 8 Tests used to inform critical decisions that impact the educational future, programming, placement, and resources for K-12 students: i. Results may determine which schools/students get rewarded or penalized and which students get promoted or have mandated remediation or retention ii. Purpose of competency testing is threefold: 1. First, to identify those requiring remediation. 2. Second, to ensure that all those being promoted have met a minimum competency level. 3. Third, to increase student motivation to achieve more academically.

BASE RATE

CH. 8 The proportion of a group of applicants who would have succeeded if all were admitted; the 'BASE RATE OF SUCCESS'

HIT RATE

CH. 8 The proportion of correct decisions that result from a given selection strategy [the proportion of correctly identified successes, plus the proportion of correctly rejected failures] -The proportion of people who are accurately identified as possessing or not possessing a particular trait, behaviour, characteristic, or attribute based on test scores -As the base rate departs from 50% [either higher-lower], the potential 'HIT RATE' for the selection procedure diminishes -HIT RATE: The proportion of correct decisions [sum of proportions of success and rejections] that result from selection strategy -Each 'CUTTING SCORE' has a 'HIT RATE' that depends on the 'BASE RATE of SUCCESS' in the population and the correlation between the predictor and the criterion
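A minimal numeric sketch in Python (the proportions are illustrative, not from the text) showing how the hit rate and the base rate combine the four decision outcomes:

# Illustrative decision-outcome proportions (hypothetical values)
valid_accepts = 0.35     # selected and succeeded
valid_rejects = 0.40     # rejected and would have failed
false_positives = 0.10   # selected but failed
false_negatives = 0.15   # rejected but would have succeeded

hit_rate = valid_accepts + valid_rejects       # 0.75: proportion of correct decisions
base_rate = valid_accepts + false_negatives    # 0.50: proportion who would succeed if all were admitted
print(hit_rate, base_rate)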

CO-CONSENTER

CH. 8 [6.15]

CONFIDENTIALITY

CH. 8 [8.2, 8.6, 12.11] 1. Deals with an individual's personal information and the degree to which others have access to information that is voluntarily given. NOT TO BE CONFUSED WITH PRIVACY, which focuses on the degree of access to an individual's body or behaviour. 2. Client made aware of risks to confidentiality imposed via electronic transmission and use of the internet 3. Discussion of confidentiality occurs at the outset of the relationship, and ongoing as situations arise

UNIFIED VALIDITY

CH. 8, In the early 1980's a unified view of test validity/validation emerged. Messick [1989] argued that it is not enough just to create a test; you have to use the results/scores for some interpretive purpose and rule out other interpretations to show that the chosen interpretation is the correct one, i.e. looking for a murderer and ruling out that anyone else could be the murderer. Score meanings must now include the value implications of score interpretation and the utility, relevance, and social consequences associated with test use [both actual and potential]. The appropriateness of a particular test use must be justified in light of all possible outcomes. SUCH JUSTIFICATION REQUIRES NOT ONLY THAT THE TEST SERVE ITS INTENDED PURPOSE BUT ALSO THAT THE VALUE OF DOING SO OUTWEIGHS THE IMPACT OF ANY ADVERSE SOCIAL CONSEQUENCES! The unified view looks at test validation as a "process" of gathering evidence to build the best possible case for the inferences we would like to make, in support of two types of inferences, and the social, personal and long-term effects of using such inferences: 1. Interpretive Inference: A statement of what test scores mean/our interpretation of the scores -Validation involves bringing together multiple lines of evidence to support certain interpretations of test scores, whilst showing that other interpretations are less plausible -Assigning meaning to test scores carries certain value implications i.e. an IQ test and its number 2. Action Inference: The appropriateness and utility of test scores as the basis for some specific actions such as applied decision making -Validation requires both evidence of score meaning and evidence of the appropriateness and usefulness of test scores for particular applied purposes.

DISSENT

CH. 8, 1. Hold or express opinions that are at variance with those previously, commonly, or officially expressed; disagree. 2. The expression or holding of opinions at variance with those previously, commonly, or officially held.

BENEFICENCE

CH. 8, Providing services that will benefit the client; doing good or causing good to be done; kindly action; doing no harm and minimizing harm -Beneficence and nonmaleficence mean that psychologists should try to help their patients and should do no harm to them, while minimizing any unavoidable harm in the event of a conflict of obligations.

DECEPTION

CH. 8, in research, an effect by which participants are misinformed or misled about the study's methods and purposes

ASSENT

CH. 8, to agree to something especially after thoughtful consideration

DISTRIBUTIVE SOCIAL JUSTICE

CH. 8-Page 270 Inherent principle of fairness as it applies to the equitable and impartial distribution of educational/ psychological services to all people.

DISCRIMINATION INDEX

CH. 9 -The proportion of test takers in the upper group who got the item correct minus the proportion in the lower group who got the item correct -Usually the top and bottom 27% of examinees. However, for class purposes, use the mean.
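A minimal Python sketch (not from the text; the 1/0 item scoring and variable names are my assumptions) of the upper-minus-lower computation:

def discrimination_index(scores, group_fraction=0.27):
    """scores: list of (total_score, item_correct) pairs, item_correct in {0, 1}.
    Returns proportion correct in the upper group minus proportion correct in the lower group."""
    ranked = sorted(scores, key=lambda pair: pair[0])      # lowest to highest total score
    n_group = max(1, int(len(ranked) * group_fraction))    # e.g. top and bottom 27%
    lower, upper = ranked[:n_group], ranked[-n_group:]
    p_upper = sum(item for _, item in upper) / n_group
    p_lower = sum(item for _, item in lower) / n_group
    return p_upper - p_lower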

NEGATIVELY DISCRIMINATING ITEMS

CH. 9 Describes the condition when more people in the upper group fail the test item compared to those in the lower group. Most undesirable; the item should be discarded

POSITIVELY DISCRIMINATING DISTRACTOR

CH. 9 Describes the condition when more people in the upper group pass the test item compared to those in the lower group. Most desirable; the item should be kept

PRODUCE-RESPONSE ITEMS

CH. 9 Essay type or short answer type questions where examinee must supply response

CORRECTION FOR GUESSING

CH. 9 Score adjustment for guessing on test items using a formula: the number of items answered incorrectly [W], divided by the number of options or choices per item minus one [N-1], is subtracted from the number of items answered correctly [R]. Items which are omitted are excluded from the formula. Corrected score = R - W/(N-1)
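A small Python sketch of the correction formula (the function name and example numbers are mine, not the text's):

def corrected_score(right, wrong, n_options):
    """Corrected score = R - W / (N - 1); omitted items are simply not counted."""
    return right - wrong / (n_options - 1)

# e.g. 40 right and 10 wrong on 4-option items: 40 - 10/3 -> about 36.7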

POINT-BISERIAL CORRELATION

CH. 9 The correlation between examinees' scores on a single item [scored 1 for correct, 0 for incorrect] and their total test scores.
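A minimal sketch, assuming 0/1 item scoring and treating the point-biserial as the Pearson correlation between the item and the total score:

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a dichotomous (0/1) item and total test scores."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item_scores, total_scores)) / n
    sd_i = (sum((i - mean_i) ** 2 for i in item_scores) / n) ** 0.5
    sd_t = (sum((t - mean_t) ** 2 for t in total_scores) / n) ** 0.5
    return cov / (sd_i * sd_t)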

FOILS

CH. 9 The incorrect answers in multiple choice items

ITEM DIFFICULTY

CH. 9 The proportion of examinees who correctly answered the item. Should be called item easiness because the higher the value the easier the test item.

SELECT RESPONSE ITEMS

CH. 9 These are items where the examinee is provided with options and must select the correct answer

OPTIONS - EFFECTIVE/FUNCTIONING

CH. 9 These are the choices or options given for the item. Effective and functioning options reduce the risk of guessing.

DISTRACTORS

CH. 9 These are the incorrect choices for the item.

STEM

CH. 9 This is the part of the item in which the problem is stated for the examinee. It can be a question, a set of directions or a statement with an embedded blank.

ITEM TRYOUTS

CH. 9 Where commercial publishers "try out" hundreds of test items on hundreds of sample pilot groups to get empirical data regarding test item quality.

CH. 9 TEST DEVELOPMENT

CH. 9 TEST DEVELOPMENT

SELECTION RATIO

CH.8, the number of applicants compared with the number of people to be hired

CHAPTER 7 ASSESSMENT AND EDUCATIONAL DECISION MAKING

CHAPTER 7 ASSESSMENT AND EDUCATIONAL DECISION MAKING

VALUE-BASED DECISIONS: FOUR COMMON USES

Cannot serve all values simultaneously; must decide which is most important given the context 1. Selection according to ability 2. Effort 3. Accomplishments 4. Need

CRITERION RELATED VALIDITY PRACTICE OF PREDICTION [Re: Criterion Related Evidence of Validity]

Ch. 2 Use a regression line to make the best prediction of a person's score on one variable from their score on another. -This same regression line provides the most accurate and therefore most valid prediction of people's scores on a criterion variable. -The square of the correlation of the two variables tells us the strength of the relationship; specifically, it shows how much of the variation in scores on the criterion variable is predictable from the predictor. -Applicants who have done poorly on the test may be less impressed by the fact that the 'probability' is that they will be below average than by the fact that they may still do very well. Each individual may be the exception to the rule.
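A minimal sketch of the regression prediction described above (the means, SDs, and r would come from a validation sample; the numbers in the comment are illustrative):

def predict_criterion(x, mean_x, mean_y, sd_x, sd_y, r):
    """Best linear prediction of a criterion score from a predictor score x."""
    return mean_y + r * (sd_y / sd_x) * (x - mean_x)

# With r = .50, an applicant one SD above the predictor mean is predicted to be only
# half an SD above the criterion mean; r**2 = .25 of the criterion variance is predictable.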

META-ANALYSIS & VALIDITY GENERALIZATIONS

Coined by Glass [1977] to describe the systematic pooling and integration of results from many different studies of the same phenomenon, providing a consistent estimate of test validity/validity coefficients that applies to the group of studies as a whole -I.E. pooling ability test data so that, when pooled, all tests can be used/compared using a single validity coefficient instead of each having varying degrees of discrepancy when compared to the others -This pooled value is considered more stable and closer to the true value than the value from a local sample, which normally has a smaller pool and a less dependable validity estimate VALIDITY GENERALIZATION Name given by Schmidt and Hunter to the application of meta-analysis -Refers to the extent to which a validity established in one setting with one sample can be generalized to another setting and sample. -Assumes that the results of criterion-related validity studies conducted in other companies can be generalized to the situation in your company. -The extent to which validity coefficients can be generalized across situations
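A bare-bones Python sketch of the pooling idea (a sample-size-weighted mean of validity coefficients; real validity generalization work also corrects for artifacts such as unreliability and range restriction, which this toy function does not):

def pooled_validity(studies):
    """studies: list of (sample_size, validity_r) pairs from separate local studies."""
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

# e.g. pooled_validity([(50, .20), (200, .35), (120, .30)]) -> about .31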

VALIDITY GENERALIZATIONS AND META-ANALYSIS

Coined by Glass [1977] to describe the systematic pooling and integration of results from many different studies of the same phenomenon, providing a consistent estimate of test validity/validity coefficients that applies to the group of studies as a whole -I.E. pooling ability test data so that, when pooled, all tests can be used/compared using a single validity coefficient instead of each having varying degrees of discrepancy when compared to the others -This pooled value is considered more stable and closer to the true value than the value from a local sample, which normally has a smaller pool and a less dependable validity estimate VALIDITY GENERALIZATION Name given by Schmidt and Hunter to the application of meta-analysis -Refers to the extent to which a validity established in one setting with one sample can be generalized to another setting and sample. -Assumes that the results of criterion-related validity studies conducted in other companies can be generalized to the situation in your company. -The extent to which validity coefficients can be generalized across situations

CONSTRUCT VALIDITY AS THE WHOLE OF VALIDITY [Unified View of Validity]

Construct-related evidence comes from the correspondence between test scores and deductions from theory. Content and criterion-related evidence help to determine fit to predicted patterns of relationship with other variables; BUT response to experimental interventions and agreement with predicted patterns of relationship with other variables are also important. ////////////////////////////////////////////////// According to Messick [1989], construct validity encompasses all of the "evidence and rationales supporting the trustworthiness of score interpretations in terms of explanatory concepts that account for both test performance and relationships with other variables" [p. 34]. A critical feature of construct validation efforts is that there be some organizing theoretical or conceptual framework to serve as a guide to score interpretation -Construct validity viewed as the whole of validity when compared to content and criterion related evidence -Because its basis is providing multiple proofs to support the validity of the construct being measured, it unifies all other measurements and is therefore the "whole" of validity; content and criterion related evidence do not **************************************** Construct validity refers to the validity of inferences that observations or measurement tools actually represent or measure the construct being investigated.[2] In lay terms, construct validity examines the question: Does the measure behave like the theory says a measure of that construct should behave? Constructs are abstractions that are deliberately created by researchers in order to conceptualize the latent variable, which is the cause of scores on a given measure (although it is not directly observable). Construct validity is essential to the perceived overall validity of the test.

VALIDITY-CRITERION REFERENCED TESTS

Criterion-referenced tests are concerned with the measurement of particular, narrowly defined instructional objectives. Because of this the critical concern in assessing validity is CONTENT-RELATED evidence. Includes 2 elements: 1. Information content of the domain 2. What individual should be able to do with it

EFFECTS OF UNRELIABILITY ON CORRELATION OF VARIABLES

Deals with chance errors between tests and the method to account for said errors [SEE CORRECTING FOR ATTENUATION]

CONSTRUCT VALIDITY INTERNAL/EXTERNAL COMPONENTS [Construct Validity as the Whole of Validity]

Each construct is embedded in a theory/network of theories that describes our understanding. The conceptual framework includes both an internal model of the interconnected facets/dimensions of a construct and an external model detailing the relationships between that construct and other constructs. Construct validation of score-based inferences traditionally has both. INTERNAL -Consists of all the elements of a theory that are necessary to define the construct, which can then be translated into testable hypotheses that are, in turn, open to empirical confirmation or refutation -Description of the thought processes thought to underlie performance on the test, and their organization -Methods of evaluation are increasingly complex -Appropriate weighting of different aspects for multidimensional [and perhaps hierarchically arranged] constructs -Suitability of different item formats relative to the underlying processes to be tested -The above issues are easily tested with correlation coefficients; BUT test theorists currently rely on ITEM-RESPONSE theory to evaluate specified internal relationships EXTERNAL -Details the construct's expected relationships with other constructs. It includes the other constructs as well as the relationships among them, and the strength and direction of those relationships, evaluated by externally testing the hypotheses. -Convergent and discriminant evidence is collected -Scores on a test reflecting a given construct should correlate more highly with other indicators of the same construct than they do with measures of different constructs

INTERPRETATION OF VALIDITY COEFFICIENTS/CORRELATIONS [Re: Criterion Related Evidence of Validity]

FACTORS THAT DISTORT VALIDITY COEFFICIENTS: 1. Unreliability of the predictor and of the criterion that is being predicted [See Ch. 4, attenuation effect] 2. Restriction in the range of the ability group by some type of preselection, often based on the predictor itself 3. Validity may be fairly specific to a particular curriculum or job. Therefore, validity must always be evaluated in relation to a situation as similar as possible to the one in which the measure is to be used Low reliability or preselection will tend to lower the values that are obtained for validity coefficients, so that "true validities" are typically higher than the values obtained in validation studies EXAMPLE OF RESTRICTING THE RANGE OF ABILITIES ON THE VALIDITY OF A PREDICTOR: Too many applicants, so selection moves from first come-first served to criterion selection [grades/aptitude test]. The correlation between college grades and high school grades went down from .61 to .47 when changed from the old system to the new system. WHY???? The result of a smaller range of talent among selected applicants The usefulness of a test as a predictor depends not only on how well it correlates with a criterion measure, but on how much new information it gives. -The higher the correlation between a test/other predictor and a criterion measure, the better HOW HIGH A VALIDITY COEFFICIENT? -Answer based on the proportion of applicants to be selected, called the "SELECTION RATIO", and the prevalence of success in the population, called the "BASE RATE" LIMITATIONS OF CRITERION RELATED VALIDITY: Criterion-related validity is most important for a test that is to be used to predict outcomes that are represented by clear-cut criterion measures. 1. The main limitation to using criterion-related validity in the prediction context usually lies in the limited adequacy of the available criterion measures 2. The more readily we can identify a performance criterion that unquestionably represents the results we are interested in, the more we will be prepared to rely on the evidence from correlations between a test and measures of that criterion to guide our decision on whether to use the test scores

FAIRNESS VS. BIAS

FAIRNESS Is a value judgment about the appropriateness of decisions or actions based on test scores -Philosophical/political concept, NOT empirical or scientific; there are no stats that reveal the presence or absence of fairness -Individuals/groups often consider decisions fair when they are consistent with their belief systems/values BIAS Typically demonstrated through empirical evidence, such as when test scores over-predict criterion performance for one group but not another, and can be revealed using scientific methods -At the item level bias can be detected by examining item responses for group X through differential performance on items using regression equations

CORRECTING FOR ATTENUATION DUE TO UNRELIABILITY

Formula used to correct for the unreliability of the "Correlation of Variables" to extract an estimate of the correlation between the underlying true scores, to see how much the functions have in common -Expresses the relationship amongst the estimated correlation of true scores, the correlation of observed scores, and the reliabilities of the two measures
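A minimal Python sketch of the attenuation correction in its standard form (observed correlation divided by the square root of the product of the two reliabilities; example values are illustrative):

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated correlation between true scores, given the observed correlation and the two reliabilities."""
    return r_xy / (rel_x * rel_y) ** 0.5

# e.g. observed r = .42 with reliabilities .80 and .70 -> .42 / sqrt(.56) -> about .56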

Spearman-Brown Prophecy Formula for Split-Half Tests

Formula, named after the authors, to predict/estimate the reliability coefficient of split-half tests, as it applies to predicting the reliability of a "whole" test, by using and giving only one test PROS: Unlike the Internal Consistency method, it does not assume homogeneous content across all items, but only between the two halves. It uses and gives only one test CONS: If the halves are not equivalent, then the reliability coefficient will be skewed. Different splits can produce different results and this method is therefore arbitrary.
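A minimal sketch of the prophecy formula (k = 2 gives the usual split-half correction; the general form for lengthening a test k times is shown):

def spearman_brown(r_half, k=2):
    """Projected reliability when a test is lengthened k times."""
    return k * r_half / (1 + (k - 1) * r_half)

# e.g. a half-test correlation of .60 -> estimated full-test reliability of .75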

DISCRIMINANT VALIDITY [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

I find it easiest to think about convergent and discriminant validity as two inter-locking propositions. In simple words I would describe what they are doing as follows: CONVERGENT: measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs) and DISCRIMINANT: measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs)

CRITERION-REFERENCED TESTS/INTERPRETATION & [RELIABILITY OF]

INTERPRETED IN 3 WAYS: 1. Mastery vs Non-Mastery: Test scores fall either above/below a pre-set cut off for mastery 2. Degree of Mastery: Scores compared to a pre-set standard, but scores that greatly exceed the cutoff are considered to reflect greater mastery than those that just barely surpass the cutoff i.e. levels in martial arts [white belt, yellow belt, black belt] 3. Domain Score: Test items sample a domain of content, and the score is seen as a predictor of what he/she would have gotten on the test if all the questions from each domain were given. The concern is the accuracy of the domain score RELIABILITY OF: Interpretations of results all share the concept of consistency of information. -Mastery/Non-Mastery: Classified into one of two groups; those who have mastered it and those who haven't: -Done by Test-Retest or Alternate Forms of the test. Also a single-test method similar to the KR-20 -Whichever test is used, you must ask how often the decision would be reversed on the other testing; or how often we would have switched from pass to fail -Best to use two short tests of mastery, rather than one long one -Use Variance Components [squares of Standard Deviation] to estimate the degree of variation of mastery -Bottom line is that it is difficult/complex to estimate reliability for criterion referenced tests CONS: -If many are close to the threshold of mastery [learned it, but barely], there are likely to be numerous reversals from one testing to another -If many have over-learned it or not learned it at all, then there will be fewer reversals between testings -If teaching of the skill was ongoing or not ongoing, then the test may give the appearance of being unreliable PLAN: Use on groups who have reached similar levels of assurance

VALIDITY DIAGONALS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

In construct validity, interest is usually focussed on the values in the "c" cells or VALIDITY DIAGONALS. These are the correlations between different ways of assessing the same trait [monotrait, heteromethod correlations], e.g. having 2-3 ways of measuring ascendance

UNIFIED VIEW OF VALIDITY

In the early 1980's a unified view of test validity/validation emerged. Messick [1989] argued that it is not enough just to create a test; you have to use the results/scores for some interpretive purpose and rule out other interpretations to show that the chosen interpretation is the correct one, i.e. looking for a murderer and ruling out that anyone else could be the murderer. Score meanings must now include the value implications of score interpretation and the utility, relevance, and social consequences associated with test use [both actual and potential]. The appropriateness of a particular test use must be justified in light of all possible outcomes. SUCH JUSTIFICATION REQUIRES NOT ONLY THAT THE TEST SERVE ITS INTENDED PURPOSE BUT ALSO THAT THE VALUE OF DOING SO OUTWEIGHS THE IMPACT OF ANY ADVERSE SOCIAL CONSEQUENCES! The unified view looks at test validation as a "process" of gathering evidence to build the best possible case for the inferences we would like to make, in support of two types of inferences, and the social, personal and long-term effects of using such inferences: 1. Interpretive Inference: A statement of what test scores mean/our interpretation of the scores -Validation involves bringing together multiple lines of evidence to support certain interpretations of test scores, whilst showing that other interpretations are less plausible -Assigning meaning to test scores carries certain value implications i.e. an IQ test and its number 2. Action Inference: The appropriateness and utility of test scores as the basis for some specific actions such as applied decision making -Validation requires both evidence of score meaning and evidence of the appropriateness and usefulness of test scores for particular applied purposes.

CONTENT INSTRUCTIONAL OBJECTIVE

Includes 2 elements: 1. Information content of the domain -Exam Conditions 2. What individual should be able to do with it

ITEM INFORMATION FUNCTION

Is a curve on a graph that is plotted to describe the change in item information at different ability levels

STANDARD ERROR OF ESTIMATE [Criterion-Related Evidence of Validity/Empirical Validity]

Is an index of the error that may be made in forecasting performance on one measure from performance on another measure. It uses the correlation between the predictor test and some criterion to provide an estimate of how much a predicted score might be in error as an estimate of a person's actual performance on the criterion. [SEE PREDICTION INTERVAL] -Is similar to the STANDARD ERROR OF MEASUREMENT: The Standard Error of Measurement is an index of instability of performances on a single test. It uses the test reliability to provide a way of estimating how much an individual's score might change from one testing to another. Standard deviations in our measurements i.e. different measurement weights from a bathroom scale -Both the Standard Error of Measurement and the Standard Error of Estimate are important. But for criterion-related validity, the Standard Error of Measurement is a property that can be determined by the test publisher as characterizing the predictor test. The opposite is true for the Standard Error of Estimate. The Standard Error of Estimate is unique to each criterion measure and must therefore be determined locally -The standard deviation of criterion scores for people who all got the same score on the predictor variable -Arises because, when the relationship between a predictor and a criterion is less than perfect, there will be some variability in criterion performance for people who all have the same predictor score ******************* -Gives a measure of the standard distance between a regression line and the actual data points -Allows one to quantify prediction errors when working with regression lines. It is much like a standard deviation, and gives us a measure of the average deviation of the prediction errors about the regression line. -The square root of the average of the squared deviations about the regression line. -Affected by two variables, the standard deviation of the criterion and the criterion-related validity.
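A minimal sketch of the usual computing formula (the criterion SD and the validity coefficient are assumed to come from a local validation study; example values are illustrative):

def standard_error_of_estimate(sd_criterion, validity_r):
    """SEE = SD of the criterion * sqrt(1 - r**2)."""
    return sd_criterion * (1 - validity_r ** 2) ** 0.5

# e.g. criterion SD = 10 and validity r = .60 -> SEE = 10 * sqrt(.64) = 8.0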

Kuder-Richardson Formula 20 [KR-20]; Kuder-Richardson Formula 21 [KR-21]

Kuder-Richardson Formula 20 [KR-20]; -Named after the authors and the numbering they gave to their formulas. -A measure of internal consistency in which items are scored dichotomously [true/false; right/wrong], and given a score of either 1 or 0 -Applies best to homogeneous test items that measure the same ability or personality trait -Is an alternative to the split-half method for estimating the reliability of the full-length test. Kuder-Richardson Formula 21 [KR-21] -Same as KR-20 -Simpler formula, based on the assumption that all items are of the same difficulty -Seldom used, as rarely are all items definitively of the same difficulty level. Computers can easily compute alpha with an Excel spreadsheet. Many programs are available to do these types of computations CONS: 1. -Represents only one single point in time since only one test has been given -Does not reflect usual day-to-day variance in the individual -Evidence is limited to only that one "snapshot" in time 2. -Item sets may be more alike than parallel forms -Test items may be based on single groups of common reference material i.e. a student is expected to answer 'a-d' of a single passage or described science experiment. Students who comprehend the single material generally will do better on similar item sets than students who don't -The assumption of a homogeneous test measuring a single trait may be too restrictive. Often, one trait relies on/overlaps with another, just as one form of measurement relies on/overlaps with another. To write an English test on Shakespeare requires historical knowledge. -Relies on the test creator to use the appropriate method to measure variability. In the above example, it would be better to use a parallel exam and a split-half coefficient 3. -When a test is highly speeded ALL single-administration reliability coefficients become meaningless/unreliable/invalid -The test is measuring speed, rather than the given trait -The speed factor tends to inflate estimates of reliability based on internal consistency procedures -The amount of overestimation depends on the degree to which the test is speeded -Be prudent when using speeded tests to test ability 4. -The split-half method is an alternative to use with speeded tests; -The halves can be administered as two independently timed short tests -The correlation of the two halves gives a usable reliability coefficient
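A minimal Python sketch of both formulas (population variances are used, and the data layout, one row of 0/1 item scores per examinee, is my assumption):

def kr20(item_matrix):
    """KR-20 for dichotomous items. item_matrix: one list of 0/1 item scores per examinee."""
    n, k = len(item_matrix), len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n   # item difficulty (proportion correct)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

def kr21(k, mean_total, var_total):
    """KR-21: same idea, but assumes all items share the same difficulty."""
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))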

cumulative frequency distributions

Lists each score or interval and the number of scores falling in or below the score interval

VARIANCE COMPONENTS

Method used for criterion referenced tests to estimate reliability of degree of mastery measurement of the domains within the components. It allows us to estimate the effect of the various factors on the reliability of the components within the domain

Validity

NOTE: TEST VALIDITY It is not the existence of adverse impact that makes a proposed test use invalid; invalidity occurs when our scores misrepresent the construct being referenced [Does it measure what it is supposed to measure?] The degree to which test scores provide information that is relevant to the inferences that are to be made from them. -Validity is an essential quality for a test; reliability is a precondition for validity and it sets an upper limit to validity -Validation is always incomplete since we never have all the data -Validation is an ongoing process based on the most current data and interpretive or action inferences using evidence that is presently available -A central feature of the validation process is the justification for the inferences through the testing/re-testing of hypotheses -Guided by an organizing network of inter-related constructs that acts as the basis for, and explanation of, hypotheses -KEY CHARACTERISTIC: A combination of empirical and rational analysis should be used in all validation techniques -Any evidence, regardless of its source, that bears on the meaning of test scores contributes either positively or negatively to the validity of an inference -A thorough examination of construct validity should address both the internal structure of the test and the relationships between test scores and other external variables, as dictated by construct theory -For compelling support of score-based inferences, both corroborating evidence and the ruling out of plausible rival hypotheses should be applied

VALIDITY GENERALIZATION

Name given by Schmidt and Hunter to the application of meta-analysis -Refers to the extent to which a validity established in one setting with one sample can be generalized to another setting and sample. -Assumes that the results of criterion-related validity studies conducted in other companies can be generalized to the situation in your company. -The extent to which validity coefficients can be generalized across situations.

SUPPLY-RESPONSE ITEMS [Content-related Evidence of Validity]

One of the two categories [the other is SELECT RESPONSE ITEMS] that types of test items can be classified into. -Examinees produce their own answers; sometimes called SUPPLY RESPONSE/CONSTRUCTED-RESPONSE ITEMS

CONSTRUCT-RESPONSE ITEMS/ AKA :Produce-Response or Supply-Response Items [Content Related Evidence of Validity]

One of the two categories [the other is SELECT RESPONSE ITEMS] that types of test items can be classified into. -Examinees produce their own answers; sometimes called SUPPLY-RESPONSE/CONSTRUCTED-RESPONSE ITEMS CONS: -Many skills are too complex to be measured effectively with multiple choice questions; multiple choice cannot determine whether examinees can construct/organize/write an essay or identify/use the math protocols for completing a problem -Some testees cannot provide the correct response without seeing it -The limitation of multiple choice test questions has educational consequences; if used as the basis for important decisions about the effectiveness of schools, teachers have a strong incentive to emphasize the skills and knowledge tested by the questions on those tests i.e. TEACHING TO THE TEST!!! With a limited amount of time, they often have to give a lower priority to the kinds of skills that would be tested by constructed response questions SCORING [Analytical Rubric & Holistic] -Analytical Rubric: lists specific features of the response and the number of points to award for each feature -Holistic: Makes a single judgment of the quality of the response by assigning a numerical score that is based on comparing the response to actual rated samples of responses. Scores are assigned by how closely the response matches the exemplars of 1, 2, 3, 4, 5 point responses. Exemplars also include borderline cases i.e. a response that just barely qualifies as a 5 or a response that narrowly misses earning a 5

THEORETICAL RELIABILITY

Reliability estimates obtained under optimum testing conditions i.e. [having highly trained practitioners]

MESSICK'S EXPANDED THEORY OF VALIDITY [Unified View of Validity/Construct Validity as the Whole of Validity]

SEE PAGE 188 FOR LENGTHY DEFINITION "...as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores..."[4] Key to construct validity are the theoretical ideas behind the trait under consideration, i.e. the concepts that organize how aspects of personality, intelligence, etc. are viewed.[5] Paul Meehl states that "The best construct is the one around which we can build the greatest number of inferences, in the most direct fashion."[6] Samuel Messick, the most authoritative/cited reference on validity: 1. Established construct validity as the single most important conceptualization of validity 2. Expanded beyond issues of score meaning to include evidence of the relevance and utility of test scores, and the consequences and considerations of said use -"Consequences as part of construct validation" has yet to be embraced by all members of the testing community, but it is seen as a major milestone [benchmark] in the evolution of validity theory -Messick's model is unified, yet multi-faceted VALUE IMPLICATIONS OF TEST INTERPRETATION Values are intrinsic to validity and involve matching the behaviour to some larger schema to which value is already attached Values Impact Score Meaning on 3 Levels 1. The labels we give to constructs, which change our value of them i.e. test/assessment; flexibility/rigidity; 2. The theories that give constructs their meaning. A theory is broader than a construct and has its own values/assumptions. A construct gains meaning from its place within the given theory and the bent of the theorizer; hence competing theories of the same phenomenon. I.E. Aptitude test results static vs flexible; invalid vs valid [gender, race] 3. The ideologies reflected in those theories. Ideology refers to a "complex configuration of shared values, affects, and beliefs that provide an existential framework for interpreting the world". It is the ideology in which theories are embedded -Ideology impacts the questions we ask, seek, and make meaning of i.e. nature vs. nurture; interpretation based on commonalities vs individualities Problem with Ideology -The problem with values stemming explicitly from ideology is that these values are often tacit [implied]. Therefore, you have to know what to look for and uncover them before their impact on scores can be considered SOCIAL CONSEQUENCES OF TEST USE Consideration of both the actual and potential outcomes must be based in construct theory. Answers the question, "Does This Test Do What It Is Supposed To Do?" All of its effects, intended and unintended, positive and negative, determine whether or not a proposed test use is justified. This is especially true if adverse consequences have a basis in test invalidity -Potential: The reasonable anticipation of results, both desired and unintended, just as with hypothesizing about relationships between test scores and criteria -Actual: Provide empirical evidence that contributes directly to score meaning through the confirmation or refutation of expectations NOTE: TEST VALIDITY It is not the existence of adverse impact that makes a proposed test use invalid; invalidity occurs when our scores misrepresent the construct being referenced

MESSICK VS: SHEPARD'S; CRONBACH MODELS OF CONSTRUCT VALIDITY

SHEPARD'S main concern revolves around how issues are communicated. 1. Messick's writing is too lengthy and too difficult to wade through 2. Messick's interpretation of the unified model could be undermined because he deconstructed the model into 4 parts, which may cause users to focus on a part rather than the whole CRONBACH notes that validation occurs in a very public arena. The public is interested in it and debates it. Therefore, validators should treat validation as a debate, examining all sides and arguing effectively from all sides. Public acceptance/rejection of a proposed test use happens when the community is persuaded by an argument that closely aligns with the prevailing community belief system

CONSTRUCT UNDER-REPRESENTATION [Unified View of Validity]

Scope of the test is too narrow and it fails to include important aspects of the construct ******************************************** -Construct underrepresentation: Ex: being shy... How do you feel being shy? How do you feel when people want to include you? FORGETS to ask about behaviours ie Do you enjoy being on stage? Do you like working in groups? -The degree to which the operational definition fails to capture important aspects of the construct. ex) an increase in stock value may be relevant to the construct but it may not present a complete measure of the corporation's performance -A threat to construct validity, characterized by measures that do not fully define the variable or construct of interest

GENERALIZATION TEST METHODS

See Page 194 The notion of generalizability encompasses both reliability and validity -The two concepts differ only in the breadth of the domain to which generalization is undertaken 1. Test 2. Alternate form of the test 3. Other tests of the same trait/construct 4. Non-test appraisals of the same trait/construct

RELIABILITY & VALIDITY OVERLAP

See Page 194 The notion of generalizability encompasses both reliability and validity -The two concepts differ only in the breadth of the domain to which generalization is undertaken 1. Test 2. Alternate form of the test 3. Other tests of the same trait/construct 4. Non-test appraisals of the same trait/construct

CONTENT-RELATED VALIDITY: TEST BLUEPRINT AKA: TABLE OF SPECIFICATIONS

TEST BLUEPRINT • Test Blueprint [Table of Specification]: is an explicit plan that guides test construction. Standardized achievement tests apply the same procedures, but to more broadly specified content/curricula. Basic Components are: o Specification of cognitive processes o Description of content to be covered by the test Specification and Description need to be matched to show which process relates to each segment of content and to provide a framework for the development of the test o Method[s] to be used in evaluating student progress toward achieving each objective ISSUES TO GUIDE MAKING BLUEPRINT 1. What emphasis should each of the content areas and cognitive processes receive on the test? o # of questions=# hrs spent teaching the concept and/or as prescribed by course objectives 2. What type[s] of items should be included on the test? o Supply-Response/ Produce-Response/ Constructed-Response [Examinees provide own response] 3. Length of test? How many items in total? How many items per section/cell? o Essay type; only time for a few questions o Length/Complexity of items: The more elaborate the answers needed, the fewer questions Amount of computation and/or quantitative thinking needed o Time given for testing. Most achievement tests should be "power" not "speed" tests. There should be enough time for at least 80% of the students to attempt to answer every question o Age/educational level of testee[s] o Ability level[s] of testee[s] o Typical adult: 30-45 secs for factual multiple choice and/or T/F and 75-100 secs for fairly complex multiple choice and/or T/F o If rushed, errors occur due to lack of time to properly read/analyze, thereby reducing reliability/validity o If the total # of items exceeds what the testing time allows, then break into two or more subtests to be given on successive days. This is what standardized tests do 4. Difficulty level of items o If essay; every testee can answer but at varying levels of completeness o If objective items; the most commonly used index of difficulty is found by dividing the number of students getting the item correct by the total number of students attempting the item. Assume all students will attempt items. The higher the index, the easier the item i.e. a difficulty of .75 means 75% of the class got the item right o Average difficulty and spread of item difficulty differ from norm-referenced to criterion-referenced tests; the latter is usually easier Diagnostic tests to isolate students having difficulty; expect a large number of perfect or near-perfect marks and minimal variability VS Criterion-Referenced / Diagnostic / Pre-Tests [given prior to teaching]; expect to get zero or near-zero scores When the purpose is to discriminate levels of achievement, then the test needs to yield a spread of scores or increased variability, and we don't want items that everyone gets correct, nor do we want items that everyone gets wrong. Neither allows us to discriminate levels of achievement Test difficulty should be half-way between 100% and the # of items a student could get right by guessing, because guessing adds chance variability to test scores; the probability of getting an item right by chance is 1 divided by the number of answer choices. If there are 5 answer choices, chance variability is 1 divided by 5 = .20/20% o Test difficulty is half-way between chance variability and a perfect score i.e. if chance variability is 20% or .20, then halfway between 20 and 100 is 60%/.60. Therefore it is better to have a slightly easier rather than a harder test. The probability of guessing the right answer to a supply-response item is assumed to be zero.
Preferred Level of Difficulty for: • Multiple choice with four options; .65-.70 because the probability of getting an answer right by chance is 1 divided by the # of answer choices • T/F .75-.80 o How To Find Valid Goals of Instruction [widely used textbooks, recent district courses, reports from special groups that appear in yearbooks of educational societies, groups of teachers who give instruction on topics, specialists in higher levels/ministry dept. ]
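A tiny Python sketch of the "half-way between chance and perfect" rule from the blueprint notes above (for supply-response items chance is taken as zero, so the rule is not applied; example values are illustrative):

def preferred_difficulty(n_options):
    """Target item difficulty: half-way between the chance level (1 / number of options) and 1.0."""
    chance = 1.0 / n_options
    return (chance + 1.0) / 2.0

# 5 options -> .60; true/false -> .75; 4 options -> .625 (the notes above suggest .65-.70)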

MONOTRAIT MONOMETHOD CORRELATIONS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

THE RELIABILITY DIAGONALS (monotrait-monomethod) Estimates of the reliability of each measure in the matrix. You can estimate reliabilities a number of different ways (e.g., test-retest, internal consistency). There are as many correlations in the reliability diagonal as there are measures -- in this example there are nine measures and nine reliabilities. The first reliability in the example is the correlation of Trait A, Method 1 with Trait A, Method 1 (hereafter, I'll abbreviate this relationship A1-A1). Notice that this is essentially the correlation of the measure with itself. In fact such a correlation would always be perfect (i.e., r=1.0). Instead, we substitute an estimate of reliability. You could also consider these values to be monotrait-monomethod correlations.

CONSTRUCT VALIDITY [Construct-Related Evidence of Validity]

Term that refers to something that is not observable but is literally 'constructed' by the investigator to summarize or account for the regularities or relationships in observed behaviour. Thus, most names of traits refer to constructs. -Doesn't answer questions that predict future performance or how it represents curriculum. Instead, it helps answer questions: -What does the score on this test tell us about the individual? -Does it correspond to some meaningful trait or construct that will help us understand the individual? EVIDENCE SUPPORTING SCORE-BASED INFERENCES -Sliding scale of validity; not absolute [either valid or not valid]; based on levels of evidence to build a case for support of inferences -Rational & Empirical evidence needed for an effective argument to support the construct validity of a score-based inference -Justification of inferences requires both reasoning in light of the construct being measured and the presence of relevant data -Assertions/hypotheses must be tested against reality -Data is meaningless unless interpreted within a theoretical context. The data "does not speak for itself"; it must be interpreted -Validation is always incomplete since we never have all the data -Validation is an ongoing process based on the most current data and interpretive or action inferences using evidence that is presently available -A central feature of the validation process is the justification for the inferences through the testing/re-testing of hypotheses -Guided by an organizing network of inter-related constructs that acts as the basis for, and explanation of, hypotheses -A combination of empirical and rational analysis should be used in all validation techniques THREATS TO CONSTRUCT VALIDITY -Test scores and the imprecision resulting from random errors of measurement -Test scores are inadequate measures of the constructs they are to assess in TWO WAYS, similar to the issues with Content Validity: 1. Construct Underrepresentation [the scope of the test is too narrow and it fails to include important aspects of the construct] 2. Construct-irrelevant Test Variance [the presence of reliable variance that is extraneous to the construct being quantified] CENTRALIZED vs. 
TRADITIONAL EVIDENCE Construct validity is seen as central to, and the basis of, all measurements of constructs, when compared to traditional evidence [content and criterion related evidence] -The traditional, narrower view of construct validity involves providing evidence to support assertions that test scores or criterion measures actually tap into the underlying constructs they claim to represent -Used together [content/criterion] they provide a stronger evidence base for the trait being measured than when used alone Construct -Construct validity uses all available lines of evidence to support score meaning -With construct validation, the meaning of the measure must be invoked every time to substantiate an interpretation of a test score or justify the use of the test score for an applied purpose -The construct simultaneously impacts score meaning by indicating what sorts of characteristics or behaviours are being sampled by the test, while it helps us judge the relevance and representativeness of test content to the desired interpretive or action inferences -Compared to criterion-empirical evidence, construct theory provides a rational explanation for any observed test-criterion relationships, which, in turn, helps to justify test use Content -Content evidence impacts score meaning by indicating what sorts of characteristics or behaviours are being sampled by the test Criterion-Empirical -Empirical test-criterion relationships provide evidence for the construct validity of inferences regarding theoretical relationships between the traits or behavioural domains those measures are presumed to reflect -Test scores and criterion measures are indicators of underlying traits or behavioural domains

HETEROTRAIT, HETEROMETHOD CORRELATIONS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

The "d" cells. These are the correlations between different traits measured by different methods [heterotrait, heteromethod correlations]. They represent the purest index of the association between the different traits, and they should be the smallest values in the matrix because they include only trait correlations.

STANDARD ERROR OF MEASUREMENT [SEM]

The Standard Error of Measurement is an index of instability of performances on a single test. It uses the test reliability to provide a way of estimating how much an individual's score might change from one testing to another. Standard deviations in our measurements i.e. different measurement weights from a bathroom scale -Is similar to the STANDARD ERROR OF ESTIMATE: an index of the error that may be made in forecasting performance on one measure from performance on another measure. It uses the correlation between the predictor test and some criterion to provide an estimate of how much a predicted score might be in error as an estimate of a person's actual performance on the criterion. -Both the Standard Error of Measurement and the Standard Error of Estimate are important. But for criterion-related validity, the Standard Error of Measurement is a property that can be determined by the test publisher as characterizing the predictor test. The opposite is true for the Standard Error of Estimate. The Standard Error of Estimate is unique to each criterion measure and must therefore be determined locally. ******************************************** -It is the standardized rate of error that one can expect when measuring a given trait -Relationship between the reliability coefficient and the standard error of measurement [SEM can be estimated from the reliability coefficient] -Standard deviation that would be obtained for a series of measurements of the same individual using the same test; it is assumed that the individual is unchanged by using the same test -The SEM derived from the appropriate design is the best index of the consistency we can expect to find in individual scores
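A minimal sketch of how the SEM is usually estimated from the reliability coefficient (example values are illustrative):

def standard_error_of_measurement(sd, reliability):
    """SEM = test SD * sqrt(1 - reliability coefficient)."""
    return sd * (1 - reliability) ** 0.5

# e.g. SD = 15 and reliability = .91 -> SEM = 15 * 0.3 = 4.5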

MONOTRAIT, HETEROMETHOD CORRELATIONS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

The Validity Diagonals (monotrait-heteromethod) Correlations between measures of the same trait measured using different methods. Since the MTMM is organized into method blocks, there is one validity diagonal in each method block. For example, look at the A1-A2 correlation of .57. This is the correlation between two measures of the same trait (A) measured with two different methods (1 and 2). Because the two measures are of the same trait or concept, we would expect them to be strongly correlated. You could also consider these values to be monotrait-heteromethod correlations.

ITEM INFORMATION

The amount of information that an item question provides -Function of the "change in probability of a correct response" -The more rapid the change in probability of a correct response the [a steeper slope], the greater the information provided by the item

RELIABILITY DIAGONALS [Construct-Related Evidence of Validity/Multitrait, Multimethod Analysis (MTMM)]

The cells of the matrix that contain the letter "a". They represent the correlation between two assessments of the same trait using the same measurement method [reliability is characterized as being the correlation of a test with itself]. -The values should be as high as possible and they should be the highest values in the matrix -In the parlance[jargon] of MTMM analyses, these are monotrait, monomethod correlations

EMPIRICAL [STATISTICAL VALIDITY]

The collection of empirical and/or statistical evidence to use as an evaluation of a test as a predictor -Can be estimated by determining the correlation between test scores and the 'suitable' criterion measure of success for the job or classroom -This predictor uses criterion and criterion-related validity to establish what is commonly known as "PREDICTIVE VALIDITY" CRITERION PROBLEM: -Which criterion measure is best, valid as a predictor of future performance? -Is in the portion of the definition of empirical validity when it refers to 'suitable criterion measure'; which is the most difficult to determine: -some are difficult to measure/quantify i.e. relationship with parents, clients -influenced by factors outside control of individual; student home-life, health, nutrition -effectiveness of equipment, technology or lack there-of -All criterion measures are partial in the sense that they measure only a part of success on the job or in academics SOLUTIONS: -Subjective: Rating scales [can be erratic; dependent on person giving rating] -Tests of Proficiency i.e. university entrance english/math test used to validate/predict performance on comprehensive university english/math exam -Using average of grades in some educational or training program i.e. tests for selections of engineers, lawyers, welders based on educational or training program scores QUALITIES IN CRITERION MEASURE [See Criterion]

PREDICTIVE VALIDITY

The collection of empirical and/or statistical evidence to use as an evaluation of a test as a predictor -Can be estimated by determining the correlation between test scores and the 'suitable' criterion measure of success for the job or classroom -This predictor uses criterion and criterion-related validity to establish what is commonly known as "PREDICTIVE VALIDITY" -When the test info is used to predict future performance based on a given criterion -The success with which a test predicts the behavior it is designed to predict; it is assessed by computing the correlation between test scores and the criterion behavior. -Predictive and concurrent validity have more to do with the purpose, rather than the time relationship of the study -Predictive validity is the more wide-spread concern

METHOD COVARIANCE [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

The extent to which the values in the "b" cells are larger than those in the "d" cells. There are consistent individual differences in scores that result not only from status on the trait, but also from the way the trait is assessed. -The presence of method covariance reduces the construct validity of the measure because scores on the instrument include something else in addition to the person's status on the trait of interest.

DISCRIMINANT VALIDITY [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

The extent to which measurements are free of method variance and are pure measures of discrete traits ******************************************* -Extent to which scores obtained from one procedure are not correlated with scores from another procedure that measures OTHER variables or constructs. -When a test has a LOW correlation with another test that measures a different construct i.e. A test of mechanical ability should not have a high correlation with a test of reading ability. -A method of construct validation that reflects the degree to which an instrument can distinguish between or among different phenomena or characteristics

Internal consistency reliability

The homogeneity of an instrument; the extent to which all of its subparts are measuring the same characteristic. -Procedures to assess this include: -split-half techniques -Cronbach's alpha [also called coefficient alpha]
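A minimal sketch with a made-up examinee-by-item score matrix, showing both procedures named above: split-half reliability (odd vs. even items, with the Spearman-Brown correction) and coefficient (Cronbach's) alpha.

```python
# Minimal sketch with made-up 0/1 item scores: rows = examinees, columns = items.
import numpy as np

X = np.array([[1, 1, 0, 1, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 0, 1, 0, 0],
              [1, 1, 1, 0, 1, 1]], dtype=float)

# Split-half: correlate odd-item and even-item totals, then step up (Spearman-Brown).
odd, even = X[:, ::2].sum(axis=1), X[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
k = X.shape[1]
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))
print(round(split_half, 2), round(alpha, 2))
```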

CONSTRUCT-IRRELEVANT TEST VARIANCE [Unified View of Validity]

The presence of variance that is extraneous to the construct being quantified. This and construct under-representation are similar to the issues raised in content validity
TWO MAIN TYPES
1. Construct-irrelevant Test Difficulty [Test Invalidation]
-The presence of factors unrelated to the construct being assessed that make the test more difficult for some individuals or groups
-I.E. test anxiety can depress a test score relative to the same student's anxiety/score on homework or assignments. The result is scores that are invalidly lower than actual ability or standing on the construct
2. Construct-irrelevant Test Easiness [Test Contamination]
-When something about the test cues examinees to respond correctly in ways that are unrelated to the construct [so they score higher on the construct than they otherwise would]
-'Testwiseness' is the name given to this [familiarity with item types, effective time-management strategies, recognition of patterns among correct answers, etc.]
Solution: Use multiple indicators of each construct

BASE RATES & PREDICTION [Use of Predictor Test]

The proportion of a group of applicants who would have succeeded if all were admitted is the 'BASE RATE OF SUCCESS'
-If being in the top half of the criterion is considered success, the base rate for success is .50; if it is the top 25%, the base rate is .25
-As the base rate departs from 50% [either higher or lower], the potential 'HIT RATE' for the selection procedure diminishes
-HIT RATE: The proportion of correct decisions [sum of the proportions of correct acceptances and correct rejections] that result from a selection strategy
-Each 'CUTTING SCORE' has a 'HIT RATE' that depends on the 'BASE RATE of SUCCESS' in the population and the correlation between the predictor and the criterion
-The selection rule used with the predictor of success to determine the cut-off point, or lowest predictor score that leads to accepting the applicant, is called the "CUTTING SCORE"
-Used as a baseline to compare the prevalence of success in the general population, without any form of selection criteria, to that of a select group of applicants for a finite # of positions
-percent who are performing satisfactorily without use of the proposed predictor
-ranges from 0 to 1.0
-moderate base rates (.50) are associated with the greatest incremental validity
PREDICTION: The overall gain from using the predictor is greatest when the rate of successful criterion performance in the population is close to 50%, and falls off markedly as the base rate departs from 50%
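A minimal sketch with made-up predictor scores and success outcomes, showing how applying a cutting score to the predictor yields a hit rate (correct acceptances plus correct rejections) against the group's base rate of success.

```python
# Minimal sketch with made-up data: compute base rate and hit rate for one cutting score.
import numpy as np

predictor = np.array([40, 55, 62, 48, 70, 35, 66, 58, 44, 73])
success   = np.array([ 0,  1,  1,  0,  1,  0,  0,  1,  1,  1])   # 1 = succeeded on the criterion

cutting_score = 50
accepted = predictor >= cutting_score

true_pos = np.mean(accepted & (success == 1))    # accepted and succeeded
true_neg = np.mean(~accepted & (success == 0))   # rejected and would have failed
hit_rate = true_pos + true_neg
print(f"base rate = {success.mean():.2f}, hit rate = {hit_rate:.2f}")
```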

HIT RATE [SEE CUTTING SCORE]/Criterion Related Evidence of Validity/Interpretation of Validity Coefficients

The proportion of correct decisions that result from a given selection strategy [the proportion of correctly identified successes, plus the proportion of correctly rejected failures] -The proportion of people who are accurately identified as possessing or not possessing a particular trait, behaviour, characteristic, or attribute based on test scores -As the base rate departs from 50% [either higher or lower], the potential 'HIT RATE' for the selection procedure diminishes -Each 'CUTTING SCORE' has a 'HIT RATE' that depends on the 'BASE RATE of SUCCESS' in the population and the correlation between the predictor and the criterion

Observed score or measurement

The true score + some error of measurement [X = T + E]
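A minimal simulation sketch of this identity with made-up distributions: observed = true + error, and when the error is independent of the true score, reliability is the ratio of true-score variance to observed-score variance.

```python
# Minimal sketch: simulate X = T + E and recover reliability = var(T) / var(X).
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=5000)      # T
error       = rng.normal(0, 5, size=5000)         # E, independent of T
observed    = true_scores + error                 # X = T + E

reliability = true_scores.var() / observed.var()  # approx. 15**2 / (15**2 + 5**2) = .90
print(round(reliability, 2))
```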

CUTTING SCORE [SEE HIT RATE]/Criterion Related Evidence of Validity/Interpretation of Validity Coefficients

The selection rule used with the predictor of success to determine the cut-off point, or lowest predictor score that leads to accepting the applicant, is called the "CUTTING SCORE"
-As the base rate departs from 50% [either higher or lower], the potential 'HIT RATE' for the selection procedure diminishes
-HIT RATE: The proportion of correct decisions [sum of the proportions of correct acceptances and correct rejections] that result from a selection strategy
-Each 'CUTTING SCORE' has a 'HIT RATE' that depends on the 'BASE RATE of SUCCESS' in the population and the correlation between the predictor and the criterion
-In discriminant analysis: the criterion against which each individual's discriminant Z score is compared to determine predicted group membership. When the analysis involves two groups, group prediction is determined by computing a single cutting score; entities with discriminant Z scores below this score are assigned to one group, whereas those with scores above it are classified in the other group. For three or more groups, multiple discriminant functions are used, with a different cutting score for each function
-The score marking the point of decision when dichotomizing individuals

PRACTICAL RELIABILITY

The reliability of a test in "real life" or "practical application". When the test is used in practice, the real or practical reliability of the scores will be lower than the published estimate. Consequently, the standard error of measurement will be higher when a test like the WIAT is used in the field than the value reported by the publishers.
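A minimal sketch of the standard error of measurement formula, SEM = SD × sqrt(1 − reliability), with illustrative numbers showing how the SEM grows when the reliability realized in the field is lower than the published value.

```python
# Minimal sketch with illustrative numbers: SEM = SD * sqrt(1 - reliability).
import math

sd = 15
print(round(sd * math.sqrt(1 - 0.95), 1))   # published reliability .95 -> SEM ~ 3.4
print(round(sd * math.sqrt(1 - 0.85), 1))   # lower practical reliability .85 -> SEM ~ 5.8
```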

PREDICTIONS ABOUT GROUP DIFFERENCES [Construct-Related Evidence of Validity]

Theories about groups are often based on general knowledge about the traits attributed to that group, i.e. math teachers should be good at logical-mathematical reasoning. The tester applies a test to the group to confirm the predicted difference in the trait. -A failed prediction may mean the test is an invalid measure of the trait or characteristic, OR that the world is not consistent with the theory that gave rise to the prediction i.e. stereotypes -Failure does not tell us which of these conditions exists

RELIABILITY IN DIFFERENCE OF SCORES

This deals with the relationship between two scores for an individual [at different times, or on different aptitudes] in order to compare gains or profiles -The reliability of the difference score is lower than the reliability of either of the two tests from which it is computed -This lower reliability becomes a problem whenever we wish to use patterns of test scores for diagnosis
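A minimal sketch of the standard formula for the reliability of a difference score, r_dd = ((r_xx + r_yy)/2 − r_xy) / (1 − r_xy), with illustrative values showing why it is lower than either test's own reliability.

```python
# Minimal sketch with illustrative values of the difference-score reliability formula.
r_xx, r_yy = 0.90, 0.90     # reliabilities of the two tests
r_xy = 0.60                 # correlation between the two tests

r_diff = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)
print(r_diff)               # 0.75 -- lower than either test's own reliability of .90
```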

FACTOR ANALYSIS [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

A means of studying the constructs that underlie performance on a test by jointly studying the intercorrelations of that test and a number of other tests. The patterning of these correlations makes it possible to see which tests are measuring some common dimension or factor. An examination of the tests clustering in a single factor may clarify the nature and meaning of the factor and of the tests that measure it.
-Needs evidence of a relationship to life events outside the tests if the factors are to have much substance, vitality, and scientific or educational utility
-May also need to predict low or no correlations with measures of other constructs/attributes so that the factor can be differentiated from what other tests measure i.e. a test measuring mathematical ability should not turn into a reading test
-High correlations between measures of supposedly unrelated constructs are evidence of a lack of validity of the instruments as measures of the constructs, of the validity of the constructs as separate dimensions of human functioning, or of both
*******************************************
-Enables researchers to identify clusters of test items that measure a common ability
-Any of several methods for reducing correlational data to a smaller number of dimensions or factors
-A statistical procedure that identifies clusters of related items (called factors) on a test; used to identify different dimensions of performance that underlie one's total score
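A minimal sketch with simulated scores on six hypothetical tests driven by two underlying abilities; scikit-learn's FactorAnalysis is used here as one convenient implementation (the library choice and the data are assumptions, not something specified by the text). The loadings show the verbal-like and quantitative-like tests clustering on separate factors.

```python
# Minimal sketch: six made-up tests generated from two underlying abilities,
# then factor-analyzed to recover the two clusters.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
verbal = rng.normal(size=500)
quant  = rng.normal(size=500)
tests = np.column_stack([
    verbal + rng.normal(scale=0.5, size=500),   # vocabulary
    verbal + rng.normal(scale=0.5, size=500),   # reading comprehension
    verbal + rng.normal(scale=0.5, size=500),   # analogies
    quant  + rng.normal(scale=0.5, size=500),   # arithmetic
    quant  + rng.normal(scale=0.5, size=500),   # algebra
    quant  + rng.normal(scale=0.5, size=500),   # geometry
])

fa = FactorAnalysis(n_components=2).fit(tests)
print(np.round(fa.components_, 2))   # rows = factors, columns = tests (loadings)
```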

Speed Test

Timed test that is highly speeded -Score differences depend on the number of items attempted -More able test-takers will complete more items; but everyone who attempts each item should get it right -Opposite to a Power Test -Most tests are a combination of speed and difficulty

CONVERGENT VALIDITY [Construct-Related Evidence of Validity] PREDICTIONS ABOUT CORRELATIONS

To the extent that different ways of measuring the same trait yield high correlations, the construct and its measures are said to demonstrate CONVERGENT VALIDITY.
********************************************
It is easiest to think about convergent and discriminant validity as two interlocking propositions. In simple terms:
CONVERGENT: measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs)
DISCRIMINANT: measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs)

PREDICTIONS ABOUT RESPONSES TO EXPERIMENTAL TREATMENTS OR INTERVENTIONS [Construct-Related Evidence of Validity]

Traits can be modified as a result of experimental treatments or interventions i.e. measuring stress related situations such as going to the dentist. For any test that presumes to measure a trait or quality, a network of theory leading to definite predictions can be formulated. -If the predictions are confirmed, the validity of the test as a measure of the trait or construct is supported. -If the predictions fail to be verified, the validity of the test, its theory or both will be doubted

Power Test

Untimed test with increasing difficulty -Can take as much time as needed -More able test-takers will be able to complete more levels of difficulty -Most tests are a combination of speed and difficulty

VALIDATION AS A SCIENTIFIC ENTERPRISE [Unified View of Validity]

Validation methods are scrutinized and peer-reviewed/tested as the science of validation. Hypotheses are stated so that they can be refuted; confirming evidence must be weighed against evidence to the contrary that could challenge the hypotheses. Multiple explanations can be given for just about any phenomenon, so to justify desired score-based inferences, competing theories must be explored and ruled out. Validation is also theory testing, and the identification and testing of counter-hypotheses is an efficient way of exposing any vulnerabilities in a given construct theory. A good outcome supports the theory as valid, and vice versa.
VALIDATION PROCESS:
1. State the desired inferences or hypotheses [the types of interpretations and uses of test scores we would like to make]
2. Determine the kinds of evidence to be accepted as confirming/refuting the plausibility of the given inferences
3. Determine the reasonableness of the desired inferences in light of the accumulated evidence
-Science is no longer "value-neutral". Any scientific enterprise that occurs in a political setting is, by definition, "APPLIED SCIENCE"
-Recognizing and taking into account the factors that have helped shape our positions may actually strengthen the case in support of a particular inference

IMPLICATIONS OF TEST USAGE [SOCIAL/VALUE BASED]/CONSTRUCT VALIDITY/ MESSICK

Values Impact Score Meaning on 3 Levels
1. The labels we give to constructs, which change how we value them i.e. test/assessment; flexibility/rigidity
2. The theories that give constructs their meaning. A theory is broader than a construct and has its own values/assumptions. A construct gains meaning from its place within the given theory and the bent of the theorizer; hence competing theories of the same phenomenon. I.E. aptitude test results static vs flexible; invalid vs valid [gender, race]
3. The ideologies reflected in those theories. Ideology refers to a "complex configuration of shared values, affects, and beliefs that provide an existential framework for interpreting the world". It is the ideology in which theories are embedded
-Ideology impacts the questions we ask, the answers we seek, and the meaning we assign i.e. nature vs. nurture; interpretation based on commonalities vs individualities
Problem with Ideology
-Values stemming from ideology are often tacit [implied]. Therefore, you have to know what to look for, and uncover them, before their impact on score meaning can be considered
SOCIAL CONSEQUENCES OF TEST USE
Consideration of both the actual and potential outcomes must be based in construct theory. Answers the question, "Does this test do what it is supposed to do?" All of its effects, intended and unintended, positive and negative, determine whether or not a proposed test use is justified. This is especially true if adverse consequences have a basis in test invalidity.
-Judgements of the functional worth of test scores are often used as a basis for social action, such as decisions about the allocation of limited resources
-Not just the use of the test, but the selection process [how the test was chosen] must be justified
-SOLUTION: Apply a different decision rule informed by a different set of test scores, or even no scores at all; is the allocation fairer, or has the perceived unfairness simply been shifted to another group i.e. NIMBY effect; same problem, different place
Potential consequences: The reasonable anticipation of results, both desired and unintended, just as one hypothesizes about relationships between test scores and criteria
Actual consequences: Provide empirical evidence that contributes directly to score meaning through the confirmation or refutation of expectations
NOTE: TEST VALIDITY - It is not the existence of adverse impact that makes a proposed test use invalid; invalidity occurs when scores misrepresent the construct being referenced
FOUR COMMON VALUE-BASED USES OF TEST SCORES FOR ACADEMIC DECISIONS [cannot serve all values simultaneously; must decide which is most important given the context]:
1. Selection according to ability
2. Effort
3. Accomplishments
4. Need

SELF-REFERENCING

A way of measuring student growth against the individual's own prior performance and NOT against others. Most commonly used in IEPs and with students with disabilities.

PREDICTION INTERVAL [VALIDITY]

We may use the standard error of estimate to place a band of uncertainty, or a [PREDICTION INTERVAL], around a predicted score, in the same way as we would use the STANDARD ERROR OF MEASUREMENT around an obtained score.
********************************************
-An interval estimate of plausible values for a single observation of Y at a specified value of X
-A measure of the scatter of observations about the regression line. A 95% prediction band indicates that, in general, 95% of the points will be contained within the bands.
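A minimal sketch of the standard error of estimate, SEE = SD_y × sqrt(1 − r²), and the approximate 95% band of uncertainty it places around a predicted criterion score. The numbers are illustrative.

```python
# Minimal sketch with illustrative values: SEE and an approximate 95% prediction band.
import math

sd_y = 10        # SD of the criterion
r = 0.60         # validity coefficient (test-criterion correlation)
predicted = 75   # predicted criterion score for one examinee

see = sd_y * math.sqrt(1 - r**2)
print(f"SEE = {see:.1f}")
print(f"~95% band: {predicted - 1.96*see:.1f} to {predicted + 1.96*see:.1f}")
```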

VALIDITY THEORY & TEST BIAS

What is evaluated as being "biased" or "unbiased" is an inference based on a test score, NOT THE TEST [or an item] itself. An INFERENCE is judged to be valid when there is sufficient rational and empirical evidence supporting it and evidence supporting conflicting inferences is lacking [SEE INFERENCE] -If test scores fail to reflect the intended construct in the same way across groups/situations, score interpretation bias occurs -When test scores differentially predict criterion performance for different groups, the use of those scores in decision making is biased
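A minimal sketch of checking for differential prediction: fit the test-criterion regression separately for two groups and compare the lines. The data are made up, and np.polyfit stands in for a proper regression analysis with formal tests of slope/intercept differences.

```python
# Minimal sketch with made-up data: separate regression lines per group.
import numpy as np

def fit_line(x, y):
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line y = slope*x + intercept
    return round(slope, 3), round(intercept, 3)

xa = np.array([40, 50, 60, 70, 80]); ya = np.array([2.0, 2.4, 2.9, 3.3, 3.8])   # group A
xb = np.array([40, 50, 60, 70, 80]); yb = np.array([2.3, 2.7, 3.1, 3.6, 4.0])   # group B

print("group A slope/intercept:", fit_line(xa, ya))
print("group B slope/intercept:", fit_line(xb, yb))
# Clearly different regression lines would signal that using one common
# prediction equation for both groups biases the decisions made from the scores.
```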

CONCURRENT VALIDITY

When scores on the test and the criterion are obtained at essentially the same time. Used when we want to substitute the scores on one test for those on another, often because the latter is too expensive, time-consuming, or elaborate to administer i.e. using a group test of intelligence instead of an individually administered one. We want to know whether the scores obtained concurrently are valid. -Simply, it reflects the degree of correlation between a given aptitude test and current standing on the construct the test is meant to indicate. For example, if students score high on an aptitude test for cooking, chances are they will also score high on completion of a cooking course, and vice versa

Reliability

[Does it consistently give the same measurement?] The accuracy or precision of a given measurement tool or procedure -Test-Retest -Parallel Form -Single Administration Test -The intended use of the scores affects the type of reliability needed: -If assessing current levels of performance, use a reliability coefficient that reflects the internal consistency of the test -By contrast, if the scores are used as predictors of future performance, then a reliability index that shows the stability of those scores over an equivalent time span is needed

CRITERION-RELATED VALIDITY

[SEE ALSO CONTENT-RELATED VALIDITY; CRITERION-REFERENCED TESTS] The correlation between the test scores and the criterion. The higher the correlation, the more effective the test is as a predictor and the higher the criterion-related validity.
Procedure: Give a test of a specific skill to the examinee, follow up after training in the skill, and measure success on a given criterion.
Two types of criterion-related validity: predictive validity and concurrent validity. The difference lies in the purpose and in the time span between collecting the test information and the criterion measure.
LIMITATIONS OF CRITERION-RELATED VALIDITY: Criterion-related validity is most important for a test that is to be used to predict outcomes that are represented by clear-cut criterion measures.
1. The main limitation to using criterion-related validity in the prediction context usually lies in the limited adequacy of the available criterion measures
2. The more readily we can identify a performance criterion that unquestionably represents the results we are interested in, the more prepared we will be to rely on the evidence from correlations between a test and measures of that criterion to guide our decision on whether to use the test scores
SEE: 1. FACE VALIDITY 2. EMPIRICAL VALIDITY -Empirical/Statistical Validity -Criterion -Criterion-Related Validity -Predictive Validity -Concurrent Validity -The Problem of the Criterion -Qualities Desired in a Criterion Measure -The Practice of Prediction -Interpretation of Validity Coefficients -[Selection Ratio/Base Rates and Prediction] -Standard Error of Estimate

stereotype threat

a self-confirming concern that one will be evaluated based on a negative stereotype

heritability estimate

a statistical estimate of the degree of inheritance of a given trait or behavior

divergent thinking

an aspect of creativity characterized by an ability to produce unusual but appropriate responses to problems

intelligence quotient

defined originally as the ratio of mental age to chronological age multiplied by 100
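A one-line worked example of the original ratio IQ formula (the ages are illustrative).

```python
# Minimal sketch: ratio IQ = (mental age / chronological age) * 100.
mental_age, chronological_age = 12, 10
print(100 * mental_age / chronological_age)   # 120
```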

internal consistency

the degree to which a test yields similar scores across its different parts, such as odd versus even items

crystallized intelligence

the facet of intelligence involving the knowledge a person has already acquired and the ability to access that knowledge

psychometrics

the field of psychology that specializes in mental testing

psychological assessment

the use of specified procedures to evaluate the abilities, behaviors, and personal qualities of people

