Assessment Test 2
Multiple cutoff method
-faster than multiple regression, but this raises questions of validity: it is quicker, yet it may be less accurate because a single score that does not meet its cutoff eliminates the case entirely and it is thrown out
Reliability and validity
-a test that is reliable is not necessarily valid, BUT a test that is valid must be reliable
*reliability is a necessary, but not sufficient, condition in the validation process
-a test has to be reliable to be valid, but it doesn't have to be valid to be reliable
Controversies in Assessment: "Teaching to the Test" Inflates Scores
-"Teaching to the test" means that the focus of instruction becomes so prescribed that only content that is sure to appear on an exam is addressed in instruction. If this occurs, test scores should rise. -Whether test scores are inflated in this instance is a matter of content mastery. -Test publishers, state education departments, and local educators must work collaboratively to develop test items that adequately sample the broad content domain and standards.
The Classical Model of Reliability
-A person's true score is the score the individual would have received if the test and testing conditions were free from error (we never get to see it)
-Observed score = True score + error (X = T + E)
-ex. an observed math score of 85 = true score of 89 + error of -4 (distractions, motivation, upset stomach, mood, anxiety)
-error tends to stay within about the same range from one administration to another (roughly ±4 in this example)
-Systematic error remains constant from one measurement to another and leads to consistency; ex. one counselor assigns two points lower than another counselor to each person in a group of examinations
-Random error, which does affect reliability, is due to:
1. Fluctuations in the mood or alertness of persons taking the test due to fatigue, illness, or other recent experiences.
2. Incidental variation in the measurement conditions due, for example, to outside noise or inconsistency in the administration of the instrument.
3. Differences in scoring due to factors such as scoring errors, subjectivity, or clerical errors.
4. Random guessing on response alternatives in tests or questionnaire items.
*reliability of measurement is the extent to which the score results are free from error
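A minimal sketch of the X = T + E idea, using made-up true scores and a made-up error spread (SD of about 4), and showing reliability as the share of observed-score variance that is true-score variance:

```python
import random
import statistics

# Classical model illustration: observed = true + random error.
# True scores and the error SD below are made-up values for illustration.
random.seed(0)
true_scores = [89, 75, 92, 68, 81]                          # hypothetical true scores (never observed in practice)
observed = [t + random.gauss(0, 4) for t in true_scores]    # random error with SD ~ 4

for t, x in zip(true_scores, observed):
    print(f"true = {t}, observed = {x:.1f}, error = {x - t:+.1f}")

# Reliability can be read as the proportion of observed-score variance that is true-score variance.
print("approx. reliability:",
      round(statistics.variance(true_scores) / statistics.variance(observed), 2))
```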
Decision Making Using a Single Score
-Decision theory involves collecting a screening test score (the person's actual test score) and a criterion score (e.g., the clinician's diagnosis or actual outcome), either at the same point in time or at some point in the future.
-You want to maximize hits (valid acceptances and valid rejections) and minimize misses (false rejections and false acceptances). *MINIMIZE FALSE REJECTIONS in quadrant 2
-a standard set too high will deny needy clients
-a standard set too low will let people who do not need them into limited, needed services
-Linear regression: Y = 0.4 + 0.1X, where X = screening score and Y = predicted criterion score (the 0.4 and 0.1 are given to us and can change)
-purpose of linear regression = a way to predict the criterion score
-Setting a cutoff score (where people fall into either no diagnosis or an actual diagnosis) -> don't set it too high, because you don't want to miss people; you can screen them out later. It doesn't have to be an exact pass/fail score; it can work like the SAT, where there is no failing and you just pick the percentile you want.
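A quick sketch of plugging a screening score into the regression line given in these notes (Y = 0.4 + 0.1X) and comparing the predicted criterion score to a cutoff. The screening scores and the cutoff value of 5.0 are made up for illustration:

```python
# Predict a criterion score from a screening score with the line from the notes,
# then apply a (hypothetical) cutoff on the predicted criterion.

def predict_criterion(x, a=0.4, b=0.1):
    """Predicted criterion score Y = a + b*X."""
    return a + b * x

cutoff = 5.0  # hypothetical cutoff on the criterion
for screening_score in (30, 48, 60):
    y_hat = predict_criterion(screening_score)
    decision = "refer for services" if y_hat >= cutoff else "do not refer"
    print(f"X = {screening_score}: predicted Y = {y_hat:.1f} -> {decision}")
```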
Types of Validity
-Face Validity
-Content-Related Validity
-Criterion-Related Validity
  -Predictive criterion-related validity
  -Concurrent criterion-related validity
-Construct Validity
ex. of validity = you can get a reliable blood pressure measure, but that does not mean it tells you whether you have heart disease
Face Validity
-Face validity is derived from the obvious appearance of the measure itself and its test items, but it is not an empirically demonstrated type of validity (ex. a suicide item like "I want to kill myself")
-Self-report tests with high face validity can face problems when the trait or behavior in question is one that many people will not want to reveal about themselves (ex. taking a psychiatric test to get into the military: because the items are obvious, you can fake your answers to get the job, faking good or bad; on the other hand, obviousness is good because the person is not confused about what is being asked)
-ADVANTAGES AND DISADVANTAGES
-surveys (identifying age, education, etc.) and self-reports
-for tests in certain domains (achievement, intelligence), face validity can add credibility or acceptance to the assessment process
Validity Defined
-Validity indicates the degree to which test scores measure what the test claims to measure.
-especially important for diagnosis, because you can't make a diagnosis if you don't know whether you are measuring the right thing
-reliability = consistency
-validity = measuring what you want to measure
KNOW AN EXAMPLE OF VALIDITY
Types of reliability - internal consistency
-Internal consistency = the correlation of items against each other; estimates of reliability are based on the average correlation among items within a test or scale
-uses the split-half method: the researcher divides the questions into 2 halves, either by an odd-even method or by some other strategy -> each half of the items is treated as a separate test, and the total scores of these two halves are assumed to be parallel (correlated together) -> USE SPEARMAN-BROWN FOR SPLIT-HALF
-Test-retest reliability
-Alternate forms reliability
-Interscorer and interrater reliability = agreement across all raters
Decision Making Using Multiple Tests
-Multiple regression allows for several variables to be weighted in order to predict some criterion score.
-The multiple cutoff method means that the counselor must establish a minimally acceptable score on each measure under consideration, then analyze the scores of a given client or student and determine whether each of the scores meets the given criterion (see the sketch below).
-Clinical judgment and diagnosis using a test battery rely on the experiences, information processing capability, theoretical frameworks, and reasoning ability of the professional counselor.
-Combining decision-making models can lead to greater accuracy.
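A minimal sketch of the multiple cutoff method: the client must meet the minimum score on every measure, and a single score below its cutoff rules the profile out. The measure names, cutoffs, and client scores are hypothetical:

```python
# Multiple cutoff check: every score must meet its own minimum.
cutoffs = {"verbal": 50, "quantitative": 45, "interview": 3}
client  = {"verbal": 62, "quantitative": 41, "interview": 4}

meets_all = all(client[m] >= cutoffs[m] for m in cutoffs)
for m in cutoffs:
    print(f"{m}: score {client[m]} vs cutoff {cutoffs[m]} ->",
          "pass" if client[m] >= cutoffs[m] else "fail")
print("overall decision:", "accept" if meets_all else "reject")
```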
Importance of reliability
-Reliability is a necessary, but not sufficient, condition in the validation process. -Unreliability shrinks the observed effect size through attenuation.
what is reliability
-Reliability means consistency, accuracy, repeatability
-a reliable score is mostly a true estimate of a person's actual ability or characteristic, with only a little error contained in the test
-Reliability refers to the scores obtained with a test, not to the instrument itself.
-under the classical model of reliability, the criterion-related validity coefficient of a test's scores cannot exceed the square root of their reliability -> the reliability of the scores sets a ceiling on the validity of the test's scores
Standard Error of estimate
-Standard error of estimate (SEE) = an index of prediction error; derived from examining the difference between our predicted value of the criterion and the person's actual score on the criterion; it reflects the variance between the predicted score and actual performance; the smaller that variance, the greater the validity and reliability
-residual = that difference, known as prediction error; the SEE summarizes the size of the residuals KNOW FORMULA
-a large SEE means the test is not very valid because there is so much variability in prediction
-the smaller the SEE, the more valid the instrument
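The formula itself is not written out on this slide; a sketch using the standard textbook form, SEE = SD of the criterion × sqrt(1 − r²), with made-up numbers, shows how higher validity shrinks the prediction error:

```python
import math

# Standard error of estimate using the usual textbook formula (an assumption here,
# since the notes only say "KNOW FORMULA"). SD and r values are made up.
def standard_error_of_estimate(sd_criterion, validity_r):
    return sd_criterion * math.sqrt(1 - validity_r ** 2)

sd_y = 10.0  # hypothetical SD of criterion scores
for r in (0.5, 0.8, 0.95):
    print(f"r = {r}: SEE = {standard_error_of_estimate(sd_y, r):.2f}")
# Higher validity -> smaller SEE -> smaller residuals (prediction errors).
```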
Test-Retest Reliability
-Test-retest reliability, aka temporal stability (stability over time), is the extent to which the same persons respond consistently to the same test administered on different occasions.
-the correlation coefficient is also referred to as the coefficient of stability; the primary source of measurement error is instability of measurement over time
-give the assessment again to see if clients are improving (how this applies to our lives)
-The major problem is the potential for carryover effects (practice effects, intervention, circumstances, maturation) between the two administrations.
-Thus, it is most appropriate for measurements of traits that are stable across time (better reliability -> ex. personality, visual acuity, work values)
-test-retest is typically done with about a 2-week interval between administrations
The Correlation Coefficient
-The Pearson product-moment correlation coefficient indicates the direction and strength of a linear relationship between two variables measured on interval or ratio scales.
-helps us determine how strong the relationship is between two variables -> low, medium, high, very high
-r = correlation coefficient; a value of ±1.0 is the strongest possible relationship
-typically aim for r = .90 (very high correlation) to give a diagnosis
-for screening you want r = .80; ex. a depression screener
A. .90-1.0 very high
B. .70-.90 high
C. .50-.70 moderate
D. .30-.50 low
E. .00-.30 very low
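A short sketch computing a Pearson r by hand and reading it against the benchmarks above (.80 for screening, .90 for diagnosis). The paired scores are made-up illustration data:

```python
import math

# Pearson product-moment correlation for two hypothetical score lists.
x = [10, 12, 15, 18, 22, 25]
y = [11, 13, 14, 19, 21, 26]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov  = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r = cov / math.sqrt(ss_x * ss_y)
print(f"r = {r:.2f}")  # closer to +/-1.0 means a stronger linear relationship
```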
Classical model of reliability cont
-The classical assumption is that any observed score consists of both the true score and error of measurement. -Reliability indicates what proportion of the observed score variance is true score variance. -Reliability coefficients in the .80s are desirable for screening tests, .90s for diagnostic decisions.
Content-Related Validity
-The main focus is on how the instrument was constructed and how well the test items reflect the domain of the material being tested -> do your test items accurately reflect what you are trying to test? (ex. a licensing exam should cover what you have been trained in)
-important in a hiring decision -> e.g., to see whether applicants are honest; high face validity is bad here because it makes it easy to fake good, so you want content validity to be strong while face validity stays low
-Why is it important that a test designed to have low face validity still have high content-related validity? -> because test takers do not know what is being tested, you have to make sure the test actually measures what you want it to measure -> ask a different type of question to get information about integrity, but make sure it is really assessing integrity
-This type of validity is widely used in educational testing and in tests of aptitude or achievement.
-Determining the content validity of a test requires a systematic evaluation of the test items to determine whether they adequately cover a representative sample of the content domain -> you can't ask about everything that was taught, but the test should be a good representation of what was learned
spearman brown prophecy formula
-helps estimate the internal (full-test) reliability once the two halves are put back together
rxx = reliability coefficient
r12 = Pearson correlation between the scores on the two halves of the test
rxx = 2r12 / (1 + r12)
*reliability can be increased by adding more items
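A sketch of the prophecy formula from these notes, rxx = 2·r12 / (1 + r12); the half-test correlation below is a made-up value:

```python
# Spearman-Brown prophecy formula: full-test reliability from the half-test correlation.
def spearman_brown(r_half):
    return 2 * r_half / (1 + r_half)

r12 = 0.70  # hypothetical correlation between the two halves
print(f"half-test r = {r12}, estimated full-test reliability = {spearman_brown(r12):.2f}")
# Lengthening a test with parallel items raises the estimated reliability.
```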
why is reliability important?
-if a measure isn't reliable, why should you depend on it?
-everyday examples: temperature readings, measuring cups, scales, blood pressure
Construct validity
-measuring the construct you are trying to measure; needed for personality, disorders, etc.
-Evidence for construct validity is established by defining the construct being measured and by gradually collecting information over time to demonstrate or confirm what the test measures. Construct validity evidence is gathered in 5 different ways:
-Convergent validity evidence = gathered by correlating the scores on a test with scores on other tests believed to measure the same or very similar constructs (high positive correlations are evidence of convergent validity); ex. give two depression tests and see whether the scores are very similar
-Discriminant validity evidence = derived by demonstrating that test scores are not highly correlated with measures of other, unrelated constructs
-Developmental changes = indicate support for the construct validity of a test when the test measures changes that are expected to occur over time; ex. if we are interested in measuring the thinking processes of children in light of Piaget's model of cognitive development, a valid test would discriminate between concrete operations and formal operations and would show increased levels of formal operational thinking as young people move from childhood to adolescence, as expected developmentally -> older children would be expected to obtain higher raw scores on IQ tests than younger children (necessary but not sufficient for establishing construct validity)
-Distinct groups = can provide evidence of construct validity if a group's scores differ in the expected direction from the scores of people in other groups or the general population (ex. a test to measure leadership -> expect military officers to score higher on average than rank-and-file soldiers)
*don't focus much on factor analysis KNOW THESE
clinical judgment
-relies on experience rather than statistical decision making
-instead of test results, it relies on reasoned judgment from observations
-the clinician determines what is and is not important to look at
-can be biased depending on what theory the clinician uses
READ PAGES 156-166 to fill in the blanks -> WE DID THIS IN CLASS
Reliability of Criterion-Referenced Tests
-criterion-referenced tests show how the examinees stand with respect to an external criterion
-the criterion is usually some specific educational or performance objective such as "can apply basic algebra rules" or "at risk for depression"
-Classification consistency shows the consistency with which classifications are made, either by the same test administered on two occasions or by alternate test forms.
-when is this used = school tests, screeners (e.g., determining how much depression someone has)
There are two forms:
1. Mastery versus nonmastery = the observed proportion of persons consistently classified as mastery vs. nonmastery
2. Cohen's k = the proportion of nonrandom consistent classifications
Cohen's k = (Po - Pe) / (1 - Pe)
-Po = P11 + P22
-Pe = consistency expected by chance = Pa1*Pb1 + Pa2*Pb2
-Pa1 = P11 + P12
-Pa2 = P21 + P22
-Pb1 = P11 + P21
-Pb2 = P12 + P22
-P11 = the proportion of persons classified as mastery by both test forms
-P12 = the proportion of persons classified as mastery on form A and nonmastery on form B
-P21 = nonmastery on form A and mastery on form B
-P22 = nonmastery on both forms
*the first and second digits refer to forms A and B, with 1 = mastery and 2 = nonmastery
What k values mean:
-.41-.60 is moderate and ordinarily suffices for research purposes; needs to be higher for diagnosis
-.61-.80 = considered substantial
-.81-1.0 = almost perfect
-classifying people can be dangerous, so make sure the test is valid and reliable
-look at pg 140
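A worked sketch of Cohen's k following the cell proportions defined above; the four cell proportions are made up for illustration:

```python
# Cohen's kappa for classification consistency across two test forms.
# p11..p22 are hypothetical proportions (1 = mastery, 2 = nonmastery; digits = form A, form B).
p11, p12, p21, p22 = 0.50, 0.10, 0.05, 0.35

po = p11 + p22                      # observed consistent classifications
pa1, pa2 = p11 + p12, p21 + p22     # marginal proportions for form A
pb1, pb2 = p11 + p21, p12 + p22     # marginal proportions for form B
pe = pa1 * pb1 + pa2 * pb2          # consistency expected by chance
kappa = (po - pe) / (1 - pe)

print(f"Po = {po:.2f}, Pe = {pe:.2f}, kappa = {kappa:.2f}")
# .41-.60 moderate, .61-.80 substantial, .81-1.0 almost perfect
```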
How to reduce measurement error and improve reliability
1. writing items more clearly
2. providing complete and understandable test instructions
3. administering the instrument under prescribed conditions
4. reducing subjectivity in scoring
5. training raters and providing them with clear scoring instructions
6. using heterogeneous respondent samples to increase the variance of observed scores
7. increasing the length of the test by adding items that are ideally parallel to those already in the test
Criterion-related Validity
2 forms of criterion-related validity (concurrent and predictive)
-Criterion-related validity is derived from comparing scores on the test to scores on a selected criterion (ex. SAT -> performance in college; the criterion is college performance/GPA, and the SAT is the predictor)*** know what the predictor is and what the criterion is
Sources of criterion scores include:
-Academic achievement
-Task performance
-Psychiatric diagnosis
-Observations
-Ratings
-Correlations with previous tests
The Interaction of Reliability and Validity
A test can never be more valid than it is reliable.
-if a test score is mostly composed of testing error, it cannot possibly be mostly composed of an accurate assessment of the construct or ability in question
Alternate Forms Reliability
Alternate forms reliability counteracts the practice effects that occur in test-retest reliability by measuring the consistency of scores on alternate test forms administered to the same group of individuals.
-similar to test-retest, but not the same test; the forms are equivalent in what they are testing
-used for measures of ability, aptitude, skills
-uses the coefficient of equivalence = provides an estimate of the reliability of each of the alternate forms based on item content, scorer, and temporal stability
-if the correlation between alternate forms is much lower than the internal consistency coefficient (a difference of .2 or more), this might be due to (a) differences in content, (b) subjectivity of scoring, (c) changes in the trait being measured over time between administrations of the alternate forms
-needs at least 10 people to take it
-helps with practice effects (carryover effects) like in test-retest
-Check scoring subjectivity by (1) randomly splitting a large group of persons, (2) administering the alternate forms on the same day for one group, and (3) administering the alternate forms after a two-week interval for the other group
Classical assumptions
Classical test theory also proposes two additional assumptions: -The distribution of observed scores that a person may obtain under repeated independent testing with the same test is normal. -The standard deviation of this normal distribution, referred to as the standard error of measurement (SEM), is the same for all persons of a given group taking the test.
Internal Consistency - split half method
Internal consistency estimates are based on the average correlation among items within a test or scale. There are various ways to obtain internal consistency:
-Split-half method: odd-even method or matched random subsets method -> the Spearman-Brown prophecy formula must be applied (used when the two halves are of about the same difficulty)
-Cronbach's coefficient alpha: used for multiscaled item response formats (e.g., Likert scales)
-Kuder-Richardson formula 20: used for dichotomous item response formats
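A sketch of Cronbach's coefficient alpha using its standard formula, alpha = k/(k−1) × (1 − sum of item variances / variance of total scores); the formula itself is not spelled out in the notes, and the 4-item Likert responses below are made-up illustration data:

```python
from statistics import pvariance

# Cronbach's alpha for a small hypothetical set of Likert-type responses.
responses = [            # each row = one respondent, each column = one item
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
]
k = len(responses[0])
item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
total_var = pvariance([sum(row) for row in responses])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```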
Interscorer and Interrater Reliability
Interscorer and interrater reliability are influenced by subjectivity of scoring. The higher the correlation, the lower the error variance due to scorer differences, and the higher the interrater agreement.
-largely about training the raters because scoring is so subjective
-the less objective the scoring, the lower the interrater reliability
-more likely to have weak reliability because it is subjective; ex. coding behaviors such as aggression -> one rater may see aggression differently than another
-ex. projective tests -> no right or wrong answer, no standard
*DO NOT NEED TO CALCULATE
standard error of measurement
SEM = the square root of (1 - reliability (rxx)), multiplied by the standard deviation of the scores KNOW FORMULA
-the standard deviation describes the spread of a group's scores around the group mean in a normal distribution
-the standard error of measurement describes how far an individual's observed score is likely to fall from that person's true score
-95% of scores fall within two SDs below and above the mean
-95% confidence interval = your confidence that the true score falls within the interval; the interval for the true score runs from the obtained score - 2(SEM) to the obtained score + 2(SEM)
-when error increases, reliability decreases
-the SEM improves the accuracy of our measurement interpretations
*A smaller SEM (higher reliability) produces smaller confidence intervals for the person's true score, thus improving the accuracy of measurement
*the SEM is inversely related to reliability, so high reliability indicates high accuracy of measurement (lower SEM)
-the reliability coefficient is a unitless number between 0 and 1, conveniently used to report reliability in empirical studies, but the SEM relates directly to the meaning of the test's scale of measurement (raw score, z score) and is therefore more useful for score interpretation
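A sketch of the SEM formula from these notes, SEM = SD × sqrt(1 − rxx), plus the roughly ±2 SEM 95% confidence interval around an obtained score. The SD, reliability, and obtained score are made-up values:

```python
import math

# Standard error of measurement and an approximate 95% confidence interval.
sd, rxx, obtained = 15.0, 0.91, 105.0   # hypothetical group SD, reliability, obtained score
sem = sd * math.sqrt(1 - rxx)
low, high = obtained - 2 * sem, obtained + 2 * sem
print(f"SEM = {sem:.1f}; 95% CI approx. {low:.1f} to {high:.1f}")
# Higher reliability -> smaller SEM -> narrower interval -> more accurate measurement.
```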
2 forms of Criterion-related Validity
STILL CRITERION-RELATED VALIDITY, BUT THE DIFFERENCE IS TIMING
Two forms of criterion-related validity:
-Predictive criterion-related validity = the test is administered first and scores on the criterion measure are collected from the same sample of persons at a later date (in the future); disadvantages: things change developmentally, and people may drop out so you can't determine whether they succeed (attrition - losing some of the original sample because they quit, get sick, etc.); not compromised by a restricted range -> the delay between collection of the predictor and criterion measures means that the problem the test was designed to resolve continues for that length of time
-used in: education, business, industry
-Concurrent criterion-related validity = the scores on the test and the criterion measure are collected at the same point in time (ex. a diagnostic test: when you get the scores you get the diagnosis, with no time lag)
-used in: diagnosis, job settings
-restricted range with concurrent = a smaller range of people who already have a decent level of performance is used as the sample instead of a broad random group; advantages: no delay and no attrition
Social Justice and Multicultural Issues
Test authors should determine whether the reliabilities of scores from different groups vary substantially, and report those variations for each population for which the test has been recommended.
-a test may not be appropriate if there is substantial variance across groups, which is why the sample and the content of the items are important
multiple tests
What types of decisions can be made: -screening-level -diagnostic -placement
Sensitivity/specificity/PPP/NPP
-people who screen positive for depression and actually have it are valid acceptances
-people who screen positive for depression but don't actually have it are false acceptances
-people who didn't screen positive for depression and really don't have it are valid rejections
-people who didn't screen positive for it but do have it are false rejections
-sensitivity = the valid acceptance of an actual diagnosis (the proportion of all the people who truly have depression whom the screener catches); VA / (VA + FR)
-specificity = the valid rejection of a diagnosis (screened as not having it and do not have it - VR); VR / (VR + FA)
*accepting or rejecting refers to the screener, but whether the decision is valid or false is determined by the clinical diagnosis
-positive predictive power = of the people the screening test predicts positively to have the diagnosis, the proportion who actually have it
-negative predictive power = of the people the screening test predicts not to have the diagnosis, the proportion who actually do not have it
-PPP and NPP = about the screener's predictions
-Sensitivity and Specificity = about the criterion/diagnosis
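A sketch computing sensitivity, specificity, PPP, and NPP from the four decision-table counts (VA, FA, VR, FR); the counts below are made up for illustration:

```python
# Decision-table counts (hypothetical).
VA = 40   # valid acceptances: screened positive AND actually have the diagnosis
FA = 10   # false acceptances: screened positive but do NOT have the diagnosis
VR = 80   # valid rejections: screened negative and do NOT have the diagnosis
FR = 5    # false rejections: screened negative but DO have the diagnosis

sensitivity = VA / (VA + FR)   # of everyone who truly has it, how many the screener caught
specificity = VR / (VR + FA)   # of everyone who truly does not, how many the screener cleared
ppp = VA / (VA + FA)           # of positive screens, how many were correct
npp = VR / (VR + FR)           # of negative screens, how many were correct

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print(f"PPP = {ppp:.2f}, NPP = {npp:.2f}")
```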
multiple regression formula
-predicts a criterion score based on several screeners
-the multiple regression formula allows each test to be weighted
Y = a + B1X1 + B2X2 + B3X3 + ... + BiXi (the B weights come from SPSS output)
Y = criterion score
a = constant (intercept)
B1, B2, B3 = the weights given for each predictor
BiXi = additional weighted predictors, for as many predictors as are included
EX.
-B1X1 = multiply the weight from the SPSS output by the actual score (e.g., the father's score)
a = 1.21
b1 = coefficient for the father's score = 0.031
b2 = 0.024
b3 = 0.017
X1, X2, X3 = the actual scores
Y = 6.016
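A sketch of plugging scores into the weighted prediction equation, using the intercept and B weights listed above (a = 1.21, b1 = 0.031, b2 = 0.024, b3 = 0.017). The predictor scores X1-X3 are hypothetical, so the predicted Y here will not match the 6.016 example in the notes:

```python
# Multiple regression prediction: Y = a + b1*X1 + b2*X2 + b3*X3.
a = 1.21
weights = [0.031, 0.024, 0.017]   # b1, b2, b3 from the notes
scores  = [85, 70, 60]            # hypothetical X1, X2, X3

y_hat = a + sum(b * x for b, x in zip(weights, scores))
print(f"predicted criterion score Y = {y_hat:.3f}")
```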
TEST
With the exception of Cohen's k, we don't need to know how to compute these, but we do need to know when each is used and which type of reliability it estimates:
-split-half method
-Cronbach's coefficient alpha
-Kuder-Richardson formula 20