MODULE 2: Reliability, Validity, Test Bias (Ch. 4, 5, 19)


This is a method developed by the Educational Testing Service (ETS) that attempts to counter the adverse effects of minority status on many of their tests (e.g., GRE, SATs, etc.). It attempts to identify items that are specifically biased against any ethnic, racial, or gender group.

Differential Item Functioning (DIF) Analysis

What is differential validity?

Differential Validity: the belief held by psychologists and psychometrics specialists that tests are differentially valid for African Americans and White people.

Absolute Agreement (categorical data)

If the raters agreed on 3 of 5 cases, that is 60% absolute agreement. It's hard to know if that's good or bad (it could depend on how many rating categories there are). A small sketch of the computation follows.
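
A minimal sketch of the computation (my own illustration - the raters and data are made up):

# Absolute agreement between two raters on categorical data.
def absolute_agreement(ratings_a, ratings_b):
    # Proportion of cases on which both raters chose the same category.
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Hypothetical raters classifying 5 cases; they agree on 3 of them.
rater1 = ["yes", "no", "yes", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "yes"]
print(absolute_agreement(rater1, rater2))  # 0.6 -> 60% absolute agreement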

What is the nomological net? What are we trying to do? Constructs are a function of....

All the constructs associated with, e.g., FOMO should be related: social anxiety, need to impress friends, excitement seeking. All the observations should be indicative of these constructs. The whole swirling space of constructs and measures should surround and form the nomological net around FOMO. If some of them are not related, or fail to predict, then you have a problem, because the network IS the validation. **We are trying to position our construct relative to the things it should be close to or far from!!! ***Constructs in themselves aren't anything - they're simply a function of what they're relative to.

What is the classical test theory equation? There are 2 kinds of error - which type does CTT focus on?

X = T + E: the observed score (X) equals the true score (T) plus error (E). Psychology measures aren't super precise, so if we measured, e.g., extraversion, what we obtain is not necessarily the true value, though it could be close. Systematic error: e.g., if you step on a scale and it's 2 pounds too heavy, then everything you weigh with that scale will be 2 lbs heavier - a consistent error. Random error: normally distributed, random. CTT's error term focuses on random error.

A test's ability to detect the presence of a condition is called ______________. If you have few false negatives, this would be... A test's ability to detect the absence of a condition is called ___________. If you have few false positives, this would be...

Sensitivity; few false negatives means high sensitivity. Specificity; few false positives means high specificity. (A worked example is sketched below.)
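
A minimal sketch of the arithmetic (hypothetical counts; the definitions are the standard 2x2 confusion-matrix formulas):

# Sensitivity: of those who truly have the condition, how many does the test flag?
def sensitivity(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)

# Specificity: of those who truly lack the condition, how many does the test clear?
def specificity(true_neg, false_pos):
    return true_neg / (true_neg + false_pos)

# Hypothetical screen: 50 people with the condition, 150 without.
print(sensitivity(true_pos=45, false_neg=5))    # 0.90 -> few false negatives
print(specificity(true_neg=135, false_pos=15))  # 0.90 -> few false positives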

explain how measurement error can attenuate correlations

"Attenuate" means to reduce the value of something Correlations between 2 measures are attenuated by the reliability of those measures even tho the real correlaion is 0.6 is seems like its 0.4 its not that its a small correlation its that u have lots of measurement error everything is attenuated to the extent that ur measurements are reliable

inter-rater reliability - what is it? - more useful in which cases? - is this high in psychiatry?

A measure of consistency/agreement among raters. More useful in subjective cases (e.g., diagnosis) than objective cases (e.g., math problem). Surprisingly low in psychiatry: multiple clinicians observe ◦ the same patient ◦ the same interview questions ◦ the same video recording ◦ and reach different diagnoses (sometimes no agreement)

what does parallel-forms reliability assume?

Assumes that both tests measure the same construct. Assumes that random halves are equivalent.

standard deviation of men's height example - how do we incorporate measurement error into this?

But we are assuming perfect measurement of height. Psychological measures are not perfect... they are less than perfectly reliable - they have measurement error. So we are measuring height with a rubber ruler: there is some random error.

what is considered the largest source of measurement error?

Content sampling is typically considered the largest source of measurement error

Two types of construct validity?

Convergent Validity: correlations with similar constructs should be high - a measure of empathy should be correlated with similar things (e.g., being tender and loving, feeling excited for people, etc.). Discriminant Validity: correlations with dissimilar constructs should be low (e.g., shouldn't be correlated with things that aren't related).

what is criterion validity? 2 types?

Does the measure predict the thing it's supposed to predict (i.e., the criterion)? Can test scores predict performance on a criterion? (e.g., can IQ predict college GPA?) 1. Concurrent validity (do your IQ scores correlate with your current GPA?) 2. Predictive validity (IQ scores are correlated with future GPA)

Content Validity - what is it? Includes _________ and ___________

Does the test cover a representative sample of the construct being measured? E.g., a grade 9 math ability test with nothing but division items = low content validity - it doesn't cover all the other things you should know in math. E.g., a math ability test with many types of math questions, representative of the kind of math you should know in grade 9, has good content validity. Includes construct underrepresentation and construct-irrelevant variance (measuring, e.g., not only social phobia but other things - if you're reaching too much).

What are we essentially asking when we're looking at validity?

Does the test measure what it is supposed to measure?

What is external validity? ____________ is an aspect of external validity and refers to whether a study's findings can be generalized to the real world.

External Validity: how generalizable the findings of a study are to other people, settings, situations, time periods, etc. Or did it just work for that one group of people at that one time and it'll never work again? (Big replication crisis in psych because people pay more attention to internal than external validity.) Ecological Validity, an aspect of external validity, refers to whether a study's findings can be generalized to the real world (e.g., a lab that looks just like a bar helps participants forget where they are). *Note: although rigorous research methods increase internal validity, they often decrease external validity.

Traditional validity nomenclature - 4 types:

Face Validity (doesn't count) Content Validity Criterion Validity Construct Validity

how can we improve reliability?

Increase the number of test items ---> composite score will be closer to true value and domain sampling will be better (more items allow us to cover the whole construct) Develop better items using item analyses ◦ Average inter-item correlation ◦ Average item-total correlation Use good standardization procedures ◦ Explicit instructions for administration ◦ Same environment, time of day, experimenter, etc. ◦ Use consistent scoring procedures ◦ Extensive training of test users

Interpreting Validity Coefficients - what do we have to look for? 8-point summary

1. Look for changes in the cause of relationships 2. Look at what the criterion means 3. Look at the subject population in the validity study 4. Look at the sample size of the validity study 5. Never confuse the criterion with the predictor (e.g., students who didn't achieve the minimum GRE score and have to achieve it by the end of their grad school degree - this makes no sense) 6. Check for restricted range on both predictor and criterion 7. Review evidence for validity generalization 8. Consider differential prediction

When is medium internal consistency good? Bad?

Medium internal consistency is bad for a narrow construct (panic disorder), but not so bad for a broad construct (Neuroticism)

If 2 tests have high parallel-forms reliability, can they be used interchangeably? What is this important for?

Note: If 2 tests have high parallel-forms reliability, they can be used interchangeably (e.g., as pre-tests and post-tests in repeated-measures designs to avoid learning/memory effects). E.g., 2 Verbal IQ tests, 2 Module tests.

test-retest reliability - what is it? - only applies to...

Same test given to same sample on two different occasions. Only applies to measures of stable traits ---> test-retest reliability is supposed to estimate error associated with administering a test at different times, not true changes in the trait.

What are reasons why we wouldn't want to utilize grades instead of the typical criterion?

Teacher-assigned grades are unstandardized and open to subjective bias Few available studies have used grades as the criterion The most frequently cited study in favor of this reform is open to other explanations. In this study, the teachers rated the classroom performance of nearly all of the minority children as poor. These low ratings resulted in little variance on the criterion measure.

Adjusted Agreement and the problem of chance agreement Imagine there are 3 Raters who give yes/no ratings ◦ What happens if one of the raters doesn't care and flips a coin for her ratings? What will happen to agreement?

This is hard to figure out, but it's clear we need to correct (adjust) for chance: sometimes she guesses the same category as the other judges, boosting "agreement". Cohen's Kappa does this ---> the formula is too complicated here, but the issue can be explained using a multiple-choice test example....

what is time sampling error? which reliability is this reflected in?

Time sampling error reflects random fluctuations in performance over time. Includes changes in: ◦ the examinee (e.g., fatigue, illness, anxiety) ◦ the environment (e.g., distractions, temperature) Test-retest reliability estimates this type of error

1. Average inter-item correlation

(How correlated are the questions with each other, on average?) E.g., with 6 items, the correlations of these 6 items form a triangle of pairwise correlations; their average (e.g., 0.9) is the average inter-item correlation --> the extent to which all the items hang together.

When one option is obviously incorrect, how does the guessing/chance correction work for adjusted agreement?

- And that's if they take a wild guess - Often, one option is obviously incorrect - Therefore, guesses end up being 33% right - Therefore, expected scores run 33%-100%, not 0%-100%. Example: a test has 30 items, a-d. Someone knows 18 answers and guesses on 12 items. On average, if he can eliminate one wrong answer, he will get 4 of the 12 right (33%). His score is 22/30, but correcting for guessing would lower it to 18/30. (A sketch of this arithmetic follows.)
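
A minimal sketch of that arithmetic (assuming the classic correction-for-guessing rule, corrected = right - wrong/(options - 1), applied with the 3 options that remain live after one is eliminated):

def expected_score_with_guessing(known, guessed, live_options):
    # Expected number right = items known + chance hits on the guessed items.
    return known + guessed / live_options

def corrected_score(right, wrong, live_options):
    # Subtract the number of right answers expected from blind guessing.
    return right - wrong / (live_options - 1)

# 30 items (a-d); examinee knows 18, guesses on 12 after eliminating one option each.
print(expected_score_with_guessing(known=18, guessed=12, live_options=3))  # 22.0
print(corrected_score(right=22, wrong=8, live_options=3))                  # 18.0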

In the Kuder-Richardson technique, what is the only situation that will make the sum of the item variances less than the total test score variance?

- The only situation that will make the sum of the item variances less than the total test score variance is when there is covariance between the items - when the items are correlated with each other. - The greater the covariance, the smaller the pq term will be relative to the total variance. - When the items covary, they can be assumed to measure the same general trait, and the reliability for the test will be high. The other factor in the formula is an adjustment for the number of items in the test. This allows an adjustment for the greater error associated with shorter tests.
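
For reference, the standard KR20 formula the card is describing, where k is the number of items, p_i the proportion passing item i, q_i = 1 - p_i, and \sigma_X^2 the total test score variance:

KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum p_i q_i}{\sigma_X^2}\right)

Covariance among the items inflates \sigma_X^2 relative to \sum p_i q_i, which is what pushes the reliability estimate up.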

Why do things get more complicated for absolute agreement if there are 3+ raters?

If there are more than 2 raters, it's not about the fact that 2/3 agree -> you have to look at pairwise comparisons. Pairs: ◦ Rater 1 - Rater 2: agree = 1/1 ◦ Rater 1 - Rater 3: disagree = 0/1 ◦ Rater 2 - Rater 3: disagree = 0/1. Therefore, agreement is 1/3 or 33%.
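
A minimal sketch of the pairwise computation (hypothetical ratings):

from itertools import combinations

def pairwise_agreement(ratings):
    # ratings: one categorical rating per rater for the same case.
    pairs = list(combinations(ratings, 2))
    agreements = sum(a == b for a, b in pairs)
    return agreements / len(pairs)

# Raters 1 and 2 say "yes", rater 3 says "no": only 1 of 3 pairs agrees.
print(pairwise_agreement(["yes", "yes", "no"]))  # 0.333... -> 33%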

What would a regression line/graph look like with an intercept bias but no slope bias? In this situation, if you drew one line for both groups and didn't make a distinction between men and women, what mistakes would you be making?

- Same mean difference in ability, but now we have a separation of the lines --> if we follow the lines until they touch the Y axis, the intercept points will be different. - For men with 100 IQ you expect a GPA of around 2, but for women with 100 IQ you expect a GPA of around 3 ---> bias favoring women - even at the same level of intelligence as men, they have a major boost in GPA. - If we drew one line for everyone and assumed everyone is the same, we would overestimate the red group's GPA and underestimate the blue group's GPA.

Random error doesn't affect ____ but affects _____; systematic error doesn't affect _____ but it affects _____

Random error doesn't affect the mean but affects the variability (the variability grows); systematic error doesn't affect the variability but affects the mean (it just shifts everything).

---> Concretely, the validity coefficient squared is ...... this means that ....

---> Concretely, the validity coefficient squared is the percentage of variation in the criterion that we can expect to know in advance because of our knowledge of the test scores. This means that different validity coefficients could mean different things in different spheres. (E.g., a validity coefficient of .40 means 16% of the variation in the criterion is predictable from the test.)

Why does regression to the mean happen? Why is this bad for treatment studies?

---> Because people are more likely to report to services when they're more extreme than normal, or get noticed when they're performing better than average. It often LOOKS like the treatment worked, but most of the time it's just them returning to their normal - regression to the mean.

Correlation ranges from...

-1 to 1; anything close to ±1 is a strong correlation, close to 0 = weak.

According to basic sampling theory, what would the distribution of random errors look like? The center of the distribution should represent the __________, and the dispersion around the mean of the distribution should display ___________. How can we estimate the true score?

Basic sampling theory tells us that the distribution of random errors is bell-shaped. The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors. We can estimate the true score by finding the mean of the observations from repeated applications.

2. Average item-total correlation

(How correlated are the items with the total score, on average?) E.g., 0.84 = how well item 1 correlates with the total. Look at the bottom row (the item-total correlations) and average them to get the average amount of correlation between each item and the total.

Can measure internal consistency through 4 methods. What are they?

1. Average inter-item correlation (how correlated are the questions with each other, on average?) 2. Average item-total correlation (how correlated are the items with the total, on average?) - if we add all your items up, how correlated is item 1 with the total score? (If the best students in class are getting item 1 wrong and the worst students are getting it right, it's not a good item.) 3. Split-half reliability (if we randomly split the items into two sets, how correlated are they?) 4. Cronbach's Alpha (what is the average of all possible split-half correlations?) A toy computation of all four is sketched below.
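
A toy computation of all four indices (made-up data for 6 respondents on 4 items; standard-library Python, 3.10+ for statistics.correlation):

from itertools import combinations
from statistics import correlation, pvariance

# Rows = 6 respondents, columns = 4 items (toy scores).
data = [
    [4, 5, 4, 5],
    [2, 1, 2, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 4],
]
items = [list(col) for col in zip(*data)]  # one list of scores per item
totals = [sum(row) for row in data]        # total score per respondent

# 1. Average inter-item correlation: mean r over all item pairs.
inter = [correlation(a, b) for a, b in combinations(items, 2)]
print(sum(inter) / len(inter))

# 2. Average item-total correlation: mean r of each item with the total score.
item_total = [correlation(item, totals) for item in items]
print(sum(item_total) / len(item_total))

# 3. Split-half reliability: correlate odd-item vs. even-item subscores
#    (a Spearman-Brown correction would normally be applied afterwards).
odd = [row[0] + row[2] for row in data]
even = [row[1] + row[3] for row in data]
print(correlation(odd, even))

# 4. Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
k = len(items)
alpha = k / (k - 1) * (1 - sum(pvariance(i) for i in items) / pvariance(totals))
print(alpha)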

what are the main sources of measurement error?

1. Content sampling error 2. Time sampling error (^2 main ones) 3. Inter-Rater Differences

what are the environmental influences that may be causing such a disparity?

1. Social and Economic Inequality (SES) is one of the highest predictors of good performance on standardized testing. Most experts now agree that tests measure not just inborn potential but also the effects of cumulative experience. 2. Coaching and prior experience with the material

What are 3 ways we can interpret data in biased situations?

1. Instead of indicating genetic variations or social handicaps, differences in test scores may reflect patterns of problem solving that characterize different subcultures. 2. We can interpret the results as artifacts of varying types of intelligence. Gardner suggested that there are seven distinct types of intelligence: linguistic, musical, logical-mathematical, spatial, bodily-kinesthetic, and two different forms of personal intelligence. 3. Utilizing grades instead of the typical criterion as a predictor.

What are 3 ways to evaluate inter-rater reliability?

1. Record the percentage of times that two or more observers agree. 2. The kappa statistic.

What are the flaws in recording the percentage of times that two or more observers agree for testing inter-rater reliability?

1. This percentage does not consider the level of agreement that would be expected by chance alone -- it should include an adjustment for chance agreement (adjusted agreement, above). 2. Percentages should not be mathematically manipulated -- for example, it is not technically appropriate to average percentages.

IQ scores before and after adoption - what does this show? 2 things

3 groups all have around the same starting point of around 70 pre-adoption - performing below average when given IQ tests at 4-6 years of age. 6-8 years later, researchers wanted to see how they were doing in the new household --> if placed into a household with: - low SES - 85 - middle SES - 92 - high SES - 100. *Even in low SES conditions there's a huge jump in IQ (10 points is huge). *Conditions of enrichment (access to books and food, etc.) have a dramatic effect. *The average child goes from very low IQ (almost intellectually disabled) to almost average only with a change of household. ***Shows --> the dramatic effects, beyond SES, of living in a stable household AND the effect that SES can have - calls into question any conclusions you can make from these race studies.

A construct is defined as .....

A construct is defined as something built by mental synthesis. As a construct, intelligence does not exist as a separate thing we can touch or feel, so it cannot be used as an objective criterion. This is essentially what construct-related validity is concerned with.

A decision is considered Pareto optimal when ....

A decision is considered Pareto optimal when it balances competing goals, in this case between criterion performance and ethnic or racial balance

A good way of thinking about reliability is that it represents ____________. 4 types of reliability?

A good way of thinking about reliability is that it represents consistency ◦ Across questions (internal consistency reliability) ◦ Across forms of a test (parallel forms reliability) ◦ Across time (test-retest reliability) - IQ test ◦ Across raters (inter-rater reliability)

A test is unbiased if...

A test is unbiased if the scores of groups all cluster around the same regression line

A test that is designed to predict performance in a mechanical training program would show differential validity if ... (e.g., a mechanical concepts test: women vs. men and how well they'll do in the program)

A test that is designed to predict performance in a mechanical training program would show differential validity if it predicted performance much better for men than for women. Women who have traditionally had less previous experience with mechanical concepts than men may score more poorly. However, when taking the course, many women would easily acquire this information and perform well. Thus, the test would provide relatively little information about how these women would perform in the program, but it would tend to predict how men would perform. ---> It's not that the scores are different, it's that the predictions based on the scores are very different.

What is a validity coefficient? They are rarely larger than ..... and those between ..... and ..... are considered high.

A validity coefficient is the relationship between a test and a criterion expressed as a correlation. They are rarely larger than 0.60, and those between 0.3 and 0.4 are considered high. A coefficient is statistically significant if the chances of obtaining its value by chance alone are quite small: usually less than 5 in 100.

When does a variable have a "restricted range"? Why do we have to check for restricted range on both predictor and criterion when interpreting validity coefficients? GRE example

A variable has a "restricted range" if all scores for that variable fall very close together. The whole point of correlations is that there must be variability in the two things you are trying to find a relationship between. An example of this would be the fact that most grad schools in psychology have really restricted range in GPA once in grad school. GRE EXAMPLE There are at least three explanations for the modest performance of the GRE for predicting graduate-school performance. First, the GRE may not be a valid test for selecting graduate students. Second, those students who are admitted to graduate school represent such a restricted range of ability that it is not possible to find significant correlations. Students with low GRE scores are usually not admitted to graduate school and, therefore, are not considered in validity studies. Third, grades in graduate school often represent a restricted range. Once admitted, students in graduate programs usually receive A's and B's.

Cultural Test Bias Hypothesis (CTBH) - how does this explain group differences?

According to CTBH, group differences are believed to be the result of test bias and not the result of actual differences - the validity of the CTBH is one of the most crucial scientific questions facing psychology today

Unqualified Individualism - According to this viewpoint, a test is fair if.....

According to this viewpoint, a test is fair if it finds the best candidates for the job or for admission to school. If race or gender was a valid predictor of performance over and above the information in the test, then the unqualified individualist would see nothing wrong with considering this information in the selection process.

what is another word for criterion contamination

Also called predictor-criterion overlap - when the predictors and criteria are dependent (e.g., psychopathy predicting aggression)

Why is an estimate of reliability based on two half-tests an underestimate? How do we correct for this?

An estimate of reliability based on two half-tests would be deflated because each half would be less reliable than the whole test - test scores gain reliability as the number of items increases. To correct for half-length, you can apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test.
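
The Spearman-Brown correction for half-length (standard form), where r_half is the observed correlation between the two halves:

r_{full} = \frac{2 r_{half}}{1 + r_{half}}

E.g., if the two halves correlate at .70, the estimated full-length reliability is 2(.70)/(1 + .70) ≈ .82.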

relationship between reliability and validity?

Attempting to define the validity of a test will be futile if the test is not reliable. Theoretically, a test should not correlate more highly with any other variable than it correlates with itself. Because validity coefficients are not usually expected to be exceptionally high, a modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable.

refers to a situation in which changes in an outcome variable (smoking) can be thought to have resulted from some third variable that is related to the treatment that you administered

Confounding

Where do sources of error come from in time-sampled behavior observation studies? Behavioral observation systems are frequently unreliable because ....

Because psychologists cannot always monitor behavior continuously, they often take samples of behavior at certain time intervals. Sources of error introduced by time sampling are similar to those associated with sampling items from a large domain, and can be handled using sampling theory and methods such as alpha reliability. Behavioral observation systems are frequently unreliable because of discrepancies between true scores and the scores recorded by the observer.

Classical test theory uses the ______________ as the basic measure of error. Usually this is called the ___________

Because we usually assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error. Usually this is called the standard error of measurement:
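
The formula the card is pointing to (standard classical test theory form), where \sigma_X is the standard deviation of observed scores and r_{XX} the reliability of the test:

\sigma_E = \sigma_X \sqrt{1 - r_{XX}}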

Bias in ________ validity is the most common criticism of using standardized tests with minority groups

Bias in content validity is the most common criticism of using standardized tests with minority groups - first place to look

Bias in criterion validity occurs when

Bias in criterion validity occurs when tests differentially predict outcomes across groups ◦ E.g., in men, IQ is correlated with GPA at r = .5 (higher IQ = higher GPA) ◦ E.g., in women, IQ is correlated with GPA at r = .2 --> they might even perform the same on each item, but... being smarter as a man translates into a higher GPA, while being smarter as a woman doesn't translate into a higher GPA - increases in your predictor might not produce the same increases in your outcome. *Think about what the test is doing, because it's not predicting outcomes in the same way for different groups - e.g., being smarter as a white person was correlated with job success but not for Black people.

what is "bias" concerned with? test bias reduces..?

Bias: Does the test measure what it is designed to measure, for everyone? - Therefore, test bias reduces validity for specific groups. - Instead of validity asking if the test measures what it's supposed to measure, bias asks if it does so for EVERYONE - a type of validity.

how can we estimate parallel-forms reliability?

Can estimate reliability by ◦ generating a large number of items, ◦ randomly splitting items to create 2 tests, ◦ administering both tests to the same group at the same time (half the class would do form 1 first, half form 2), ◦ then correlating the two tests. **Want a high correlation - both forms working the same way.

Classical test theory assumes that the true score for an individual will .....with repeated applications

Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test. Because of random error, however, repeated applications of the same test can produce different scores.

what are the common criticisms of bias in content validity

Common criticisms: ◦ Culturally biased content: items ask for information that minorities have not had an equal opportunity to learn (e.g., opera; openness to experience - a lot of it is about aesthetics, etc., that isn't available to other cultures) ◦ Culturally biased scoring: scoring is improper because minorities give "incorrect" answers that are actually correct in their culture ◦ Culturally biased format: the wording of items or test format is unfamiliar, increasing cognitive load and/or anxiety, reducing test time, etc. (e.g., rotating something in space)

Confidence intervals reflect... As reliability increases, SEM and confidence intervals get... This is called?

Confidence intervals reflect a range that contains the examinee's true score based on probability. As reliability increases, SEM and confidence intervals get smaller ---> precision

Construct Validity (the most confusing - good luck in the book)

Construct Validity - Is the construct itself valid? Particularly difficult when there is no definite criterion (e.g., love). More about the shape of the construct (is love a real thing? What counts? How should we measure it? What should it predict?) OR modern definitions of construct validity include all other types of validity. AKA he says it could be two things: 1. Is the construct itself valid - not about the measure, but does this thing even exist? E.g., FOMO - is it a real thing (what's its shape, what counts, what doesn't count)? ****About whether the construct is real and what its boundaries are*** THIS IS THE TYPE ON THE QUIZ. 2. Umbrella term - some scientists say it's anything we talked about today - anything that is used to measure validity.

2 core concepts of construct-related validity?

Construct underrepresentation: the failure to capture important components of a construct. For example, if a test of mathematical knowledge included algebra but not geometry, the validity of the test would be threatened by construct underrepresentation. Construct-irrelevant variance: when scores are influenced by factors irrelevant to the construct. For example, a test of intelligence might be influenced by reading comprehension, test anxiety, or illness.

how is construct validity established?

Construct validity evidence is established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it. Construct validation involves assembling evidence about what a test means. This is done by showing the relationship between a test and other tests and measures. Each time a relationship is demonstrated, one additional bit of meaning can be attached to the test. Over a series of studies, the meaning of the test gradually begins to take shape.

What is content sampling error? When would it be small? Which two reliabilities estimate this?

Content sampling error results from differences between: ◦ the sample of items on the test, and ◦ the total domain of items (i.e., all possible items). E.g., a test for neuroticism that only asks about sadness - your content isn't asking about all domains - not representative. If test items are a good sample of the domain, content sampling error will be small. Content sampling is typically considered the largest source of measurement error. Internal consistency & parallel-forms reliability estimate this type of error - you will see a reflection of this problem in these.

What does convergent evidence prove? What does discriminant evidence prove?

Convergent Evidence When a measure correlates well with other tests believed to measure the same construct, convergent evidence for validity is obtained. Discriminant Evidence/Divergent Validation This type of evidence essentially stands as proof that the test measures something unique. Discriminant evidence indicates that the measure does not represent a construct other than the one for which it was devised.

Why do we have to review evidence for validity generalization when interpreting validity coefficients?

Criterion-related validity evidence obtained in one situation may not be generalized to other similar situations. Generalizability: the evidence that the findings obtained in one situation can be generalized—that is, applied to other situations. This is an issue of empirical study rather than judgment.

__________________ remains the most commonly used reliability index.

Cronbach's alpha

What is the dominant hypothesis answering the question of why there are group differences?

Cultural Test Bias Hypothesis (CTBH)

What is the most important thing that usually threatens internal validity?

Confounding

face validity

Does the test seem to measure what it's supposed to? - E.g., "I care about people" --> empathy, high face validity - E.g., "I prefer baths to showers" --> empathy, low face validity - The items themselves don't matter; test makers are trying to find any items in the world that separate the high-empathy group from the low-empathy group - e.g., women vs. men (women have higher empathy and tend to prefer baths) - but that's not a good item in terms of face validity.

What is an example of test fairness being used in legislation?

EEOC guidelines, for example, govern the use of testing for employee selection purposes. The 1978 guidelines made clear that the government will view any screening procedure, including the use of psychological tests, as having an adverse impact if it systematically rejects substantially higher proportions of minority than nonminority applicants.

What is an example of high SEM, big confidence interval (low precision)? What is an example of low SEM, small confidence interval (high precision)?

Example of high SEM, big confidence interval (low precision): ◦ "Johnny's IQ score is 113 (95% confident his true IQ falls between 103 and 123)." Example of low SEM, small confidence interval (high precision): ◦ "Johnny's IQ score is 113 (95% confident his true IQ falls between 111 and 115)." -> high confidence in your measurement.
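
A minimal sketch of how SEM drives the interval width (IQ scale, SD = 15; the reliability values are my own assumptions, picked to roughly reproduce the two examples above):

import math

def sem(sd, reliability):
    # Standard error of measurement: sd * sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

def ci95(score, sd, reliability):
    # 95% confidence interval around the observed score.
    margin = 1.96 * sem(sd, reliability)
    return (score - margin, score + margin)

print(ci95(113, sd=15, reliability=0.88))   # wide: roughly (103, 123)
print(ci95(113, sd=15, reliability=0.995))  # narrow: roughly (111, 115)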

refers to an experimenter behaving in a different way with different groups in a study, which leads to an impact on the results of this study

Experimenter bias

why are we turning away from the classical test theory to IRT

First, classical test theory requires that exactly the same test items be administered to each person. E.g., with intelligence, many of the items are too easy and some of them may be too hard. Using IRT, the computer is used to focus on the range of item difficulty that helps assess an individual's ability level. For example, if the person gets several easy items correct, the computer might quickly move to more difficult items. If the person gets several difficult items wrong, the computer moves back to the area of item difficulty where the person gets some items right and some wrong. Then, this level of ability is intensely sampled. ***The overall result is that a more reliable estimate of ability is obtained using a shorter test with fewer items.

Why do we have to look at what the criterion means when interpreting a validity coefficient?

For applied research, the criterion should relate specifically to the use of the test. Because the SAT attempts to predict performance in college, the appropriate criterion is GPA, a measure of college performance. Any other inferences made on the basis of the SAT require additional evidence.

Formula 21, or KR21, is a special case of the reliability formula that does not require ..... It assumes... The KR21 formula usually _____ the split-half reliability.

Formula 21, or KR21, is a special case of the reliability formula that does not require the calculation of p and q for every item - it uses an approximation of the sum of the pq products: the mean test score. It assumes that all the items are of equal difficulty, or that the average difficulty level is 50%. Difficulty is defined as the percentage of test takers who pass the item. **In practice, these assumptions are rarely met, and it is usually found that the KR21 formula underestimates the split-half reliability.
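
The standard KR21 approximation, where k is the number of items, \bar{X} the mean test score, and \sigma_X^2 the test score variance (the mean score stands in for the \sum pq term of KR20):

KR_{21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}(k - \bar{X})}{k \sigma_X^2}\right)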

Some of the oldest views hold that we can explain the poorer testing outcomes of minority groups through the __________ argument. - What is this argument?

Genetic Determinism Argument: the argument that observed race differences are caused by distinct gene variations that can be measured scientifically.

Black-White Differences in the USA: The Primary Controversy. What is the mean difference between the two groups? What happens to the size of the difference when you take into account demographic variables (most importantly ________)?

Group differences on IQ tests have received extensive investigation. Random samples of Blacks and Whites show a mean difference of ~1.0 SD (15 IQ points) (85 for Black samples, 100 for White). When a number of demographic variables are taken into account (most notably SES), the size of the difference reduces by nearly half to ~0.6 SD (9 IQ points) ---> SES matters!

Groups of people defined on a ____________ basis (e.g., race, sex, religion, etc.) do not always show the same average scores on psychological tests

Groups of people defined on a nominal basis (e.g., race, sex, religion, etc.) do not always show the same average scores on psychological tests

What is another factor that needs to be taken into account and calls into question these large-scale race group IQ differences?

IQ scores have been improving rapidly over the last 80 years (Flynn Effect) - the average person's IQ score now is about 35 points above what it was in the 1930s - compared to people now, people in the early 1900s would score in the intellectually disabled range.

In practice, is DIF effective? SAT example

In one study, 27 items from the original SAT were eliminated because ethnic groups consistently answered them differently. Then the test was rescored for everyone. Although it seems this procedure should have eliminated the differences between the two groups, it actually had only slight effects because the items that differentiated the groups tended to be the easiest items in the set. When these items were eliminated, the test was harder for everyone.

In studies where researchers tried to identify biased test questions, were there significant improvements in the performance of those whose test scores seemed to be consistently lower? do Researchers find the same result when they try to find types of items that may be discriminatory?

In studies where researchers tried to identify biased test questions, there were no significant improvements in the performance of those whose test scores seemed to be consistently lower. Researchers find the same result when they try to find types of items that may be discriminatory.

In terms of assessment, bias is a...... What does biased measurement do?

In terms of assessment, bias is a systematic influence that distorts measurement or prediction by test scores (not random) biased measurement systematically underestimates/overestimates the value of the variable it is designed to measure

In which of the 3 distributions shown would drawing conclusions on the basis of fewer observations likely produce fewer errors than the others (a wide curve or a skinny curve)?

In this case, you might not want to depend on a single observation because it might fall far from the true score. The far-right distribution displays a tiny dispersion around the true score. In this case, most of the observations are extremely close to the true score, so drawing conclusions on the basis of fewer observations will likely produce fewer errors than it will for the far-left curve.

What is the error of inter-rater differences? What does it also include? Which type of reliability is this reflected in?

Inter-Rater Differences ◦ When scoring is subjective, inter-rater differences can introduce error Also includes: ◦ Errors in administration ◦ Clerical errors Inter-Rater reliability estimates this type of error - can use an adjusted index of agreement such as the kappa statistic.

What is internal validity? It also reflects... Internal validity depends largely on ______________________

Internal Validity: the extent to which a study establishes trustworthy cause-and-effect relationships It also reflects how much you've eliminated alternative explanations for a finding ◦ E.g., if you implement a smoking cessation program with a group of people, how sure can you be that any improvement seen in the treatment group is due to the treatment per se? Internal validity depends largely on rigorous study procedures

How do we evaluate test-retest reliability? One thing you should always consider is the possibility of a _______

Just administer the same test on two well-specified occasions and then find the correlation between scores from the two administrations. One thing you should always consider is the possibility of a carryover effect - when the first testing session influences scores from the second session. ---> In cases where the changes are systematic, carryover effects do not harm the reliability. If something affects all the test takers equally, then the results are uniformly affected and no net error occurs. - E.g., practice effects.

why would knowing how groups differ in their approaches to problem solving be helpful? 2

Knowing how groups differ in their approaches to problem solving can be helpful for two reasons. 1. it can teach us important things about the relationship between socialization and problem- solving approaches. This information can guide the development of pluralistic educational programs. 2. knowing more about the ways different groups approach problems can lead to the development of improved predictors of success for minority groups.

What is the distinction between reliability and information?

Low reliability implies that comparing gain scores in a population may be problematic. For example, average improvements by schools on a statewide achievement test may be untrustworthy if the test has low reliability. Low information suggests that we cannot trust gain-score information about a particular person. This might occur if the test taker was sick on one of the days a test was administered, but not the other. However, low reliability of a change score for a population does not necessarily mean that gains for individual people are not meaningful.

Most of the controversies around the SOMPA have to do with....

Most of the controversies around the SOMPA have to do with its validity, because this kind of test is hard to validate with the same methods as other tests (the creator of this test advocates for this stance). She claims that looking at what performance it predicts in the same manner is not applicable.

Is the Flynn effect due to new technologies and getting more access to knowledge? Does the rate of the Flynn effect depend on where you are in the world?

NO! It's not just that people know more - raw fluid and spatial intelligence has increased dramatically. No - we see it all across the world; the rate of increase shows the same sloped increase (some countries may start earlier), with no tapering off.

Does the fact that the language used in intelligence tests such as the classic Stanford-Binet IQ scales might not be the language that African Americans of poorer SES encountered in their lives account for differences between groups? STUDY

No Scheuneman 1987 100 children were given the Stanford Binet test, half of which were given a version of the test that used African American dialect. Key Finding: The results demonstrated that the advantage produced by having the test in African American dialect translates into less than a 1-point increase in test scores.

What are some ways you can assess the different sources of variation within a single test (instead of parallel forms)?

One method is to evaluate the internal consistency of the test by dividing it into subcomponents. = Split-Half Method

One of the problems with classical test theory is that it assumes that behavioral dispositions are.....

One of the problems with classical test theory is that it assumes that behavioral dispositions are constant over time. --In classical test theory, these variations are assumed to be errors

What could BITCH be useful for, given its validity and reliability? 3

One of the rationales for the test is that it will identify children who have been unfairly assigned to classes for the educable mentally retarded (EMR) on the basis of IQ scores. --> Preliminary tests to determine whether this is likely to be a useful use of the BITCH have shown very little to no reclassification of children who were assigned to lower-level courses. In its present state, the BITCH can be a valuable tool for measuring white familiarity with the African American community. When white teachers or administrators are sent to schools that have predominantly African American enrollments, the BITCH can help determine how much they know about the culture. Furthermore, the BITCH can help assess the extent to which an African American is in touch with his or her own community.

One form of reliability analysis is to determine the error variance that is attributable to the selection of one particular set of items. -- what is this called?

Parallel forms reliability compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same. When both forms of the test are given on the same day, the only sources of variation are random error and the difference between the forms of the test. When the forms are given on different days, error associated with time sampling is also included in the estimate of reliability.

What is reliability's relationship to validity?

Reliability is a necessary, but insufficient, condition for validity ----> for something to be valid it must be reliable (it needs sufficient reliability), but that's not enough for it to be valid. For interpretation of scores to be valid, test scores must be reliable. However, reliable scores do not guarantee valid score interpretations.

is BITCH a reliable test? is it valid?

Reliability studies have actually shown that this test is not bad at all. It is as reliable as other standard tests. not valid - The difficulty is that one cannot determine whether the BITCH predicts how well a person will survive on the streets or how well he or she will do in school, in life, or in anything else.

can have effects like learning and fatigue specific to the test

Repeated testing

So... what are the things we've seen so far that could explain how differences in IQ between races, genders, etc. aren't real - differences start to disappear. IQ changing rapidly over time suggests that a ______ explanation is weak.

SES explained about half the race difference. Stereotype threat and other factors can explain much of the rest. IQ changes rapidly over time and in response to different environments; the genetic hypothesis is weak.

Use of Quotas - selection procedures are regarded as biased if... This places less emphasis on...

Selection procedures are regarded as biased if the actual percentage of applicants admitted differs from the percentage in the population; each group should demonstrate a fair share of the representation This fair-share process places less emphasis than does testing on how well people in the different groups will do once selected.

By applying the ______________, one can estimate how many items will have to be added in order to bring a test to an acceptable level of reliability.

Spearman-Brown prophecy formula
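
In its general form, where N is the factor by which the test is lengthened and r the current reliability; solving for N tells you how much longer the test must be to reach a desired reliability r*:

r_N = \frac{N r}{1 + (N - 1) r} \quad \Rightarrow \quad N = \frac{r^*(1 - r)}{r(1 - r^*)}

E.g., to raise a test from r = .70 to r* = .90: N = (.90)(.30)/((.70)(.10)) ≈ 3.9, so a 20-item test would need roughly 77 items.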

refers to following specific procedures for the administration of a treatment so as not to introduce any confound

Standardized protocol

What is a cultural factor that could affect IQ scores? How does this work?

Stereotype Threat: When people worry about confirming a stereotype about their group - Happens even when people don't believe the stereotype about their group - Anxiety about not confirming stereotype --> lowers performance - Makes certain activities punishing --> avoidance of those activities --> lower ability --> self-fulfilling - happens with black students, women in math etc.

READINGS: (don't need to know - p. 121: Summary of Guidelines for Reliability; p. 533: Different Models of Test Fairness) Tests that are relatively free of measurement error are deemed to be _________

Tests that are relatively free of measurement error are deemed to be reliable

CHAPTER 5 The _________ created the standards for psychological testing and validity: validity is ....

The American Psychological Association (APA) created the standards for psychological testing and validity: validity is the evidence for inferences made about a test score. Validity: evidence that a test used for the selection or promotion of employees has a specific meaning

The __________ was designed to prove that differences in scores were a result of ignorance, not a lack of inherent intelligence; its purpose is to demonstrate that there is a body of information about which the white middle class is ignorant. Does this test have validity?

The Chitling Test It is worth noting that this test has no proven validity beyond face validity. The creators describe this test as half serious, and more as a method to prove a point than anything else.

In what situations is the KR20 formula not appropriate? What do we use then (the most general method of finding estimates of reliability through internal consistency)? What is the main difference between this and KR20?

The KR20 formula requires that you find the proportion of people who got each item "correct." There are many types of tests, though, for which there are no right or wrong answers, such as many personality and attitude scales - e.g., if there's a continuum between agreement and disagreement. A more general reliability estimate is coefficient alpha, or α. The only real difference is the way the variance of the items is expressed. Coefficient alpha is a more general reliability coefficient than KR20 because S_i^2 can describe the variance of items whether or not they are in a right-wrong format.
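
The coefficient alpha formula in this form, where S_i^2 is the variance of item i and S_X^2 the variance of total test scores (with right-wrong items, S_i^2 = p_i q_i and alpha reduces to KR20):

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum S_i^2}{S_X^2}\right)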

The SEM and confidence intervals should remind us that...

The SEM and confidence intervals should remind us that scores are not perfect

How high must a reliability coefficient be before it is "high enough"?

The answer depends on the use of the test. Estimates in the range of .70 to .80 are good enough for most purposes in basic research. Some people have argued that it would be a waste of time and effort to refine research instruments beyond a reliability of .90. Reliabilities greater than .95 are not very useful because they suggest that all of the items are testing essentially the same thing and that the measure could easily be shortened. In clinical settings, high reliability is extremely important - a test with a reliability of .90 might not be good enough. For a test used to make a decision that affects some person's future, evaluators should attempt to find a test with a reliability greater than .95.

The correlation between ELPs and school achievement is approximately ________, whereas the correlation between the WISC-R and school achievement is near ____. Mercer refuted these critics by arguing that... Why has the SOMPA lost a lot of its influence?

The correlation between ELPs and school achievement is approximately .40, whereas the correlation between the WISC-R and school achievement is near .60. Mercer refuted these critics by arguing that the test is not designed to identify which children will do well in school but to determine which children are mentally retarded. Because the consequences of identifying fewer minority children as mentally retarded are not clearly positive, the SOMPA has lost a lot of its influence.

The Black Intelligence Test of Cultural Homogeneity (BITCH) The creators of this test sought out to create a ......

The creators of this test sought out to create a survival quotient (SQ) for African Americans that would, instead of testing their intelligence, test their ability to survive and thrive within their own community. This test asks respondents to define 100 vocabulary words relevant to African American culture. The words came from the Afro-American Slang Dictionary and from Williams's (the creator of the test) personal experience interacting with African Americans. African American people obtain higher scores than do their white counterparts on the BITCH. When Williams administered the BITCH to 100 16- to 18-year-olds from each group, the average score for African American subjects was 87.07 (out of 100). The mean score for the whites was significantly lower (51.07).

The _____________ model is another central concept in classical test theory. What does this model do?

The domain sampling model is another central concept in classical test theory. --> It considers the problems created by using a limited number of items to represent a larger and more complicated construct. E.g., what we are attempting to evaluate is how well you can spell, which would be determined by your percentage correct if you had been given all the words in the English language. This percentage would be your "true score." We have to use a sample --> our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of your true ability.

What are 2 things we can do when we have a test with low reliability?

Two common approaches are to increase the length of the test and to throw out items that run down the reliability. Another procedure is to estimate what the true correlation would have been if the test did not have measurement error.

The _______ is the best method for assessing the level of agreement among several observers. -Kappa indicates... Values of kappa may vary between....and....

The kappa statistic (Cohen's kappa). - Kappa indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement. Values of kappa may vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone).
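
A minimal sketch of Cohen's kappa for two raters (toy data; the chance term assumes each rater rates independently at their own base rates):

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    # Observed agreement: proportion of cases rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's base rates.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(r1, r2))  # 0.5: observed .75 vs. .50 expected by chance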

Why do we have to look for changes in the cause of relationships when interpreting validity coefficients?

The logic of criterion validation presumes that the causes of the relationship between the test and the criterion will still exist when the test is in use. Though this presumption is true for the most part, there may be circumstances under which the relationship changes.

How can the ratio of true score variance to observed score variance be interpreted? Explain how we would find the percentage of variation attributable to random error. Suppose you are given a test that will be used to select people for a particular job, and the reliability of the test is .40... What percent of variation will be explained by real differences?

The ratio of true score variance to observed score variance can be thought of as a percentage. In this case, it is the percentage of the observed variation that is attributable to variation in the true score. If we subtract this ratio from 1.0, then we will have the percentage of variation attributable to random error. Suppose you are given a test that will be used to select people for a particular job, and the reliability of the test is .40. When the employer gets the test back and begins comparing applicants, 40% of the variation or difference among the people will be explained by real differences among people, and 60% must be ascribed to random or chance factors.

The reason for gathering criterion validity evidence is that......

The reason for gathering criterion validity evidence is that the test or measure is to serve as a "stand-in" for the measure we are really interested in. In the marital example, the premarital test serves as a stand-in for estimating future marital happiness.

The reliability coefficient is....

The reliability coefficient is the ratio of the variance of the true scores on a test to the variance of the observed scores:
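
The formula the card ends with (standard classical test theory form):

r = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}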

The scores are therefore adjusted for ________, and are called ________

The scores are therefore adjusted for socioeconomic background, and are called estimated learning potentials (ELPs).

Why do we have to look at the sample size of the validity study when interpreting validity coefficients?

The smaller the sample, the more likely chance variation in the data will affect the correlation. Thus, a validity coefficient based on a small sample tends to be artificially inflated.

The standard error of measurement allows us to estimate....

The standard error of measurement allows us to estimate the degree to which a test provides inaccurate readings; that is, it tells us how much "rubber" there is in a measurement --The larger the standard error of measurement, the less certain we can be about the accuracy with which an attribute is measured

What are some of the problems with the split-half technique, and what technique avoids these?

The two halves may have different variances. The split-half method also requires that each half be scored separately, possibly creating additional work. The Kuder-Richardson technique avoids these problems because it simultaneously considers all possible ways of splitting the items.

why do we have to look at subject population in the validity study when interpreting val. coef.

The validity study might have been done on a population that does not represent the group to which inferences will be made.

Why is it a big stretch to call face validity a type of validity?

These appearances can help motivate test takers because they can see that the test is relevant. Otherwise, it's a big stretch to call this validity because it offers no empirical evidence for its relevance.

the reliability of the observers -what are the diff reliability estimates? -all of these consider...

These reliability estimates have various names, including interrater, interscorer, interobserver, or interjudge reliability. All of the terms consider the consistency among different judges who are evaluating the same behavior.

How is the time in between tests important in test-retest reliability?

Time in between is important: ◦ the shorter the gap, the higher the test-retest correlation (fewer intervening variables and less chance for true change) - e.g., personality changes after a year ◦ But carryover/practice effects are also stronger

Since we can't know the true score, how can we estimate reliability?

To estimate reliability, we can create many randomly parallel tests by drawing repeated random samples of items from the same domain. In the spelling example, we would draw several different lists of words randomly from the dictionary and consider each of these samples to be an unbiased test of spelling ability. Then, we would find the correlation between each of these tests and each of the other tests. The correlations would then be averaged.

What is concurrent evidence (criterion validity) useful for?

We know that this type of evidence is more useful for blue-collar jobs, for which a sample (for example) of a person's work may be enough to estimate how well they might do in a related job. Another use of concurrent validity evidence occurs when a person does not know how he or she will respond to the criterion measure. For example, suppose you do not know what occupation you want to enter (e.g., interest inventories).

When measurement is inconsistent, we call it ____________

When measurement is inconsistent, we call it measurement error. - Measurement error happens everywhere, but particularly in psychology. Hard science = physics - small measurement errors. Soft science = psychology - introduces more and more measurement error because these are really complex systems.

When the prophecy formula is used, certain assumptions are made that may or may not be valid. What are they?

When the prophecy formula is used, certain assumptions are made that may or may not be valid. One of these assumptions is that the probability of error in items added to the test is the same as the probability of error for the original items in the test. However, adding many items may bring about new sources of error, such as the fatigue associated with taking an extremely long test.

how does split-half reliability work? if the items get progressively more difficult, then you might be better advised to use the ___________

a test is given and divided into halves that are scored separately. The results of one half of the test are then compared with the results of the other. If the items get progressively more difficult, then you might be better advised to use the odd-even system, whereby one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.

Differential Process Theory - what is it?

a theory that maintains that different strategies may lead to effective solutions for many types of tasks

The System of Multicultural Pluralistic Assessment (SOMPA) - designed based on what?

based on the idea that the feedback given to the minority groups is not that they are ignorant about the rules for success in another culture (just as the dominant group would be in a minority culture) but that they are stupid and unlikely to succeed. --->Mercer emphasized that one must take into consideration that people work from different bases of knowledge.

why is it hard to link different kinds of bias to specific types of validity? but different kinds of bias can be loosely associated with... (3)

because these types of validity are not all that distinct... ◦ content validity ◦ criterion validity ◦ construct validity

in a study refers to participants being unaware of their intervention

blinding

can we partition the variance to reflect sources of variance (i.e., the different types of errors)?

yes - we can identify how much of the variability is actual true variability and how much is due to other types of error

the test-retest method is only useful on...

characteristics or traits that don't change over time. the difference between scores on a varying trait could reflect one of two things: (1) a change in the true score being measured or (2) measurement error. --Clearly the test-retest method applies only to measures of stable traits.

this model (the domain sampling model) conceptualizes reliability as...

conceptualizes reliability as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score. -->As the sample gets larger, it represents the domain more and more accurately. The greater the number of items, the higher the reliability.

This is no longer thought of as a truly distinct form of validity. It is, however, the only form of validity other than face validity that is logical rather than statistical.

content-related validity

measurement theory does allow one to estimate what the correlation between two measures would have been if they had not been measured with error. - this is called.. to use these methods, one needs to know only...?

correction for attenuation To use the methods, one needs to know only the reliabilities of two tests and the correlation between them.
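A one-function sketch of the correction, assuming you already have the two reliabilities and the observed correlation between the tests; the numbers in the example are illustrative:

```python
import math

# Correction for attenuation: estimated correlation between two
# measures if both had been measured without error.
def correct_for_attenuation(r_xy, r_xx, r_yy):
    return r_xy / math.sqrt(r_xx * r_yy)

# e.g., observed r = .40, reliabilities .70 and .80 (made-up values)
print(correct_for_attenuation(0.40, 0.70, 0.80))  # ~0.53
```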

some applications of psychological testing require a difference score, which is created by.... whenever comparisons between two different attributes are being made, one must make the comparison in ______

created by subtracting one test score from another. - This might be the difference between performances at two points in time - for example, when you test a group of children before and after they have experienced a special training program. - It may also be the difference between measures of two different abilities, such as whether a child is doing better in reading than in math. whenever comparisons between two different attributes are being made, one must make the comparison in Z, or standardized, units

how would we find convergent validity for criterion-referenced tests?

criterion-referenced tests would compare scores on the test to scores on other measures that are believed to be related to the test.

In the last few decades, the issue of ___________ has been a major contemporary concern (mostly focused on ____ and ____)

cultural bias - has become a legal and political issue - full objectivity is difficult, even in science, but scientists should attempt to minimize bias as much as possible

why do we use reverse-keyed items?

because if someone's not reading all the items, they're just putting agree, agree, agree - this keeps people on their toes

The social-system component attempts to ...

determine whether a child is functioning at a level that would be expected by social norms. Mercer has emphasized that the social-system approach is narrow because only the dominant group in society defines the criteria for success.

Pluralistic The pluralistic component of the SOMPA recognizes that

different subcultures are associated with different life experiences. Only within these subgroups do individuals have common experiences. Thus, tests should assess individuals against others in the same subculture.

refers to researchers also being unaware of the intervention being administered

double blinding

Classical test score theory assumes that ... is this the case?

each person has a true score that would be obtained if there were no errors in measurement. no - the score observed for each person almost always differs from the person's true ability or characteristic. -->The difference between the true score and the observed score results from measurement error.
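A tiny simulation of the X = T + E assumption; all distribution parameters here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

true = rng.normal(100, 15, size=10_000)   # each person's true score T
error = rng.normal(0, 5, size=10_000)     # random measurement error E
observed = true + error                   # classical test theory: X = T + E

# Reliability = true-score variance / observed-score variance
print(true.var() / observed.var())        # ~0.90 with these parameters
```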

Qualified Individualism

embraces the notion that one should select the best-qualified people. Qualified individualists, however, recognize that although failing to include group characteristics (race, gender, and religion) may lead to differential accuracy in prediction, this differential prediction may counteract known effects of discrimination.

To ensure that the items measure the same thing, two approaches are suggested:

factor analysis; discriminability analysis - examine the correlation between each item and the total score for the test. if the correlation between the performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test.

what does the regression line allow us to do?

if there's a strong correlation between how smart you are (IQ) and your GPA, we can use the regression line to make predictions of your GPA based on your IQ, e.g., 90 IQ = 2.0 GPA
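A minimal sketch of that prediction step; the (IQ, GPA) pairs are made up purely to show the mechanics:

```python
import numpy as np

# Hypothetical (IQ, GPA) pairs -- invented data for illustration
iq = np.array([85, 90, 100, 110, 120, 130])
gpa = np.array([1.8, 2.0, 2.6, 3.0, 3.4, 3.8])

slope, intercept = np.polyfit(iq, gpa, 1)   # fit the regression line
print(intercept + slope * 90)               # predicted GPA for IQ = 90
```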

is it good to have high internal reliability?

if you have super high internal reliability (0.98), it suggests you're essentially asking the same question over and over again - redundant questions (but you don't want low internal reliability either - you want medium)

what is the content domain

if you're trying to cover a construct, you want to make sure you're asking items relevant to that construct (as opposed to irrelevant things like showers, chocolate) - you want to get somewhere within the construct. content domain = the imaginary perimeter that your definition creates around a construct; items should be within that definition

3. Split-half Reliability

if we randomly split the 6 items into two groups of three, we can see if the scores on one half correlate with the scores on the other half - we can do this over and over and over again
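A rough simulation of the idea, with fake data generated so the halves actually correlate; the Spearman-Brown step at the end projects the half-test correlation back up to full length:

```python
import numpy as np

rng = np.random.default_rng(1)
items = rng.normal(size=(200, 6))        # fake 6-item data, 200 people
items += rng.normal(size=(200, 1))       # shared factor so items correlate

half_a = items[:, [0, 2, 4]].sum(axis=1)  # one half (here: odd items)
half_b = items[:, [1, 3, 5]].sum(axis=1)  # the other half (even items)

r_half = np.corrcoef(half_a, half_b)[0, 1]
r_full = (2 * r_half) / (1 + r_half)      # Spearman-Brown step-up
print(r_half, r_full)
```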

in a difference score, _____ is expected to be larger than either the _____ score or _____ because... _____ might be expected to be smaller than _____ because... as a result, the reliability of a difference score is expected to be... If two tests measure exactly the same trait, then the score representing the difference between them is expected to have a reliability of ______

in a difference score, E is expected to be larger than either the observed score or T because E absorbs error from both of the scores used to create the difference score. T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created. ***the reliability of a difference score is expected to be lower than the reliability of either score on which it is based. -->this occurs in all cases except when the correlation between the two tests is 0. ---> If two tests measure exactly the same trait, then the score representing the difference between them is expected to have a reliability of 0 (because the difference between them is nothing - they do not reliably differ)
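A sketch of one standard textbook formula for the reliability of a (standardized) difference score, built from the two tests' reliabilities and their intercorrelation; the input values are illustrative:

```python
# Reliability of a difference score from the reliabilities of the two
# tests (r_xx, r_yy) and the correlation between them (r_xy),
# assuming standardized scores.
def diff_score_reliability(r_xx, r_yy, r_xy):
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

print(diff_score_reliability(0.85, 0.85, 0.85))  # 0.0: same trait
print(diff_score_reliability(0.85, 0.85, 0.00))  # 0.85: unrelated traits
```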

what is adjacent agreement?

in this case, as long as you're within 1 point of each other, it counts as an agreement - this becomes subjective though

study on stereotype threat done in the 90s - instructions different but everything else the same. what did they find? what is this called?

instructions different but everything else the same - women did worse and men did better when the instructions said the test usually produces a gender difference (men assume it means men > women; women assume it means women < men) = stereotype lift - men doing better

what are the types of validity associated with STUDIES? what are the types of validity associated with MEASURES?

internal and external validity = studies ◦ Face Validity, Content Validity, Criterion Validity, Construct Validity = measures

Perhaps the most important new development relevant to psychometrics is _______theory. Most of the methods for assessing reliability depend on ______theory

item response theory. classical test theory

what is the "average" error called? High reliability = Low reliability =

just like the "average" deviation is called the standard deviation, the "average" error is called the standard error of measurement (SEM). high reliability = low SEM; low reliability = high SEM
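The usual textbook formula is SEM = SD × sqrt(1 − reliability); a quick sketch with invented numbers:

```python
import math

# Standard error of measurement from the test SD and reliability
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

print(sem(15, 0.95))  # high reliability -> small SEM (~3.4)
print(sem(15, 0.60))  # low reliability  -> large SEM (~9.5)
```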

Medical Component The rationale for the Medical Component is that ....

medical problems can interfere with a child's performance on mental measures and in school.

In summary, have studies supported the popular belief that items have different meanings for different groups?

no - however, people must continue to scrutinize the content of tests.

The method of ___________-provides one of the most rigorous assessments of reliability commonly in use.

parallel forms

test-retest reliability example - why is this important clinically?

people can change, and you can't really predict someone's personality all that well by knowing it 10 years earlier -> hard to separate true change from error. if you're trying to predict something clinically, this is so important - if your measure has low retest reliability, risk assessment could be useless - hard to predict things

when they pitted stereotypes against each other, what happened?

pitted stereotypes against each other - the old stereotypes that women do worse on math tests and Asian people do better (the participants were Asian-American women). no reminder = baseline performance of 50%; Asian reminder = boost of 10% = stereotype lift; female reminder = down 10% = stereotype threat

A major assumption in classical test theory is that errors of measurement are ________

random - CTT deals with random measurement errors, not systematic ones. although systematic errors are acknowledged in most measurement problems, they are less likely than other errors to force an investigator to make the wrong conclusions

what does it mean to Consider Differential Prediction when interpreting val. coef.?

relationships may not be the same for all demographic groups (ex. women VS men). Under these circumstances, separate validity studies for different groups may be necessary.

is there a bias in the commonly used combined regression equation used for SAT scores in minority and majority groups? what is this called?

significantly different regression lines were found for different groups. --The commonly used combined regression equation overpredicts how well minority students will do in college and underpredicts the performance of majority group students. In other words, it appears that the SAT used with a single regression line yields biased predictions in favor of minority groups and against majority groups. -> this is called INTERCEPT BIAS

first two publications on reliability?

Spearman - "The Proof and Measurement of Association between Two Things" (1904). Thorndike - An Introduction to the Theory of Mental and Social Measurements (1904).

what would high internal consistent reliability look like?

(rated strongly disagree to strongly agree) ◦ I really like chocolate ◦ I like chocolate ice cream more than vanilla ◦ Chocolate tastes better than most candy ◦ I hate chocolate (still a consistent construct, just reversed in direction = REVERSE-KEYED ITEM) **if you like chocolate, your answers should all go in the same direction and be similar to each other **"I hate chocolate" is an example of a reverse-keyed item - still consistent because you could put disagree

The basic point of divergence between the SOMPA and earlier approaches to assessment is that ....

the SOMPA attempts to integrate three different approaches to assessment: medical, social, and pluralistic.

what would a regression line/graph look like with a slope bias but no intercept bias? what is this called?

the correlation between men's IQ and GPA would be weaker, but for women, GPA is very responsive to IQ scores - for both men and women at an IQ of 0 there's no difference, but when you look at an IQ of 100, GPA only goes up a tiny bit for men (up to 1.3) and a lot for women (up to 2.5). Thus, IQ is more predictive of GPA for blue (women) than for red (men)! this is called Differential Criterion Validity

All of the measures of internal consistency evaluate ... _________ is one popular method for dealing with the situation in which a test apparently measures several different characteristics

the extent to which the different items on a test measure the same ability or trait. factor analysis is one popular method for dealing with the situation in which a test apparently measures several different characteristics - divide the items into subgroups, each internally consistent --->When factor analysis is used correctly, these subtests will be internally consistent (highly reliable) and independent of one another.

the ____________around the regression line is used to encircle a specified portion of the cases that constitute a particular group.

the isodensity curve

the larger the standard error of measurement, the larger the ______

the larger the standard error of measurement, the larger the confidence interval. - When confidence intervals are especially wide, our ability to make precise statements is greatly diminished.

what is the natural tension b/w internal and external validity?

the more rigorous a study is (the more controlled, etc.), the higher your internal validity, but the lower your external validity

the most useful index of reliability for the interpretation of individual scores is the _____________

the most useful index of reliability for the interpretation of individual scores is the standard error of measurement. This index is used to create an interval around an observed score. The wider the interval, the lower the reliability of the score. Using the standard error of measurement, we can say that we are 95% confident that a person's true score falls between two values.
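A small sketch of building that 95% interval from the SEM; the observed score, SD, and reliability below are invented:

```python
import math

# 95% confidence interval around an observed score, via the SEM
def confidence_interval(observed, sd, reliability, z=1.96):
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# e.g., observed IQ of 110 on a test with SD = 15 and reliability .90
print(confidence_interval(110, 15, 0.90))  # roughly (100.7, 119.3)
```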

what are reliability coefficients?

the squiggles represent variance - this is the same thing but applied to a group of people instead of just one person! - varying = varying across individuals

the standard deviation tells us... The standard error of measurement tells us... in practice, the ________ and the __________ are used to estimate the standard error of measurement.

the standard deviation tells us something about the average deviation around the mean. The standard error of measurement tells us, on the average, how much a score varies from the true score. in practice, the standard deviation of the test and the reliability coefficient are used to estimate the standard error of measurement.

There are two approaches to dealing with test bias in regards to minority groups...

those who think that minority testing should be banned and those who think that our testing methods should simply be adapted.

what would a regression line/graph look like with no intercept bias and no slope bias? do mean differences mean anything in this situation?

two groups, one performs better than the other group, e.g., blue = women - the average woman has a 110 IQ; red = men - the average man has 95. **because they're both on the same line, increases in IQ yield the same increases in GPA for both groups (the correspondence between IQ and GPA is identical) - some spread, and there are mean differences - but these mean differences don't tell us anything about criterion bias, because criterion bias is about making predictions

In an attempt to define test bias, Hunter and Schmidt (1976) identified three ethical positions that set the tone for much of the debate... These positions focus on ....

unqualified individualism, the use of quotas, and qualified individualism. These positions focus on the use of tests to select people either for jobs or for training programs (including college).

Although reliability estimates are often interpreted for individuals, estimates of reliability are usually based on observations of ....

usually based on observations of populations, not observations of the same individual. ex. it appears impossible to make a meaningful interpretation of the difference between scores on the same children that are taken at the beginning and at the end of a school year.

what does a well-evaluated test do (in terms of time intervals)?

a well-evaluated test will have many retest correlations associated with different time intervals between testing sessions. you also should consider what events occurred between the original testing and the retest. For example, activities such as reading a book, participating in a course of study, or watching a TV documentary can alter the test-retest reliability estimate.

4. Cronbach's Alpha

it is the average of every possible split-half correlation - gives you an indication of how well all the items go together. if measures are good, alpha is above 0.7. if you have a lot of items on a measure, then your alpha will go up
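A minimal sketch of the alpha computation, assuming a rows-by-people, columns-by-items score matrix; the data are randomly generated just to exercise the function:

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = people, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Fake correlated item data just to demonstrate the function
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6)) + rng.normal(size=(200, 1))
print(cronbach_alpha(data))
```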

if IQ is supposed to be stable through an individual's life, why can IQ change so quickly (Flynn effect, adoption)? what are the 6 possible explanations?

● Increases too rapid to be explained by genetics --> no way we're evolving and selecting for intelligence that quickly (no way it's selection pressures); increases in all countries, all races
● Better education?
● Test familiarity? (he thinks this is weak - because it's Gf that's going up)
● Better nutrition? Famine studies (iodine) --> low IQ (these have a real impact)
● More stimulating environments? (see the effect of enrichment in adoption studies - high SES = more enrichment = higher IQ)
● Infectious disease? --- States with higher infectious disease --> lower IQ --- Pregnancy --> if ill, the body transfers resources to the mother
● Less incest/inbreeding? Unlikely, although consanguinity reduces IQ by 20-30 points and first-cousin marriages were more common at one point (and certain countries have more inbreeding, and these places have much lower IQ - sometimes by 30-40 points) - Flynn believed this

formal definition of Content Bias: what's one way of flagging items that might be showing content bias?

◦ "An item is biased in content when it is more difficult for one group than another, when the general ability level of the groups is held constant, and no reasonable theoretical rationale exists to explain group differences." ~Reynolds, 1998 not just that one group ddoe worse - they actually have the same ability and one group does worse (theres ways to see this) Differential Item Functioning (DIF) examines this

corn example - are differences in the height of the corn really due to type? when would we be able to know this?

◦ 2 types of corn --> different height? ● Type A is planted on plowed, fertilized ground ● Type B is planted on rough ground ● Are differences in height really due to type? **** Can only know when all conditions are equal (i.e., when both types are planted on both kinds of ground) - this happens a lot when we see outcomes and infer differences that we shouldn't

Adjusted Agreement and the problem of chance agreement - Cohen's kappa ---> multiple choice test example

◦ A test has questions with four responses (a, b, c, d) ◦ People who don't know an item guess 25% right ◦ Therefore, expected scores run 25%-100%, not 0%-100% *some standardized tests take guessing into account and reduce your score ◦ Example: A test has 30 items, a-d. Someone knows 18 answers but completely guesses on the other 12 items. On average, they will get 3 of these right (25%). So their score will be 21/30, even though their real score should be 18/30. To correct for guessing, change 21/30 to 18/30. *some tests reduce the scores
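A tiny sketch of that correction-for-guessing arithmetic (the function name is just for illustration):

```python
# Classic correction for guessing: right minus wrong / (options - 1)
def corrected_score(n_right, n_wrong, n_options=4):
    return n_right - n_wrong / (n_options - 1)

# The example above: 21 right, 9 wrong on a 30-item, 4-option test
print(corrected_score(21, 9))  # 18.0 -- back to the "real" score
```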

what are 8 Factors that THREATEN Internal Validity

◦ Confounding: refers to a situation in which changes in an outcome variable (smoking) can be thought to have resulted from some third variable that is related to the treatment that you administered
◦ Historical events: may influence the outcome of studies that occur over a period of time. Examples of these events might include a change in political leader or a natural disaster
◦ Maturation effects: refer to changes over time. If a study takes place over a period of time in which it is possible that participants naturally changed in some way (grew older, changed habits), then it may be impossible to rule out whether effects seen in the study were simply due to the effect of time
◦ Repeated testing: can have effects like learning and fatigue specific to the test
◦ Instrumentation: refers to the impact of the actual testing instruments used in a study on how participants respond (e.g., penile plethysmography) - the instrument itself can threaten internal validity
◦ Regression to the mean: refers to the natural effect of participants at extreme ends of a measure becoming less extreme over time rather than the effect of an intervention (e.g., "sophomore slump") - the most extreme values at time 1 are likely to become more average at time 2 ---> why? if you have something really extreme at the ends of a distribution, the chances are that the person's true value is actually lower than that (e.g., someone had an amazing first soccer year because they're really talented but also just got lucky - sophomore slump in year 2; everyone got excited over nothing, and the luck went away) ---> because people are more likely to report to services when they're more extreme than normal, or get noticed when they're performing above average, it often LOOKS like the treatment worked, but most of the time it's just them returning to their normal - regression to the mean
◦ Attrition: refers to participants dropping out of or leaving a study, which means that the results are based on a biased sample of only the people who stuck with the program (e.g., bullying programs) - the most committed people stay in and it looks like it's working
◦ Experimenter bias: refers to an experimenter behaving in a different way with different groups in a study, which leads to an impact on the results of this study

3 Factors That Threaten External Validity

◦ Situational features: refer to the situation in which some feature of the particular situation was responsible for the effect, leading to limited generalizability (e.g., time of day, location, researcher characteristics)
◦ Sample features: refer to the situation in which some feature of the particular sample was responsible for the effect, leading to limited generalizability of the findings (e.g., highly motivated participants in a smoking cessation study)
◦ Selection bias: refers to the problem of differences between groups in a study that may relate to the independent variable. Get around this with matching participants or using large samples and random selection/assignment

what are the problems with absolute agreement? and the solutions to each?

◦ Unintuitive - education fixes this
◦ Difficult to calculate with many raters - computers fix this
◦ Absolute agreement is an unforgiving standard, particularly with many categories (e.g., a 0-100 scale; ratings of 15 and 16 are close but not the same) - can use the adjacent agreement approach
◦ Does not take chance agreement into account - can use the adjusted agreement approach

Two major ways of estimating inter-rater reliability - with categories? (3) - with continuous data?

◦ With categories: absolute agreement (exact matching); adjacent agreement (close matching); adjusted agreement (corrected for chance)
◦ With continuous data: intra-class correlations
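A minimal numeric sketch of the first two categorical indices above; the two rating vectors are invented:

```python
import numpy as np

# Two raters scoring the same 8 targets on a 1-5 scale (made-up data)
rater_1 = np.array([3, 4, 2, 5, 1, 4, 3, 2])
rater_2 = np.array([3, 5, 2, 5, 2, 2, 3, 2])

absolute = np.mean(rater_1 == rater_2)              # exact matches
adjacent = np.mean(np.abs(rater_1 - rater_2) <= 1)  # within one point
print(absolute, adjacent)  # 0.625, 0.875
```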

5 Issues in test-criterion studies (issues with criterion validity)

◦ Criterion selection ---> which is the "right" criterion? How good does a criterion have to be (e.g., consider reliability)? (e.g., what should IQ predict?) hard to know which criteria you should select
◦ Criterion contamination ---> when the predictors and criteria are dependent (e.g., psychopathy predicting aggression). Also called predictor-criterion overlap (e.g., psychopathy predicts violence, but this may be a circular argument)
◦ Incremental validity ---> does the test improve prediction over other tests we already have? Does it add anything? Do we really need it? (you have to show that your measure is necessary and makes an incremental contribution)
◦ Is the test sensitive & specific? Sensitivity: test's ability to detect the presence of a condition (high sensitivity = few false negatives). Lobster-garbage example: a sensitive net would capture all the lobsters on the floor but also catch a bunch of junk. Specificity: test's ability to detect the absence of a condition (high specificity = few false positives) - the specific net would only catch lobster and no junk, but you'd miss half the lobsters
◦ Are the findings generalizable? Across settings (generalizability)? Across cultures (cultural validity)? - different cultures may have profound effects, e.g., depression - vegetative symptoms and low positive emotions (misery symptoms) etc. in Western cultures, but in Eastern countries sometimes only vegetative symptoms (without the misery symptoms) - so a depression battery may not have cultural validity there. Across time (temporal validity)? - across time things change, e.g., a masculinity measure used to infer whether someone is gay lost its temporal validity

how does Differential Item Functioning (DIF) examine whether two groups actually have the same ability and are just performing worse because of content bias?

◦ DIF takes two groups (e.g., men, women) with equal scores on a test (500 verbal ability) and determines whether any of the verbal items differs significantly - you can go through each item and see if there's a difference in performance; if there's an item where all men do better than women, that's strange, because they all have the same overall ability
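A toy illustration of the idea, assuming the two groups have already been matched on total score; all responses are invented, and real DIF analyses use formal methods such as Mantel-Haenszel or IRT-based comparisons:

```python
import numpy as np

# Item responses (1 = correct) for one item, from two groups that
# were matched on total test score (hypothetical data)
men_matched   = np.array([1, 1, 0, 1, 1, 1, 0, 1])
women_matched = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# A large gap in pass rates despite matched ability flags the item
print(men_matched.mean(), women_matched.mean())  # 0.75 vs 0.375
```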

Major question - why do these groups not show the same average scores on psychological tests? - what are the hypotheses we will look at?

◦ Differences in genes? ◦ Differences in environment? ◦ Gene-environment interactions and correlations... ◦ Cultural differences? ◦ Systematic test bias?

refers to manipulating an independent variable in a study (e.g., assigning smokers to an exercise cessation program) instead of just observing an association without conducting any intervention (examining the relationship between exercise and smoking behavior)

◦ Experimental manipulation

may influence the outcome of studies that occur over a period of time. Examples of these events might include a change in political leader or natural disaster

◦ Historical events

what would Medium internal consistency reliability look like?

◦ I have anger problems ◦ I often feel self-conscious ◦ I cry more than most people ◦ I have a difficult time calming down when I'm upset ◦ I worry a lot ◦ I am able to control my emotions (reversed) - these are all for neuroticism - the general tendency toward feeling negative emotions

what would a construct look like that has low internal consistency reliability?

◦ I really like chocolate ◦ I am afraid of spiders ◦ Justin Bieber would be a decent uncle ◦ I would like to be a plumber *these aren't related to each other, so a score of e.g. 3/4 wouldn't tell you anything - not consistent with a particular construct!!

3 Factors That Improve External Validity

◦ Psychological realism: refers to making sure that participants are experiencing the events of a study as a real event, which can be achieved by telling them a "cover story" about the aim of the study. Otherwise, participants might behave differently than they would in real life if they know what to expect or know what the aim of the study is - not necessarily deceptive, but we want to get them in the mindset of participating the way they would in the real world
◦ Replication: refers to conducting the study again with different samples or in different settings to see if you get the same results
◦ Calibration: refers to using statistical methods to adjust for problems related to external validity. For example, if a study had uneven groups for some characteristic (such as age), reweighting might be used - e.g., weighting the population by differences

of participants refers to choosing your participants at random or in a manner in which they are representative of the population that you wish to study.

◦ Random selection

what are 5 factors that IMPROVE internal validity?

◦ Random selection of participants: refers to choosing your participants at random or in a manner in which they are representative of the population that you wish to study
◦ Randomization: refers to randomly assigning participants to treatment and control groups, and ensures that there is not any systematic bias between groups (e.g., if you assign the first people who signed up to one group, they might be the keen ones)
◦ Blinding: in a study refers to participants being unaware of their intervention
◦ Double-blinding: refers to researchers also being unaware of the intervention being administered
◦ Experimental manipulation: refers to manipulating an independent variable in a study (e.g., assigning smokers to an exercise cessation program) instead of just observing an association without conducting any intervention (examining the relationship between exercise and smoking behavior) - without experimental manipulation you're looking at correlations
◦ Standardized protocol: refers to following specific procedures for the administration of a treatment so as not to introduce any confounds

refers to randomly assigning participants to treatment and control groups, and ensures that there is not any systematic bias between groups.

◦ Randomization

what are 4 bad things that measurement error does?

◦ Reduces the usefulness of measurement ◦ Reduces our ability to generalize test results ◦ Reduces the confidence we have in test results ◦ Attenuates correlations* - reduces correlations

