Psych Assess - Exam III
Domain Sampling
(1) A sample of behaviors from all possible behaviors that could be indicative of a particular construct; (2) a sample of test items from all possible items that could be used to measure a particular construct. Items must represent the construct (e.g., romantic satisfaction)
Sources of Error
Test construction (use domain sampling); test administration (uneven tables, loud noise, uncomfortable temperature); test scoring and interpretation
Validity Coefficient
A correlation coefficient that provides a measure of the relationship between test scores and scores on a criterion measure. Rarely greater than .50
Kuder-Richardson formula 20
A formula for estimating the internal-consistency reliability of a test whose items are scored dichotomously (right/wrong); equivalent to Cronbach's alpha for such items
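A minimal sketch of how KR-20 could be computed from a small matrix of dichotomous (0/1) item responses; the data, variable names, and item counts below are invented for illustration, not taken from the course.

import numpy as np

# hypothetical 0/1 item responses: 5 examinees x 4 items
scores = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 1],
                   [1, 1, 1, 1],
                   [0, 0, 0, 1],
                   [1, 1, 1, 0]])

k = scores.shape[1]                         # number of items
p = scores.mean(axis=0)                     # proportion passing each item
q = 1 - p
total_var = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))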
Split-half reliability
A measure of reliability in which a test is split into two parts and an individual's scores on both halves are compared. Because each half has fewer items, the uncorrected half-test correlation underestimates internal consistency; the Spearman-Brown formula corrects for this
Internal Consistency
A measure of reliability; the degree to which the items measure the same construct
Kappa statistic
A measure of the degree of nonrandom agreement between observers of the same categorical variable.
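A hedged sketch of Cohen's kappa for two raters assigning categorical diagnoses; the ratings and category labels are made up, and percent agreement (see that entry below) falls out as a by-product.

from collections import Counter

# hypothetical diagnoses from two raters for the same 8 patients
rater1 = ["MDD", "GAD", "MDD", "MDD", "GAD", "PTSD", "MDD", "GAD"]
rater2 = ["MDD", "GAD", "GAD", "MDD", "GAD", "PTSD", "MDD", "MDD"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # percent agreement

# chance agreement: product of each rater's marginal proportions, summed over categories
c1, c2 = Counter(rater1), Counter(rater2)
expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

kappa = (observed - expected) / (1 - expected)   # agreement beyond chance
print(round(observed, 2), round(kappa, 2))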
coefficient of stability
An estimate of test-retest reliability obtained during time intervals of six months or longer
Standard Error of Measurement
An index of how much error there is in observed scores; the amount of "rubber" in our tape measure. Used to compute confidence intervals: a 95% interval means we are 95% sure the true score falls within that interval. E.g., gifted or special-ed placement: create a cutoff score with a confidence interval to judge how sure we can be that the person meets the cutoff
Cronbach's alpha
An indicator of internal consistency reliability assessed by examining the average correlation of each item (question) in a measure with every other question.
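A minimal sketch of Cronbach's alpha computed from item and total-score variances; the Likert responses below are hypothetical.

import numpy as np

# hypothetical Likert responses: 6 people x 4 items of the same scale
items = np.array([[4, 5, 4, 5],
                  [2, 2, 3, 2],
                  [3, 3, 3, 4],
                  [5, 4, 5, 5],
                  [1, 2, 1, 2],
                  [3, 4, 3, 3]])

k = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()   # sum of the item variances
total_var = items.sum(axis=1).var(ddof=1)         # variance of the total scores
alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 3))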
Who gets surveyed?
Not only highly educated people - only about 33% of Americans are college grads - use simple, direct language - avoid colloquialisms, even if all your friends know them
If everyone gives the same answer, it is most likely not a useful question
Can you conceive of a test item on a rating scale requiring human judgment that all raters will score the same 100% of the time?
What is your construct of interest?
Can you define it? Who is using this info? Is there a specific context? (e.g., "Tends to find fault with others AT WORK")
Random Error
Caused by unpredictable fluctuations of other variables in the measurement process (i.e., noise)
Generalizability theory
Compatible with True Score Theory; considers sources of error beyond the test itself (e.g., raters, occasions, settings)
Utility
Considers how tests and assessments benefit society. Usually quantitative
Ways of considering utility
Cost efficiency; time savings; cost-benefit ratio; clinical utility (some tests are more efficient); selection utility (can help choose the best people)
Item Response Theory
Different from True Score Theory. Focus: item difficulty, the proportion answering an item correctly, which ranges from 0 (no one correct) to 1 (everyone correct). Allows adaptive testing: questions adjust depending on how well previous questions were answered, so less time is spent testing. Requires a large pool of items
Parallel forms
Different versions of a test used to assess test reliability; the change of forms reduces the effects of direct practice, memory, or the desire of an individual to appear consistent on the same items. Psychometrically identical - this has been empirically tested: same means and variances, and r must be at least .80. Easier to develop for intellectual ability than for personality
How to improve reliability
Eliminate poor items; add new items; use factor analysis
True Score Theory
Every observed score reflects a true level of knowledge or of the trait plus some error: X = T + E (Observed Score = True Score + Error)
Assessing test-retest reliability; minimizes cheating; having a parallel form just in case someone gets sick; minimizes practice effects
From the perspective of the test user, what are other possible advantages of having alternate or parallel forms of the same test?
Audio-recording method: One clinician interviews a patient and assigns diagnoses. Then a second clinician, who does not know what diagnoses were assigned, listens to an audio recording of the interview and independently assigns diagnoses - clinicians may not ask about remaining symptoms of the disorder - only the interviewing clinician can follow up patient responses with further questions - even when semistructured interviews are used, it is possible that two highly trained clinicians might obtain different responses. The reliability of diagnoses is far lower than commonly believed
How is the reliability of diagnoses determined? Why does the method affect the estimated reliability? What does this mean?
Discrimination
How well an item distinguishes between those who score high on the construct and those who score low. Ranges from -1 to +1 (-1 = reflects the opposite of the trait); 0.8 = excellent, -0.2 to +0.2 = poor
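A minimal sketch of a classical discrimination index (proportion passing the item among the top scorers minus the bottom scorers); the responses and group sizes are invented.

import numpy as np

# hypothetical 0/1 responses to one item, plus each examinee's total test score
item  = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
total = np.array([48, 45, 44, 40, 39, 30, 28, 27, 22, 20])

order = np.argsort(total)[::-1]          # examinees ranked from highest to lowest total
top, bottom = order[:3], order[-3:]      # upper and lower scoring groups
discrimination = item[top].mean() - item[bottom].mean()
print(round(discrimination, 2))          # ranges from -1 to +1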
Recruits who may be unwilling or unable to shoot someone; increasing suicide rates
How would you describe the non-economic cost of a nation's armed forces using ineffective screening mechanisms to screen military recruits?
Spearman-Brown
In order to estimate the internal-consistency reliability of a test, one would use the __________ formula
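A hedged sketch of a split-half estimate (odd vs. even items) with the Spearman-Brown correction applied, as in the split-half entry above; the response matrix is made up.

import numpy as np

# hypothetical responses: 6 people x 6 items
items = np.array([[4, 3, 4, 4, 5, 4],
                  [2, 2, 1, 2, 2, 3],
                  [5, 4, 5, 5, 4, 5],
                  [3, 3, 3, 2, 3, 3],
                  [1, 2, 2, 1, 1, 2],
                  [4, 4, 5, 4, 4, 4]])

odd_half  = items[:, ::2].sum(axis=1)    # items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown: estimate full-length reliability from the half-test correlation
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))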
Why is the phrase valid test sometimes misleading?
Validity is context dependent: a test might be valid in some contexts but not in others
Face Validity
Measures whether a test looks like it tests what it is supposed to test. Does not actually indicate validity and does not help interpret the scores
Pre-post designs alone are not optimal because there is no way of telling whether the treatment (rather than something else) produced the change; having an experimental and a control group addresses this
Might it have been advisable to have simultaneous testing of a matched group of couples who did not participate in sex therapy and simultaneous testing of a matched group of couples who did not consult divorce attorneys? In both instances, would there have been any reason to expect any significant changes in the test scores of these 2 control groups?
Federale Express Status Quo
Must have a valid driver's license; no criminal record; 3 months probation
Decisions in patient evaluation are not easy: police officer assessment, suspected child abuse, whether someone is suicidal, guilty or innocent, or has cancer
Provide an example of another situation in which the stakes involving the utility of a tool of psychological assessment are high.
Top-Down Selection
Ranking applicants on the basis of their total scores and selecting from the top down until the desired number of candidates has been selected. Disadvantage: Disparate Impact
Interpreting Reliability
Research = .70 or better; screening = .80 or better; diagnosis = .90 or better. Screening is broad and is used to see whether a follow-up is needed; it is cheaper and is done with many people. Higher reliability means observed scores are closer to the true score and variability around it is narrow. Each observed score contains error; the more subjects we have, the closer the error in the average gets to 0 (as n goes toward infinity, the average error approaches 0)
Depression False Positive Consequences
Resources are taken away from those who need them; giving meds to those who don't need them; losing time; unnecessary labels
Things to Avoid
Responses with more than one meaning - items should be clear and unambiguous; double-barreled questions
Standard Error of the Difference Equation
SEdiff = SD x sqrt(2 - r1 - r2), where r1 and r2 are the reliability coefficients of the two tests. Used when comparing 2 scores
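A minimal sketch of using the standard error of the difference to judge whether two observed scores differ reliably; the SD, reliabilities, and scores are hypothetical.

import math

sd = 15               # standard deviation of the score scale (hypothetical)
r1, r2 = 0.90, 0.85   # reliability coefficients of the two tests (hypothetical)

se_diff = sd * math.sqrt(2 - r1 - r2)

score1, score2 = 112, 103
z = (score1 - score2) / se_diff
print(round(se_diff, 2), round(z, 2))   # |z| >= 1.96 -> difference unlikely to be chance at the .05 level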
SEM equation
SEM = SD x sqrt(1 - rxx), where rxx is the reliability coefficient
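A worked example of the SEM formula with hypothetical numbers from an IQ-style scale.

import math

sd = 15       # standard deviation of the test (hypothetical)
rxx = 0.91    # reliability coefficient (hypothetical)

sem = sd * math.sqrt(1 - rxx)
print(round(sem, 2))   # 15 x sqrt(0.09) = 4.5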
Historical sources may not always be accurate; test developers must decide what content is important
Test developers who publish history tests must maintain the content validity of their tests. What challenges do they face in doing so?
Parallel Form Reliability
The correlation coefficient determined by comparing the scores of the two similar measuring devices (or forms of the same test) administered to the same people at one time.
Reliability
The degree of consistency in measurement. A prerequisite to validity: necessary, but not sufficient, for validity
Construct Validity
The extent to which there is evidence that a test measures a particular hypothetical construct.
Sampling Error
Error that arises because only a sample of the population is polled; the more people interviewed, the smaller the sampling error and the more confident one can be of the results
Analogy of a rubber tape measure
The smaller the standard error of measurement, the less rubber there is in the tape measure, the more reliable the test, and the closer observed scores are to the true score
Content Sampling
The variety of the subject matter contained in the items
Broad Item Strategies
Think about verbs and behaviors related to the construct; reverse the polarity of some items to promote paying attention; consider emotions related to behaviors tied to the construct; think about time-related behaviors
Systematic Error
Error that is typically constant or proportionate to what is presumed to be the true value of the measured variable
Depression False Negative Consequences
Undermining the need for help Misdiagnosis
FERT
Utility: less likely to ruin reputations, narrows down choices, less harm to self-esteem. Face validity: applicants take the test more seriously. Content validity: how did people fail in the past? How can they succeed? The goal is criterion validity
Divergent Evidence
Variables that are not theoretically related should show low correlations - jealousy and social desirability - marital satisfaction: the DAS should NOT have strong correlations with career satisfaction or introversion/extraversion
Convergent Evidence
Variables that theoretically should correlate usually show high correlations - jealousy measured with self-report and peer report - marital satisfaction: the DAS should correlate well with overall problem-solving, life satisfaction, and lower anxiety
False Positive Rate
We say they will be successful, but they are not
False Negative Rate
We say they won't be successful, but they are
Hit Rate
What % of people do we correctly classify
Miss Rate
What % of people do we incorrectly classify
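A minimal sketch that tallies hits, misses, false positives, and false negatives from hypothetical selection decisions; the prediction and outcome vectors are invented.

# 1 = predicted/actual success, 0 = predicted/actual failure (hypothetical)
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
actual    = [1, 0, 0, 1, 1, 0, 1, 0]

n = len(predicted)
false_pos = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))  # said success, was not
false_neg = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))  # said failure, succeeded
hits = n - false_pos - false_neg                                       # correctly classified

print(f"hit rate {hits / n:.2f}, miss rate {(false_pos + false_neg) / n:.2f}")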
More items = more reliability (cost); reliability is a prerequisite for validity, so measurement must stay consistent (cost); children might lose attention (benefit); there might be some questions missing (benefit)
What are other situations in which a reduction in test size or the time it takes to administer a test might be desirable? What are the arguments against reducing test size?
Disparate Impact
a condition in which employment practices are seemingly neutral yet disproportionately exclude a protected group from employment opportunities
percent agreement
a measure of interrater reliability in which the percentage of times that the raters agree is computed
Criterion-Related Validity
a measure of validity based on showing a substantial correlation between test scores and scores on a criterion such as job performance. Measuring the proficiency of people who are already hired can create restriction of range - must strive for a large range. Incremental validity - does adding this measure improve prediction? (e.g., does adding the SAT score improve prediction beyond HS GPA alone?)
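A hedged sketch of checking incremental validity by comparing R-squared with and without the added predictor; the variable names and the eight data points are assumptions made up for illustration.

import numpy as np

# hypothetical data: does adding SAT improve prediction of college GPA beyond HS GPA?
hs_gpa  = np.array([3.1, 3.5, 2.8, 3.9, 3.2, 2.5, 3.7, 3.0])
sat     = np.array([1150, 1300, 1000, 1450, 1200, 950, 1350, 1100])
col_gpa = np.array([2.9, 3.4, 2.6, 3.8, 3.1, 2.3, 3.5, 3.0])

def r_squared(predictors, y):
    X = np.column_stack([np.ones(len(y)), predictors])   # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(hs_gpa, col_gpa)                          # HS GPA alone
r2_full = r_squared(np.column_stack([hs_gpa, sat]), col_gpa)  # HS GPA + SAT
print(f"incremental validity (gain in R^2): {r2_full - r2_base:.3f}")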
Average proportional distance (APD)
a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
test-retest reliability
a method for determining the reliability of a test by comparing a test taker's scores on the same test taken on separate occasions. Measures the correlation between the 2 administrations: the higher the correlation, the higher the reliability. Most appropriate for abilities and traits; not usually appropriate for states. Practice effects reduce it
Factor Analysis
a statistical procedure that identifies clusters of related items (called factors) on a test; used to identify different dimensions of performance that underlie a person's total score.
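A minimal sketch using scikit-learn's FactorAnalysis to look for clusters of related items; the two-factor choice and the random responses are purely illustrative assumptions (real use would start from actual questionnaire data).

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.normal(size=(100, 6))    # stand-in for 100 respondents x 6 items

fa = FactorAnalysis(n_components=2)      # ask for two underlying dimensions
fa.fit(responses)
print(np.round(fa.components_, 2))       # loadings: how strongly each item relates to each factor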
Error Variance
the portion of the differences in scores attributable to random or irrelevant sources rather than to the construct being measured (contrast with true variance)
Validity
how well a test measures what it is supposed to measure in a particular context (e.g., can personality tests predict salesmanship?)
What is the value of face validity from the perspective of the test-user?
e.g., the CBCL (Child Behavior Checklist) - widely used: internalizing (depressive symptoms) and externalizing (disruptive behavior) scales - empirically validated - versions for parents and teachers - parents balk at questions that aren't face valid; when a test is face valid, people take it more seriously
Methodological Error
interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.
Weighted Kappa
takes into account the consequence of disagreements between examiners, used when all disagreements are not equally consequential
Content Validity
the extent to which a test samples the behavior that is of interest. Abstract constructs can't be measured directly - e.g., openness to experience: "Do you like trying new foods?" How many items do we need? Content validity can be established by examining other instruments, interviewing experts, and reading about the construct
True Variance
the portion of the differences in scores that is (theoretically) real
Confidence Intervals
the range on either side of an estimate that is likely to contain the true value for the whole population. 90%: X (observed score) +/- 1.65(SEM); 95%: X +/- 1.96(SEM); 99%: X +/- 2.58(SEM)
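A minimal sketch that combines the SEM with the multipliers above to build confidence intervals around an observed score; all numbers are hypothetical.

import math

observed = 108
sd, rxx = 15, 0.91
sem = sd * math.sqrt(1 - rxx)            # 4.5 on this hypothetical scale

for label, z in [("90%", 1.65), ("95%", 1.96), ("99%", 2.58)]:
    print(label, round(observed - z * sem, 1), "to", round(observed + z * sem, 1))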
Alternate Forms
two or more different forms of a test designed to measure exactly the same skills or abilities, which use the same methods of testing, and which are of equal length and difficulty. In general, if test takers receive similar scores on _________ of a test, this suggests that the test is reliable. Intended to be equivalent, but (unlike parallel forms) their equivalence has not been empirically tested