exam 3 psych test and measurements
Convergent evidence of validity
Scores converge on similar tests measuring similar things
When conducting item analysis, testing professionals may examine item-total correlations. What are those? When should you examine them? What might suggest you should remove an item?
What are those? Item-total correlations tell you how well an item relates to the average of the other items in a scale. When should you examine them? When a scale is homogeneous (i.e., measures one underlying construct). Items in a homogeneous scale should have high item-total correlations because they should be measuring the same thing. What might suggest you should remove an item? You may want to remove an item in a homogeneous scale with a low item-total correlation.
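As a rough illustration of the definition above, the sketch below correlates each item with the average of the other items. The response data, the `pearson` helper, and the `item_total_correlations` function are all made up for this example; in practice a statistics package would report these values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def item_total_correlations(responses):
    """responses: one list per respondent, one score per item.
    Returns each item's correlation with the mean of the OTHER items."""
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [(sum(row) - row[i]) / (n_items - 1) for row in responses]
        results.append(pearson(item, rest))
    return results

# Hypothetical data: 6 respondents, 4 items rated 1-5.
# Items 1-3 move together; Item 4 does not track them.
responses = [
    [5, 5, 4, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [1, 2, 1, 3],
    [3, 3, 3, 1],
    [4, 5, 4, 5],
]

correlations = item_total_correlations(responses)
for number, r in enumerate(correlations, start=1):
    print(f"Item {number}: item-total r = {r:.2f}")
```

In this made-up data set, Item 4 has an item-total correlation near zero, which is the pattern that would make it a candidate for removal from a homogeneous scale.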
What does it mean for test content to be representative?
Content is covered in a reasonable proportion. The most important content is covered more than the least important content. If everything is equally important, questions should cover content evenly (e.g., 5 questions about item analysis, 5 questions about content validity, 5 questions about criterion-related validity).
Validity is about? A. The consistency of scores on alternate forms of a test B. The similarity of individuals' performance on items within a test C. The extent to which graders agree on an individual's test score D. The quality of inferences individuals draw from a test score
D
correlations between scores on different traits measured with different methods A. Monotrait-heteromethod B. Heterotrait-monomethod C. Monotrait-monomethod D. Heterotrait-heteromethod
D
Ethics regarding assessments are... A. primarily prepared by PAR (Psychological Assessment Resources) B. a set of values foundational to psychological testing C. requirements that need to be met to get licensed in psychological testing D. laws established by governing bodies in psychological testing
B
tell you the relationship between scores on an item and the average of scores on the other items in a scale. (e.g., The correlation between scores on Item 1 and average scores on the other items is .86.) This provides a bigger picture overview.
Item-total correlations
Cut scores tend to be _____
arbitrary
What are some ways test developers define the testing universe?
literature reviews, interview experts, survey experts, etc. *To effectively define the testing universe, you must be an expert on the construct you want to measure.
Suppose a class is known for being extremely difficult, and the distribution is positively skewed. The average grade in the class is 65 with a standard deviation of 5 points. At least what percentage of grades in the class are between 50 and 80? A. Approximately 75% B. Approximately 89% C. Approximately 68% D. Approximately 99%
B
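The "at least" wording and the skewed distribution suggest this question relies on Chebyshev's theorem (at least 1 - 1/k^2 of scores fall within k standard deviations of the mean, for any distribution shape) rather than the empirical rule. A minimal sketch of that arithmetic, with a helper name invented for this example:

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Minimum proportion of scores in [low, high] by Chebyshev's theorem.
    Assumes the interval is symmetric about the mean and k > 1."""
    k = (high - mean) / sd
    if abs((mean - low) / sd - k) > 1e-9 or k <= 1:
        raise ValueError("interval must be symmetric about the mean with k > 1")
    return 1 - 1 / k**2

# Values from the question: mean 65, SD 5, range 50 to 80 (k = 3).
bound = chebyshev_lower_bound(mean=65, sd=5, low=50, high=80)
print(f"At least {bound:.1%} of grades fall between 50 and 80")
```

With k = 3 the bound is 1 - 1/9, or roughly 89%.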
The test differentially predicts performance for minority groups compared to Whites. The same score for a White person and a minority person predicts different levels of performance. A. testing bias B. differential prediction C. within-group norming D. obsolete tests
B
Which of the following is an example of differential prediction? A. Black individuals who score lower than White individuals on the SAT are expected to perform the same in college B. females score lower than males on math tests on average C. people who come from higher socioeconomic backgrounds have more resources than people who come from lower socioeconomic backgrounds D. there are National Merit scholars and Hispanic National Merit scholars
A
Based on the Ethical Principles of Psychologists and Code of Conduct outlined by the American Psychological Association (APA), which of the following is NOT an ethical principle involving assessments? A. Psychologists should obtain informed consent for routine educational assessment B. Psychologists should not promote testing by people who are unqualified C. Psychologists should use assessments for purposes supported by evidence D. Psychologists should use appropriate procedures when designing a test
A
Test bias occurs when... A. poor content validity primarily harms certain groups B. tests are graded differently for different groups C. test developers include disrespectful language against certain groups D. test developers try to prevent minorities from performing well on a test
A
How can testing bias be mitigated? A. It can't B. Allow minorities to take the test twice C. Create a diverse panel to look over the test again D. Use within-group norming
C
How would you measure test-retest reliability? A. Calculate the standard error of measurement B. Use a t-test to examine the mean difference between scores on two testing occasions C. Correlate individuals' scores on two occasions of a test D. Calculate Cronbach's alpha
C
What is a criterion? A. A confounding variable B. A moderating variable C. An outcome variable D. An independent variable
C
What types of correlations provide discriminant evidence of validity?
heterotrait-monomethod; heterotrait-heteromethod
What types of correlations provide convergent evidence of validity?
monotrait-monomethod; monotrait-heteromethod
What is a common problem in criterion-related validity studies?
range restriction - we don't have data on people who are low on the predictor; this attenuates (lessens) the validity coefficient
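To see why range restriction attenuates a validity coefficient, the sketch below compares the correlation in a full applicant pool with the correlation among only the high scorers. The data are fabricated for illustration (a predictor from 1 to 20 and a criterion equal to the predictor plus a fixed repeating error pattern), and the cutoff of 16 is an arbitrary assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical full applicant pool.
noise = [4, -4, 2, -2, 0]  # fixed error pattern so the example is repeatable
predictor = list(range(1, 21))
criterion = [x + noise[(x - 1) % 5] for x in predictor]

full_r = pearson(predictor, criterion)

# Suppose only applicants scoring 16 or higher were selected,
# so criterion data exist only for them.
selected = [(x, y) for x, y in zip(predictor, criterion) if x >= 16]
restricted_r = pearson([x for x, _ in selected], [y for _, y in selected])

print(f"Validity coefficient, full range:       {full_r:.2f}")
print(f"Validity coefficient, restricted range: {restricted_r:.2f}")
```

The error in the criterion stays the same size, but the predictor varies much less in the selected group, so the observed correlation shrinks.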
If a test question has a content validity ratio of .9 ... A. it is a good test question; experts agree that it is important and relevant B. it is a bad test question; test takers do not respond similarly to the question C. it is a bad test question; experts do not think it is important and relevant D. it is a good test question; test takers respond similarly to the question
A
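The cards do not give a formula for the content validity ratio. Assuming the course uses Lawshe's version, CVR = (n_e - N/2) / (N/2), where n_e is the number of experts rating the item "essential" and N is the total number of experts, a value of .9 could arise as in this sketch (the panel of 20 experts is hypothetical):

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: ranges from -1 (no expert says essential)
    to +1 (every expert says essential)."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 19 of 20 experts rate the item as essential.
cvr = content_validity_ratio(n_essential=19, n_experts=20)
print(f"CVR = {cvr:.2f}")
```

A CVR of 0 means exactly half of the experts rated the item essential, so a value near 1 reflects strong expert agreement that the item belongs on the test.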
_____ is when we have a question or multiple questions with low content validity. The content isn't relevant, and people from certain demographic groups tend to be more familiar with it than others. A. testing bias B. differential prediction C. within-group norming D. obsolete tests
A
Sources of validity evidence include all of the following EXCEPT A. Evidence based on inter-rater agreement B. Evidence based on test content C. Evidence based on relations with other variables D. Evidence based on response processes
A *Inter-rater agreement is evidence of reliability, not validity.
Assume a distribution of scores is negatively skewed. What are the mean, median, and mode in order of smallest to largest? A. Median, mean, mode B. Mean, median, mode C. Median, mode, mean D. Mode, median, mean
B
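A quick check of that ordering with a small, made-up set of negatively skewed scores (most scores are high, with a tail of low scores pulling the mean down):

```python
from statistics import mean, median, mode

# Hypothetical negatively skewed scores: a long tail toward the low end.
scores = [2, 6, 7, 8, 8, 9, 9, 9]

print(f"mean = {mean(scores)}, median = {median(scores)}, mode = {mode(scores)}")
```

Here the mean is 7.25, the median is 8, and the mode is 9, so the order from smallest to largest is mean, median, mode.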
Cut scores are typically... A. difficult to understand B. problematic because measurement is not perfectly reliable C. especially problematic for individuals who score furthest from the cut score D. a great way to differentiate individuals' performance
B
Measure individual attitudes and experiences toward honesty, dependability, trustworthiness, reliability, and prosocial behavior
Integrity tests
What are some controversies relevant to psychological testing?
cut scores, low-quality inferences and decisions, testing fairness (and how to handle it)
What are the 4 steps to establish content evidence of validity before a test is developed?
1. Define the testing universe 2. Develop test specifications 3. Establish a test format 4. Construct test questions Every step should consider the previous step(s). For example, the test specifications should be based on the testing universe.
What are 3 ways to examine content evidence of validity after a test is developed? Which of the 3 is the least informative, and why?
1. Experts review and rate how relevant each test item is to the underlying construct(s). 2. Experts match each test item to the construct it seems to be measuring. They should recreate the test developer's test specifications. 3. Ask test-takers the relevance of each test item (i.e., face validity). Face validity is not strong evidence of validity. Test-takers may not have the expertise to recognize whether an item/question is relevant. However, high face validity is valuable to gain test-takers' approval of a test.
SPSS can be used to do all of the following EXCEPT A. Help inform decisions about which scale items to keep or remove B. Display the distribution of scores on a scale C. Help write good scale items D. Provide Cronbach's alpha of a scale
C
What is within-group norming? A. poor content validity primarily harms certain groups B. scores are forced to fit a normal distribution C. scoring depends on a test taker's group membership D. poor reliability primarily harms certain groups
C
When developing a test, what are the recommended steps to improve the content validity? A. 1) establish a test format 2) develop test specifications 3) define the testing universe 4) construct test questions B. 1) develop test specifications 2) define the testing universe 3) establish a test format 4) construct test questions C. 1) define the testing universe 2) develop test specifications 3) establish a test format 4) construct test questions D. 1) develop test specifications 2) establish a test format 3) define the testing universe 4) construct test questions
C
exam performance is what type of criterion? objective subjective
objective
medical errors is what type of criterion? objective subjective
objective
number of dogs adopted is what type of criterion? objective subjective
objective
What are the two types of criterion-related validity studies?
predictive validity study - the criterion is measured after the predictor is measured; concurrent validity study - the "predictor" and the criterion are measured at the same time
the correlation between scores on the predictor and the criterion (e.g., the correlation between performance on this jeopardy game and scores on Exam 3).
validity coefficient
What causes testing bias?
-the test is biased against minorities -minorities don't have access to the same resources and opportunities as Whites (most likely) -genetics (highly doubtful)
test scores are unrelated to scores on tests measuring dissimilar constructs
Discriminant evidence of validity
What is the result of testing bias?
Poor evidence of content validity
the body of knowledge or behaviors that a test represents
Testing universe
Rank the correlations you would find in a MTMM correlation matrix in order of what you would expect/hope to be smallest to largest.
1. heterotrait-heteromethod correlations 2. heterotrait-monomethod correlations 3. monotrait-heteromethod correlations 4. monotrait-monomethod correlations
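The four labels in a multitrait-multimethod (MTMM) matrix follow mechanically from whether two measures share a trait and whether they share a method. A small sketch, with trait and method names invented for illustration:

```python
def mtmm_type(measure_a, measure_b):
    """Each measure is a (trait, method) pair."""
    trait = "monotrait" if measure_a[0] == measure_b[0] else "heterotrait"
    method = "monomethod" if measure_a[1] == measure_b[1] else "heteromethod"
    return f"{trait}-{method}"

# Expected/hoped-for order, smallest to largest correlation.
EXPECTED_ORDER = [
    "heterotrait-heteromethod",  # different traits, different methods
    "heterotrait-monomethod",    # different traits, shared method
    "monotrait-heteromethod",    # same trait, different methods (convergent)
    "monotrait-monomethod",      # same trait, same method (reliability)
]

# Hypothetical measures.
pairs = [
    (("aggression", "self-report"), ("intelligence", "peer rating")),
    (("aggression", "self-report"), ("intelligence", "self-report")),
    (("aggression", "self-report"), ("aggression", "peer rating")),
    (("aggression", "self-report"), ("aggression", "self-report")),
]
labels = [mtmm_type(a, b) for a, b in pairs]
for (a, b), label in zip(pairs, labels):
    print(f"{a} vs {b}: {label}")
```

The first two labels supply discriminant evidence and the last two supply convergent evidence.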
When conducting item analysis, testing professionals may examine Cronbach's alpha with item removed. What is that? When should it be examined? What might suggest you should remove an item?
What is that? Cronbach's alpha with item removed tells you the internal consistency reliability of a scale if an item is removed. When should it be examined? When a scale is homogeneous (i.e., the items measure one underlying construct). What might suggest you should remove an item? You may want to remove an item if Cronbach's alpha with item removed is high (well above .70). If Cronbach's alpha with an item removed is high, the scale would still have adequate internal consistency reliability without the item. Sometimes removing an item would increase Cronbach's alpha of the scale, which is a red flag for considering removal.
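A minimal sketch of how "alpha with item removed" can be computed, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response data and function names are hypothetical; statistics software normally produces this table directly.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one list per respondent, one score per item."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_item_removed(responses, index):
    """Cronbach's alpha recomputed after dropping the item at `index`."""
    reduced = [row[:index] + row[index + 1:] for row in responses]
    return cronbach_alpha(reduced)

# Hypothetical data: 6 respondents, 4 items rated 1-5.
responses = [
    [5, 5, 4, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [1, 2, 1, 3],
    [3, 3, 3, 1],
    [4, 5, 4, 5],
]

overall = cronbach_alpha(responses)
print(f"Alpha, all items: {overall:.2f}")
for i in range(len(responses[0])):
    print(f"Alpha with Item {i + 1} removed: {alpha_if_item_removed(responses, i):.2f}")
```

In this made-up example, alpha rises from about .75 to about .95 when Item 4 is dropped, which is the red flag described above.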
How do you interpret a validity coefficient?
You interpret it just like any other correlation. It is a value between -1 and 1, and both the direction and magnitude of the validity coefficient should be interpreted.
An attribute, trait, or characteristic that is not directly observable but can be inferred by looking at observable behaviors
construct *examples: Aggression, Intelligence, Dog Lover, Environmental Activism
An outcome we expect is associated with test scores. For example, your performance on this jeopardy game may be associated with your Exam 3 score. Your Exam 3 score would be the _____
criterion
patient satisfaction is what type of criterion? objective subjective
subjective
supervisor rating of job performance is what type of criterion? objective subjective
subjective
Suppose you know the answers to all of the questions in this jeopardy game, and you conclude you are a genius. Thoroughly evaluate relevant evidence, and explain the quality of the inference.
Inference: I am a genius. Evidence based on test content: The content does not representatively capture content relevant to being a genius (it only captures content relevant to psych tests and measurement). It leaves out a lot of things that are important (e.g., verbal reasoning, spatial intelligence, problem-solving). It measures things that are irrelevant (specific knowledge about tests and measurement). Evidence based on relations with criteria: There is no evidence to suggest that performance on this jeopardy game is associated with genius outcomes (e.g., winning a Nobel Peace Prize, being deemed an expert in your field). Evidence based on relations with constructs: There is no evidence to suggest that your performance on this jeopardy game is associated with your performance on an IQ test or another measure of genius-ness. There is no evidence to suggest you are a genius. (You might be, but your score on this jeopardy game is irrelevant.)
tell you the relationship between scores on an item and scores on every other item in a scale. (e.g., The correlation between scores on Item 1 and Item 2 is .86. The correlation between scores on Item 1 and Item 3 is .92. The correlation between scores on Item 1 and Item 4 is .67.) This provides detailed information at the item level.
Inter-item correlations
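For a concrete picture of this item-level detail, the sketch below computes the correlation for every pair of items. The data and the `pearson` helper are hypothetical and only meant to show the shape of the output.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 6 respondents (rows), 4 items (columns) rated 1-5.
responses = [
    [5, 5, 4, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [1, 2, 1, 3],
    [3, 3, 3, 1],
    [4, 5, 4, 5],
]

items = list(zip(*responses))  # one tuple of scores per item
inter_item = {
    (a + 1, b + 1): pearson(items[a], items[b])
    for a, b in combinations(range(len(items)), 2)
}
for (a, b), r in inter_item.items():
    print(f"Item {a} with Item {b}: r = {r:.2f}")
```

With four items there are six pairs. In this made-up data, Items 1 and 2 correlate strongly, while Item 1 and Item 4 are close to uncorrelated.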
What 3 questions are relevant to content validity?
Is the test content representative? Does it leave out anything important? Does it measure anything irrelevant?
statistics that are used to evaluate the quality of items in a scale or questions in a test (e.g., Cronbach's alpha with item removed, item-total correlations, inter-item correlations)
Item analysis
What is one way to examine content validity after a test is developed? A. ask experts to match every test question to the content area that is covered B. ask test takers if their test score relates to their test scores in other classes C. ask test takers to take an alternate form of the test D. use individuals' test scores to predict their future success in a related field
A
Suppose a university is selecting applicants based on their SAT score. If the validity coefficient is .9 ... A. they are likely to have more false negatives than true negatives B. they are likely to have more false negatives than false positives C. they are likely to have more false positives than false negatives D. they are likely to have more true positives than false positives
D
Suppose you are developing a selection test for a company (i.e., a test that will help the company make hiring decisions). Which of the following could you do to help define the testing universe? A. Conduct a predictive validity study B. Conduct an internal consistency analysis C. Conduct a concurrent validity study D. Conduct a job analysis
D
Which of the following is a common problem when examining predictive validity? A. the observed predictor-criterion relationship is artificially high B. the validity coefficient is difficult to estimate C. the predictor is measured before the criterion D. the full range of data is unavailable
D
correlations between scores on the same trait measured with the same method (i.e., reliability) A. Heterotrait-monomethod B. Heterotrait-heteromethod C. Monotrait-heteromethod D. Monotrait-monomethod
D
What is item difficulty? How do you interpret it? What is the ideal average item difficulty of a test?
Item difficulty is the percentage of test-takers who answer a test question correctly (converted to a decimal; e.g., If 90% of test-takers get an item correct, the item difficulty is .90). Values above .9 may indicate an item is too easy (i.e., a large percentage of test-takers are getting it correct). Values below .2 may indicate an item is too hard (i.e., a small percentage of test-takers are getting it correct). Items that are too easy or too hard do not help to distinguish between test-takers because everyone tends to answer similarly. You want diversity in item difficulty, with most items having an item difficulty between .2 and .9. Ideally, the average item difficulty is around .5 (50% of test-takers answer correctly) to optimally distinguish between test-takers.
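The calculation and the rule-of-thumb cutoffs above can be sketched as follows. The answer data and function names are invented for this example; 1 means correct and 0 means incorrect.

```python
def item_difficulty(item_scores):
    """Proportion of test-takers who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

def flag(p):
    """Apply the cutoffs from the card: above .9 is easy, below .2 is hard."""
    if p > 0.9:
        return "possibly too easy"
    if p < 0.2:
        return "possibly too hard"
    return "acceptable"

# Hypothetical results for 10 test-takers on 3 questions.
answers = {
    "Q1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Q2": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "Q3": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
}

difficulties = {q: item_difficulty(scores) for q, scores in answers.items()}
for q, p in difficulties.items():
    print(f"{q}: difficulty = {p:.2f} ({flag(p)})")

average = sum(difficulties.values()) / len(difficulties)
print(f"Average item difficulty = {average:.2f}")
```

Note that a higher item difficulty value means an easier item, since the value is the proportion answering correctly.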
Suppose a class of students took a math test. Male students got higher scores than female students on average. Does that mean the math test was biased? Yes No
No
Compare the advantages and disadvantages of objective criteria versus subjective criteria. Consider at least 4 differences.
Objective measures: Less subject to rater biases; Prone to recording errors; Deficient (Miss important performance aspects); Contaminated (Reflect factors outside of individuals' control) Subjective measures: Prone to rater biases; Not very prone to recording errors; Capture a broader criterion domain (less deficient); Can be contaminated by rater biases but may be less contaminated by other factors
a statistical program that can run statistics, such as item analyses, on large amounts of data
SPSS
Which of the following is evidence related to criterion-related validity? A. ask experts to match test questions to the topics they measure B. scores on the SAT are moderately, positively associated with college GPA C. test takers believe that a test measures what it is supposed to measure D. when test takers retake a test, their scores at time 1 are strongly, positively associated with their scores at time 2
B
Which of the following questions is NOT relevant to content validity? A. Do the test questions measure anything irrelevant? B. Do scores on the test predict meaningful outcomes? C. Is the test content representative? D. Does the test fail to assess any important concepts?
B
correlations between scores on the same trait measured with different methods A. Monotrait-heteromethod B. Heterotrait-monomethod C. Monotrait-monomethod D. Heterotrait-heteromethod
A
What are the 11 ethical principles regarding assessments?
1. Bases for assessments 2. Use of assessments 3. Informed consent in assessments 4. Release of test data 5. Test construction 6. Interpreting assessment results 7. Assessment by unqualified persons 8. Obsolete tests and outdated test results 9. Test scoring and interpretation services 10. Explaining assessment results 11. Maintaining test security
A clinical psychologist who uses psychotherapy assessments (e.g., Rorschach) to promote employees in an organization is violating what ethical principle? A. Obsolete Tests and Outdated Results B. Maintaining Test Security C. Use of Assessments D. Test Construction
C
An assessment psychologist who made a test without defining the testing universe and outlining test specifications violated what ethical principle? A. Ensuring Testing Fairness B. Test Scoring and Interpretation Services C. Test Construction D. Obsolete Tests and Outdated Test Results
C
For a good survey item, the item-total correlation is... A. High if the survey is standardized B. Low if the survey is homogeneous C. High if the survey is homogeneous D. Low if the survey is standardized
C
When your score on a test is compared only to those in your demographic group (Hispanics with Hispanics, Whites with Whites, etc.) A. testing bias B. differential prediction C. within-group norming D. obsolete tests
C
correlations between scores on different traits measured with the same method A. Monotrait-monomethod B. Monotrait-heteromethod C. Heterotrait-monomethod D. Heterotrait-heteromethod
C
Which of the following would be an appropriate example of a construct being measured by the dog lover scale? A. level of love for dogs B. amount of time spent with dogs C. love for dogs D. level of knowledge about dogs
C
3 questions asked by _____ - Are test scores strongly, positively associated with scores on similar constructs? - Are test scores strongly, positively associated with scores on the same construct (e.g., reliability)? - Are test scores unrelated to scores on dissimilar constructs?
Construct validity
test scores are strongly, positively associated with scores on tests measuring similar constructs
Convergent evidence of validity
If Cronbach's alpha of a scale is .60 A. The test is too easy B. The test is too difficult C. You may want to remove items for redundancy D. The reliability of the scale is inadequate (i.e., not acceptable)
D
Which of the following is NOT an example of validity evidence? A. Scores on this quiz are consistent with scores on a parallel form of this quiz (alternate-forms reliability, not validity) B. Scores on this quiz are associated with the amount of time students spent preparing C. While you are taking this quiz, you are thinking about content from the last lecture and reading assignment D. This quiz has 5 questions on Monday's material and 5 on Wednesday's material
A
How a percentage on a test determines what grade you receive and whether it is passing or failing. Ex.: An 80 on a test will get you a "B," or 225-250 pts in a class will get you an "A" for the semester. These are examples of? A. grading on a curve B. cut scores C. percentile score D. grading up or grading down
B
If you score a 5 out of 5 on this quiz, you may conclude that you are a genius. That would be a(n)... A. Rational inference B. Irrational inference C. Rational decision D. Irrational decision
B
The textbook tells the story of Michael who was administered an intelligence test by his school. The school determined he was "retarded," and they moved him into a special education class. When his parents asked about his intelligence score, the principal said it was better to leave such matters to school authorities. Which ethical principle of assessments was violated? A. Test Construction B. Explaining Assessment Results C. Maintaining Test Security D. Obsolete Tests and Outdated Test Results
B
Which of the following is a subjective criterion? A. dollar amount of sales B. supervisor ratings of performance C. time to produce widgets D. number of absences
B
