Testing & Measurement - Chapters 4-7
Researchers often look for reliability of at least _____.
.9
The reliability of a difference score is expected to be _____. a. lower than the reliability of either test on which it is based b. higher than the reliability of one test and lower than the reliability of the other test on which it is based c. higher than the reliability of both tests on which it is based d. unrelated to the reliability of either test on which it is based
A
A review of the results of funding decisions made by the National Institutes of Health found that the lower rate of approval for Asian applicants was associated with the applicant's _____. a. citizenship status b. English-language skills c. institutional affiliation d. area of expertise
A
A test has a reliability coefficient of .77. This coefficient means that _____. a. 77% of the variance in test scores is true score variance, and 23% is error variance b. 77% of items on this test are reliable and 23% of the items are unreliable c. 23% of the variance in test scores is true score variance, and 77% is error variance d. 77% of the variance in scores is unexplained variance, and 23% is error variance
A
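In classical test theory, the reliability coefficient is the proportion of observed-score variance that is true-score variance, so the .77 interpretation above is simply the variance decomposition:

$$ r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = .77, \qquad \frac{\sigma_E^2}{\sigma_X^2} = 1 - .77 = .23 $$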
Administering a test to a group of individuals, re-administering the same test to the same group at a later time, and correlating test scores at times 1 and 2 demonstrates which method of estimating reliability? a. test-retest method b. alternative forms method c. split-half method d. internal consistency method
A
As a result of the Supreme Court's ruling in Griggs v. Duke Power, employers must be able to provide evidence that tests used to make selection and promotion decisions _____. a. measure capabilities that are specific to particular jobs or situations b. improve the overall quality of the workforce c. can be easily administered and scored by company managers or supervisors d. are well-established general measures of intelligence
A
Construct-irrelevant variance is closest to which of the following concepts? a. measurement error b. standard deviation c. coefficient of determination d. cross-validation
A
Dr. Marcus analyzes items on his physics test by calculating the proportion of students who got each item correct. Dr. Marcus is examining _____. a. item difficulty b. the guessing threshold c. item discriminability d. the antimode
A
If a correction for guessing is used on a test, it means that _____. a. test scores will be adjusted to take into account the percentage of items examinees are expected to answer correctly by chance alone b. examinees who guess on items they do not know the answer to will receive lower test scores than if they had simply left the unknown items blank c. test scores are curved to reflect the percentage of examinees who are expected to guess on 50% or more of test items d. test scores of examinees who make random guesses will be artificially inflated compared to the scores of those who do not make random guesses
A
In the context of psychological assessment, anxiety is an example of a _____. a. subject variable b. scoring or rating error c. standardized procedure d. self-serving bias
A
Sources of error associated with time-sampling (e.g., practice effects, carry-over effects) are best expressed in _____ coefficients, whereas error associated with the use of particular items is best expressed in _____ coefficients. a. test-retest reliability; internal consistency reliability b. alternate forms reliability; interrater reliability c. internal consistency reliability; test-retest reliability d. split-half reliability; alternate forms reliability
A
Strong interrater (or interjudge) reliability is probably MOST important to which type of test? a. behavior rating scales b. structured (objective) personality tests c. achievement tests d. aptitude tests
A
Test constructors can improve reliability by _____. a. increasing the number of items on a test b. decreasing the number of items on the test c. retaining items that measure sources of error variation d. increasing the number of possible responses to each item
A
Tests based on Item Response Theory (IRT) _____. a. yield scores reflecting the level of difficulty of items examinees were able to answer correctly b. are less adaptable to computer administration than are traditional tests c. may be more biased toward examinees who are slow in completing tests d. define total test scores in terms of the number of items examinees answered correctly
A
A test comprised of items to which responses can vary on a 6-point scale ranging from "Strongly agree" to "Strongly disagree" is an example of _____. a. a polytomous format b. a Likert-scale format c. a category format d. a dichotomous format
B
Developers of a sensation-seeking scale found that scores on the scale were highly correlated with self-reported frequency of alcohol and drug use. This finding most clearly provides _____ evidence for the validity of the sensation-seeking scale. a. concurrent criterion-related b. convergent construct-related c. discriminant construct-related d. content-related
B
Dr. Lansing found that the correlation between scores on a measure of anxiety she developed and scores on an existing measure of depression was .82. It is likely that reviewers of Dr. Lansing's anxiety measure will view this correlation as evidence against _____ validity. a. predictive criterion-related b. discriminant construct-related c. convergent construct-related d. content-related
B
If researchers found a criterion validity coefficient of .40 for an achievement test, it would mean that _____. a. 40% of the variation in achievement test scores can be explained by variation in the criterion. b. 16% of the variation in the criterion can be explained by variation in achievement test scores. c. 16% of the variation in achievement test scores can be explained by variation in the criterion. d. 40% of the variation in the criterion can be explained by variation in achievement test scores.
B
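The 16% figure comes from squaring the validity coefficient to get the coefficient of determination, the proportion of criterion variance explained by the test:

$$ r^2 = (.40)^2 = .16 $$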
KR20 and coefficient alpha are both measures of the extent to which items on a test are _____. a. clearly related to the construct being measured b. intercorrelated c. of an appropriate level of difficulty d. truly measuring what they purport to measure
B
Researchers have found that a _____ scale on a test employing a category format is sufficient for discriminating among individuals. a. 5-point b. 10-point c. 20-point d. 100-point
B
Validity is best understood as the _____. a. ability of the test to measure consistently across time and cultures b. extent to which the test measures what it claims to measure c. collection of information and evidence available about a test d. degree to which a test's measurements are free of errors
B
Which of the following is NOT a way to address low reliability? a. Increase sample size b. Increase the number of items c. Discriminability analysis d. Correction for attenuation
A
Which of the following is a recommendation regarding the evaluation of validity coefficients found in the Standards for Educational and Psychological Testing? a. Check that the predictor score has adequate variability and the criterion is held constant. b. Ensure that the results of the validity study are sufficiently generalizable to other groups. c. Be sure the validity coefficient is sufficiently strong, at least .30 or better for all potential subgroups. d. Be sure the validity study was conducted on a population different from the group to which inferences will be made.
B
Which phenomenon may explain 50% to 80% of the difference between males and females on the SAT Math section? a. expectancy effects b. stereotype threat c. Rosenthal effects d. the effects of hormones on brain development
B
A correlation of _____ is seen as "good enough" for basic research. a. .50 to .60 b. .60 to .70 c. .70 to .80 d. .80 to .90
C
A measure can be _____ yet not _____. a. valid; reliable b. useful; valid c. reliable; valid d. trustworthy; useful
C
Administering two supposedly equivalent forms of a test (e.g., Form A and Form B) to the same group of individuals yields a correlation coefficient indicating _____. a. test-retest reliability b. split-half reliability c. alternative forms reliability d. internal consistency reliability
C
After taking an exam in a psychology course, a student makes this comment: "All the questions on the test came from Chapter Three, but the professor said the exam was over material in Chapters One through Six! What a waste of study time!" The student's criticism of the professor's exam is most obviously related to _____ evidence of validity. a. construct-related b. concurrent criterion-related c. content-related d. predictive criterion-related
C
If Dr. Hamline wants to know whether the most difficult items on the statistics test she created were answered correctly only by the top-performing students in her class, she would need to assess _____. a. item difficulty b. the guessing threshold c. item discriminability d. the antimode
C
On a multiple choice test with four response options, the optimum item difficulty level is about _____. a. .250 b. .500 c. .625 d. .875
C
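The .625 value follows from the rule of thumb that the optimum item difficulty is halfway between the chance rate of success and 1.0; with four response options, chance is .25:

$$ p_{opt} = \frac{1.0 + .25}{2} = .625 $$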
The ability test score of ten-year-old Stacy is most likely to be highest if her examiner is _____. a. familiar to her, does not initiate conversation, and does not comment on her performance. b. unfamiliar to her, does not initiate conversation, and does not comment on her performance. c. familiar to her, initiates conversation, and praises her performance. d. unfamiliar to her, initiates conversation, and praises her performance.
C
The most we could say about a self-esteem test consisting of the items "I feel I deserve to be treated with respect" and "I think I am worthless" is that it demonstrates _____. a. content-related evidence of validity b. criterion-related evidence of validity c. face validity d. construct-related evidence of validity
C
Which of the following allows test developers to estimate what the correlation between two measures would have been if they had not been measured with error? a. standard error of measurement b. reliability of a difference score c. correction for attenuation d. Spearman-Brown prophecy formula
C
Which of the following has NOT been found with regard to the effects of reinforcement on test performance? a. Culturally-relevant verbal reinforcement has a significant effect on IQ test scores of African- American children. b. Children's socioeconomic class and gender are related to effects of reinforcement on test performance. c. Effects of tangible reinforcements such as money and candy are significantly greater than effects of verbal praise on test performance. d. None of these choices reflects the research findings on the effects of reinforcement on test performance.
C
Which of the following is NOT a potential problem created by the inclusion of ineffective distractors on a test? a. They can decrease the reliability of a test. b. They can give clues to examinees about the correct response. c. They can decrease the time examinees spend on items. d. They can decrease the validity of a test.
C
Which statement about the effect of biases on standardized test results is most accurate? a. There is no real evidence that bias impacts test results. b. Bias can impact test results only when it is overt and extreme. c. Newer evidence suggests that bias does impact test results. d. What bias might occur can easily be mitigated.
C
A study of 22 graduate students found that scoring errors on the WAIS-R diminished only after the students had completed _____ or more practice sessions. a. three b. five c. eight d. ten
D
A test comprised of items such as "I usually feel rested when I wake up in the morning" to which examinees must respond "True" or "False" is an example of a _____. a. polytomous format b. Likert-scale format c. category format d. dichotomous format
D
Computer-assisted test administration _____. a. is generally less reliable than traditional assessment b. requires that the same items be presented in the same order to test takers c. typically yields test scores somewhat lower than those of paper-and-pencil tests d. yields lower rates of scoring errors than traditional assessment procedures
D
the difference between a true score and an observed score
error
tendency for results to be influenced by what experimenters or test administrators expect to find; can be very subtle and are often unintentional or even totally unconscious
expectancy effects
usefulness of research findings in groups other than those who participated in the original validation studies
external validity
extent to which items on a test appear to be meaningful and relevant
face validity
presentation of an exam on a computer, with automatic recording of responses
interactive testing
Alpha coefficient, KR20, and split-half method are all used to calculate _____.
internal consistency
intercorrelation of items within the same test
internal consistency
general term for a set of methods used to evaluate test items (stimuli)
item analysis
graph showing the relationship between total test score and the proportion of test takers passing a particular item; total test score falls on the x-axis; the proportion who answered correctly falls on the y-axis
item characteristic curve
analysis used to assess how hard test items are; asks what proportion of test takers answered an item correctly
item difficulty
how well a stimulus performs in relation to some criterion; determines whether people who have done well on a particular item have also done well on the entire test
item discriminability
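A minimal sketch of how the two item-analysis statistics above might be computed, assuming 0/1-scored items in a NumPy array; the function name, the use of a rest-of-test point-biserial correlation for discriminability, and the sample data are illustrative assumptions, not any particular package's API.

```python
import numpy as np

def item_analysis(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """For 0/1-scored items, return (difficulty, discriminability).

    responses: (n_examinees, n_items) array of 0/1 item scores.
    difficulty[j] is the proportion answering item j correctly;
    discriminability[j] is the point-biserial correlation between
    item j and the total score on the remaining items.
    """
    n_examinees, n_items = responses.shape
    # Item difficulty: proportion of examinees answering each item correctly.
    difficulty = responses.mean(axis=0)

    discriminability = np.empty(n_items)
    for j in range(n_items):
        # Correlate each item with the rest-of-test score so the item's
        # own contribution does not inflate the correlation.
        rest = responses.sum(axis=1) - responses[:, j]
        discriminability[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discriminability

# Hypothetical data: 5 examinees, 4 items.
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])
difficulty, discriminability = item_analysis(scores)
print("difficulty:", difficulty)             # item 1: .8 (80% answered correctly)
print("discriminability:", discriminability)
```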
technique that moves away from classical test theory; focuses on the difficulty of questions that will help assess an individual's ability level; adaptive assessment method that adjusts as test taker is more or less successful on items; looks at test quality by examining chances of getting an item right or wrong
item response theory
the same construct or attribute may be assessed using any of a wide pool of items; a source of error tied to the particular items chosen
item sampling
The distribution of observed scores for repeated testing of the same person is assumed to be _____.
normal
In the equation X = T + E, what do the letters stand for?
observed score; true score; error
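For reference, the classical test theory model assumes errors are random and uncorrelated with true scores, so the variances add:

$$ X = T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2 $$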
situation where multiple judges looking at the same event may record disparate numbers
observer differences
Item sampling error can be assessed using the _____ method.
parallel forms
compares scores on two different measures of the same quality
parallel forms
assessment method to evaluate the error associated with the use of a particular set of items
parallel forms reliability
test item format in which three or more alternative responses are given for each item
polytomous format
indication that a test forecasts scores on the criterion at some future time
predictive validity evidence
Classical test theory assumes that errors of measurement are _____.
random
act of giving feedback which could impact a person taking an exam
reinforcement
the ratio of true score variance to observed score variance
reliability
when a test is split into two equal halves and each half is compared to the other; underestimates the true reliability because each half is only half as long as the full test, unless corrected with the Spearman-Brown formula
split-half method
standard deviation of the distribution of observed scores for repeated testing of the same person; a very useful index of reliability because it leads to confidence intervals
standard error of measurement
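The standard error of measurement is typically computed from the test's standard deviation s and its reliability, and it yields a confidence interval around an observed score:

$$ SEM = s\sqrt{1 - r_{XX}}, \qquad CI = X \pm z \cdot SEM $$

For example, with s = 15 and r = .84, SEM = 15√.16 = 6, so a 95% interval is X ± 1.96(6) ≈ X ± 11.8.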
situation where members of certain social groups feel pressure to disconfirm negative opinions about their group
stereotype threat
act of giving an exam
test administration
person who gives an exam
test administrator
apprehension that occurs in exam-taking situations
test anxiety
Time sampling error can be assessed using the _____ method.
test-retest
the same test given at different moments may produce different scores, even if given to the same test takers
time sampling
degree to which a certain inference or interpretation based on a test is appropriate
validity
relationship between a test and the related criterion; typically between .30 and .40 to be considered adequate; rarely exceeds .60
validity coefficient
The notion of a "rubber yardstick" explains how distributions may be more or less _____.
varied
In a recent study, teachers were trained to raise expectations for math performance for all students. A control group did not get this training. What was found? a. The achievement of male students was facilitated by the intervention, while that of female students was actually impeded. b. When teachers had strong biases against certain groups of students, that bias was removed, although there were no effects on actual student achievement. c. There were no effects on student performance, largely because teacher biases typically reflect "real" differences in ability. d. Students whose teachers had high expectations training achieved significantly greater gains in math performance in comparison to those whose teachers did not get this training.
D
Research on the effects of the examiner's race on children's IQ test performance indicates _____. a. substantial negative effects when the examiner is African-American and the child is white b. substantial negative effects when the examiner is white and the child is African American c. substantial positive effects when the examiner and the child are the same race d. minimal, if any, effects
D
The relative closeness of a person's observed score to his/her true score is estimated by the _____. a. test-retest reliability coefficient b. internal consistency reliability coefficient c. standard deviation d. standard error of measurement
D
Which of the following is NOT one of the 3 types of validity evidence? a. criterion-related b. content-related c. construct-related d. convergent-related
D
Which of the following item characteristic curves represents the best item? a. a curve that rises steadily to the midpoint of test performance and then falls steadily to the highest performance levels (i.e., an inverted U-shaped curve) b. a curve that is flat to the midpoint of test performance, rises sharply after the midpoint, and then flattens out at the highest performance levels c. a curve that drops steadily to the midpoint of test performance and then rises steadily to the highest performance levels (i.e., a U-shaped curve) d. a curve that rises steadily and smoothly to the highest performance levels
D
formula for estimating the internal consistency of a test; calculates reliability when items are scored dichotomously (e.g., right/wrong)
KR20 (Kuder-Richardson 20)
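For reference, the KR20 formula for a k-item test, where p_i is the proportion passing item i, q_i = 1 − p_i, and s² is the variance of total scores:

$$ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{s^2}\right) $$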
statistic used to address inter-rater reliability
Kappa statistic
test item format for attitude scales in which subjects indicate their degree of agreement to statements
Likert format
formula used to correct the reliability estimate obtained from the half-length (split-half) method; raises the estimate to reflect the reliability of the full-length test
Spearman-Brown
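The prophecy formula estimates the reliability r_new of a test lengthened by a factor n from the reliability r of the original; for the split-half case (n = 2) it reduces to the familiar corrected form:

$$ r_{new} = \frac{n r}{1 + (n - 1) r}, \qquad r_{full} = \frac{2 r_{half}}{1 + r_{half}} $$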
an issue when the first testing session influences scores on later sessions (a systematic rather than random change over time); practice effects are one type
carryover effects
rating-scale format that often uses the categories 1 to 10
category format
generalized method for estimating internal consistency reliability; useful for tests in which there is no single correct answer to each item
coefficient alpha
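Coefficient alpha generalizes KR20 to items that are not scored 0/1, replacing the sum of p_i q_i with the sum of the item variances s_i²:

$$ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s^2}\right) $$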
indication in which the test and the criterion are administered at the same point in time; comes from assessing the simultaneous relationship between a test and criterion
concurrent validity evidence
a range of scores within which the true value is believed to fall
confidence interval
process used to establish the meaning of a test through a series of studies
construct validity evidence
indication that the content of a test represents the conceptual domain it is designed to cover
content validity evidence
indication that a test measures the same attribute as others that supposedly measure the same thing; occurs when there is a high correlation between two or more tests that purport to assess the same criterion
convergent evidence
formula to estimate what the correlation would have been if the variables had been perfectly reliable
correction for attenuation
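The standard correction-for-attenuation formula divides the observed correlation between measures x and y by the square root of the product of their reliabilities:

$$ \hat{r} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$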
Reliability estimates are _____.
correlations
indication that a test score corresponds to an accurate measure of interest
criterion validity evidence
test item format in which there are two alternatives for each item
dichotomous format
occurs when one value is subtracted from another; most often calculated after converting to a Z score
difference score
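The low reliability of difference scores (see the multiple-choice item on difference scores above) follows from the standard formula, where r_11 and r_22 are the reliabilities of the two measures and r_12 is their correlation:

$$ r_{dd} = \frac{\tfrac{1}{2}(r_{11} + r_{22}) - r_{12}}{1 - r_{12}} $$

For example, two measures each with reliability .90 that correlate .70 yield (.90 − .70)/(1 − .70) ≈ .67.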
how well an item performs in relation to some criterion; a statistical measure of how well an item on a test differentiates among subgroups of test takers
discriminability analysis
indication that a test measures something different from what other available tests measure
discriminant evidence
set of alternatives on a multiple-choice exam that are not correct
distractors
model that uses a limited number of items (a sample) to represent a larger, more complex construct (the domain)
domain sampling method
