Testing & Measurement - Chapters 4-7

Researchers often look for reliability of at least _____.

.9

The reliability of a difference score is expected to be _____. a. lower than the reliability of either test on which it is based b. higher than the reliability of one test and lower than the reliability of the other test on which it is based c. higher than the reliability of both tests on which it is based d. unrelated to the reliability of either test on which it is based

A

A review of the results of funding decisions made by the National Institutes of Health found that the lower rate of approval for Asian applicants was associated with the applicant's _____. a. citizenship status b. English-language skills c. institutional affiliation d. area of expertise

A

A test has a reliability coefficient of .77. This coefficient means that _____. a. 77% of the variance in test scores is true score variance, and 23% is error variance b. 77% of items on this test are reliable and 23% of the items are unreliable c. 23% of the variance in test scores is true score variance, and 77% is error variance d. 77% of the variance in scores is unexplained variance, and 23% is error variance

A

Administering a test to a group of individuals, re-administering the same test to the same group at a later time, and correlating test scores at times 1 and 2 demonstrates which method of estimating reliability? a. test-retest method b. alternative forms method c. split-half method d. internal consistency method

A

As a result of the Supreme Court's ruling in Griggs v. Duke Power, employers must be able to provide evidence that tests used to make selection and promotion decisions _____. a. measure capabilities that are specific to particular jobs or situations b. improve the overall quality of the workforce c. can be easily administered and scored by company managers or supervisors d. are well-established general measures of intelligence

A

Construct-irrelevant variance is closest to which of the following concepts? a. measurement error b. standard deviation c. coefficient of determination d. cross-validation

A

Dr. Marcus analyzes items on his physics test by calculating the proportion of students who got each item correct. Dr. Marcus is examining _____. a. item difficulty b. the guessing threshold c. item discriminability d. the antimode

A

If a correction for guessing is used on a test, it means that _____. a. test scores will be adjusted to take into account the percentage of items examinees are expected to answer correctly by chance alone b. examinees who guess on items they do not know the answer to will receive lower test scores than if they had simply left the unknown items blank c. test scores are curved to reflect the percentage of examinees who are expected to guess on 50% or more of test items d. test scores of examinees who make random guesses will be artificially inflated compared to the scores of those who do not make random guesses

A

In the context of psychological assessment, anxiety is an example of a _____. a. subject variable b. scoring or rating error c. standardized procedure d. self-serving bias

A

Sources of error associated with time-sampling (e.g., practice effects, carry-over effects) are best expressed in _____ coefficients, whereas error associated with the use of particular items is best expressed in _____ coefficients. a. test-retest reliability; internal consistency reliability b. alternate forms reliability; interrater reliability c. internal consistency reliability; test-retest reliability d. split-half reliability; alternate forms reliability

A

Strong interrater (or interjudge) reliability is probably MOST important to which type of test? a. behavior rating scales b. structured (objective) personality tests c. achievement tests d. aptitude tests

A

Test constructors can improve reliability by _____. a. increasing the number of items on a test b. decreasing the number of items on the test c. retaining items that measure sources of error variation d. increasing the number of possible responses to each item

A

Tests based on Item Response Theory (IRT) _____. a. yield scores reflecting the level of difficulty of items examinees were able to answer correctly b. are less adaptable to computer administration than are traditional tests c. may be more biased toward examinees who are slow in completing tests d. define total test scores in terms of the number of items examinees answered correctly

A

A test composed of items to which responses can vary on a 6-point scale ranging from "Strongly agree" to "Strongly disagree" is an example of _____. a. a polytomous format b. a Likert-scale format c. a category format d. a dichotomous format

B

Developers of a sensation-seeking scale found that scores on the scale were highly correlated with self-reported frequency of alcohol and drug use. This finding most clearly provides _____ evidence for the validity of the sensation-seeking scale. a. concurrent criterion-related b. convergent construct-related c. discriminant construct-related d. content-related

B

Dr. Lansing found that the correlation between scores on a measure of anxiety she developed and scores on an existing measure of depression was .82. It is likely that reviewers of Dr. Lansing's anxiety measure will view this correlation as evidence against _____ validity. a. predictive criterion-related b. discriminant construct-related c. convergent construct-related d. content-related

B

If researchers found a criterion validity coefficient of .40 for an achievement test, it would mean that _____. a. 40% of the variation in achievement test scores can be explained by variation in the criterion. b. 16% of the variation in the criterion can be explained by variation in achievement test scores. c. 16% of the variation in achievement test scores can be explained by variation in the criterion. d. 40% of the variation in the criterion can be explained by variation in achievement test scores.

B
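
The answer follows from squaring the validity coefficient to get the coefficient of determination. A minimal Python check (the .40 value comes from the question above; everything else is arithmetic):

```python
# Squaring a validity coefficient gives the coefficient of determination:
# the proportion of criterion variance explained by test scores.
validity_coefficient = 0.40
variance_explained = round(validity_coefficient ** 2, 4)

print(variance_explained)  # 0.16, i.e., 16% of criterion variance
```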

KR20 and coefficient alpha are both measures of the extent to which items on a test are _____. a. clearly related to the construct being measured b. intercorrelated c. of an appropriate level of difficulty d. truly measuring what they purport to measure

B

Researchers have found that a _____ scale on a test employing a category format is sufficient for discriminating among individuals. a. 5-point b. 10-point c. 20-point d. 100-point

B

Validity is best understood as the _____. a. ability of the test to measure consistently across time and cultures b. extent to which the test measures what it claims to measure c. collection of information and evidence available about a test d. degree to which a test's measurements are free of errors

B

Which of the following is NOT a way to address low reliability? a. Increase sample size b. Increase the number of items c. Discriminability analysis d. Correction for attenuation

B

Which of the following is a recommendation regarding the evaluation of validity coefficients found in the Standards for Educational and Psychological Testing? a. Check that the predictor score has adequate variability and the criterion is held constant. b. Ensure that the results of the validity study are sufficiently generalizable to other groups. c. Be sure the validity coefficient is sufficiently strong, at least .30 or better for all potential subgroups. d. Be sure the validity study was conducted on a population different from the group to which inferences will be made.

B

Which phenomenon may explain 50% to 80% of the difference between males and females on the SAT Math section? a. expectancy effects b. stereotype threat c. Rosenthal effects d. the effects of hormones on brain development

B

A correlation of _____ is seen as "good enough" for basic research. a. .50 to .60 b. .60 to .70 c. .70 to .80 d. .80 to .90

C

A measure can be _____ yet not _____. a. valid; reliable b. useful; valid c. reliable; valid d. trustworthy; useful

C

Administering two supposedly equivalent forms of a test (e.g., Form A and Form B) to the same group of individuals yields a correlation coefficient indicating _____. a. test-retest reliability b. split-half reliability c. alternative forms reliability d. internal consistency reliability

C

After taking an exam in a psychology course, a student makes this comment: "All the questions on the test came from Chapter Three, but the professor said the exam was over material in Chapters One through Six! What a waste of study time!" The student's criticism of the professor's exam is most obviously related to _____ evidence of validity. a. construct-related b. concurrent criterion-related c. content-related d. predictive criterion-related

C

If Dr. Hamline wants to know whether the most difficult items on the statistics test she created were answered correctly only by the top-performing students in her class, she would need to assess _____. a. item difficulty b. the guessing threshold c. item discriminability d. the antimode

C

On a multiple choice test with four response options, the optimum item difficulty level is about _____. a. .250 b. .500 c. .625 d. .875

C
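
The .625 value is the midpoint between chance-level success and perfect performance. A small sketch of that rule (the function name is illustrative):

```python
def optimal_difficulty(num_options):
    """Midpoint between chance success (1/k) and perfect performance (1.0)."""
    chance = 1.0 / num_options
    return (1.0 + chance) / 2.0

print(optimal_difficulty(4))  # four options: (1 + .25) / 2 = 0.625
print(optimal_difficulty(2))  # true/false:  (1 + .50) / 2 = 0.75
```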

The ability test score of ten-year-old Stacy is most likely to be highest if her examiner is _____. a. familiar to her, does not initiate conversation, and does not comment on her performance. b. unfamiliar to her, does not initiate conversation, and does not comment on her performance. c. familiar to her, initiates conversation, and praises her performance. d. unfamiliar to her, initiates conversation, and praises her performance.

C

The most we could say about a self-esteem test consisting of the items "I feel I deserve to be treated with respect" and "I think I am worthless" is _____. a. content-related evidence of validity b. criterion-related evidence of validity c. face validity d. construct-related evidence of validity

C

Which of the following allows test developers to estimate what the correlation between two measures would have been if they had not been measured with error? a. standard error of measurement b. reliability of a difference score c. correction for attenuation d. Spearman-Brown prophecy formula

C

Which of the following has NOT been found with regard to the effects of reinforcement on test performance? a. Culturally-relevant verbal reinforcement has a significant effect on IQ test scores of African-American children. b. Children's socioeconomic class and gender are related to effects of reinforcement on test performance. c. Effects of tangible reinforcements such as money and candy are significantly greater than effects of verbal praise on test performance. d. None of these choices reflects the research findings on the effects of reinforcement on test performance.

C

Which of the following is NOT a potential problem created by the inclusion of ineffective distractors on a test? a. They can decrease the reliability of a test. b. They can give clues to examinees about the correct response. c. They can decrease the time examinees spend on items. d. They can decrease the validity of a test.

C

Which statement about the effect of biases on standardized test results is most accurate? a. There is no real evidence that bias impacts test results. b. Bias can impact test results only when it is overt and extreme. c. Newer evidence suggests that bias does impact test results. d. What bias might occur can easily be mitigated.

C

A study of 22 graduate students found that scoring errors on the WAIS-R diminished only after the students had completed _____ or more practice sessions. a. three b. five c. eight d. ten

D

A test composed of items such as "I usually feel rested when I wake up in the morning" to which examinees must respond "True" or "False" is an example of a _____. a. polytomous format b. Likert-scale format c. category format d. dichotomous format

D

Computer-assisted test administration _____. a. is generally less reliable than traditional assessment b. requires that the same items be presented in the same order to test takers c. typically yields test scores somewhat lower than those of paper-and-pencil tests d. yields lower rates of scoring errors than traditional assessment procedures

D

the difference between a true score and an observed score

error

tendency for results to be influenced by what experimenters or test administrators expect to find; can be very subtle and is often unintentional or even totally unconscious

expectancy effects

usefulness of research findings in groups other than those who participated in the original validation studies

external validity

extent to which items on a test appear to be meaningful and relevant

face validity

presentation of an exam on a computer, with automatic recording of responses

interactive testing

Alpha coefficient, KR20, and split-half method are all used to calculate _____.

internal consistency

intercorrelation of items within the same test

internal consistency

set of methods used to evaluate test stimuli; general term for a set of methods used to evaluate test items

item analysis

graph showing the relationship between total test score and the proportion of test takers passing a particular item; total test score falls on the x-axis; the proportion who answered correctly falls on the y-axis

item characteristic curve

analysis used to assess how hard stimuli are; asks what percent of people got an item correct

item difficulty
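
Item difficulty is just the proportion of examinees who pass the item. A minimal sketch, assuming responses are coded 1 = correct, 0 = incorrect:

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

print(item_difficulty([1, 1, 0, 1, 0]))  # 3 of 5 correct -> 0.6
```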

how well a stimulus performs in relation to some criterion; determines whether people who have done well on a particular item have also done well on the entire test

item discriminability
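
One common way to quantify discriminability is the extreme-group discrimination index: the proportion passing the item among high scorers minus the proportion passing among low scorers. A sketch under that assumption (the 27% extreme-group fraction is a conventional choice, not from this set):

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Extreme-group index: proportion passing the item among top scorers
    minus proportion passing among bottom scorers (0/1 item responses)."""
    n = max(1, int(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = order[:n], order[-n:]
    p_top = sum(item_correct[i] for i in top) / n
    p_bottom = sum(item_correct[i] for i in bottom) / n
    return p_top - p_bottom
```

A positive index means high scorers pass the item more often than low scorers, which is what a well-functioning item should show.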

technique that moves away from classical test theory; focuses on the difficulty of questions that will help assess an individual's ability level; adaptive assessment method that adjusts as test taker is more or less successful on items; looks at test quality by examining chances of getting an item right or wrong

item response theory

same construct or attribute may be assessed using a wide pool of units

item sampling

The distribution of observed scores for repeated testing of the same person is _____.

normal

In the equation X = T + E, what do the letters stand for?

observed score; true score; error
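
The equation can be illustrated by simulation: add random error to a fixed true score, and the observed scores scatter around T. A sketch with arbitrary illustrative values (true score 100, error SD 5):

```python
import random

random.seed(42)

true_score = 100                                # T: the error-free score
# X = T + E, with E drawn from a normal distribution (mean 0, SD 5)
observed_scores = [true_score + random.gauss(0, 5) for _ in range(10_000)]

mean_observed = sum(observed_scores) / len(observed_scores)
# Because errors are random, the observed scores average out near T.
print(round(mean_observed, 1))
```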

situation where multiple judges looking at the same event may record disparate numbers

observer differences

Item sampling error can be assessed using the _____ method.

parallel forms

compares scores on two different measures of the same quality

parallel forms

assessment method to evaluate the error associated with the use of a particular set of items

parallel forms reliability

test item format in which three or more alternative responses are given for each item

polytomous format

indication that a test forecasts scores on the criterion at some future time

predictive validity evidence

Classical test theory assumes that errors of measurement are _____.

random

act of giving feedback which could impact a person taking an exam

reinforcement

the ratio of true variability to observed variability

reliability

when a test is split into two equal halves and each half is compared to the other; underestimates the true reliability

split-half method

standard deviation of a set of observations for the same test; a very useful index of reliability because it leads to confidence intervals

standard error of estimate

situation where members of certain social groups feel pressure to disconfirm negative opinions about their group

stereotype threat

act of giving an exam

test administration

person who gives an exam

test administrator

apprehension that occurs in exam-taking situations

test anxiety

Time sampling error can be assessed using the _____ method.

test-retest

the same test given at different moments may produce different scores, even if given to the same test takers

time sampling

degree to which a certain inference or interpretation based on a test is appropriate

validity

relationship between a test and the related criterion; typically between .30 and .40 to be considered adequate; rarely exceeds .60

validity coefficient

The notion of a "rubber yardstick" explains how distributions may be more or less _____.

varied

In a recent study, teachers were trained to raise expectations for math performance for all students. A control group did not get this training. What was found? a. The achievement of male students was facilitated by the intervention, while that of female students was actually impeded. b. When teachers had strong biases against certain groups of students, that bias was removed, although there were no effects on actual student achievement. c. There were no effects on student performance, largely because teacher biases typically reflect "real" differences in ability. d. Students whose teachers had high expectations training achieved significantly greater gains in math performance in comparison to those whose teachers did not get this training.

D

Research on the effects of the examiner's race on children's IQ test performance indicates _____. a. substantial negative effects when the examiner is African-American and the child is white b. substantial negative effects when the examiner is white and the child is African American c. substantial positive effects when the examiner and the child are the same race d. minimal, if any, effects

D

The relative closeness of a person's observed score to his/her true score is estimated by the _____. a. test-retest reliability coefficient b. internal consistency reliability coefficient c. standard deviation d. standard error of measurement

D
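
The standard error of measurement is computed as SD × √(1 − reliability), and it supports confidence intervals around an observed score. A sketch with illustrative values (the SD of 15 and reliability of .91 are assumptions, not from this set):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative: an IQ-style scale with SD = 15 and reliability = .91
sem = standard_error_of_measurement(15, 0.91)  # ≈ 4.5
# 95% confidence interval around an observed score of 110
lo, hi = 110 - 1.96 * sem, 110 + 1.96 * sem
```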

Which of the following is NOT one of the 3 types of validity evidence? a. criterion-related b. content-related c. construct-related d. convergent-related

D

Which of the following item characteristic curves represents the best item? a. a curve that rises steadily to the midpoint of test performance and then falls steadily to the highest performance levels (i.e., an inverted U-shaped curve) b. a curve that is flat to the midpoint of test performance, rises sharply after the midpoint, and then flattens out at the highest performance levels c. a curve that drops steadily to the midpoint of test performance and then rises steadily to the highest performance levels (i.e., a U-shaped curve) d. a curve that rises steadily and smoothly to the highest performance levels

D

formula for estimating the internal consistency of a test; calculates reliability for tests whose items are dichotomous (e.g., scored right/wrong or true/false)

KR20 (Kuder-Richardson 20)
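
A minimal sketch of the KR-20 computation, assuming rows are examinees, columns are 0/1 items, population variances are used, and total scores are not all identical (illustrative code, not the textbook's):

```python
def kr20(item_matrix):
    """KR-20 for dichotomous (0/1) items; rows = examinees, cols = items."""
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of p*q (item pass rate times fail rate) over all items
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)
```

When examinees respond with perfect consistency across items, the formula returns 1.0.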

statistic used to address inter-rater reliability

Kappa statistic

test item format for attitude scales in which subjects indicate their degree of agreement to statements

Likert format

formula used to correct the half-length (split-half) estimate; increases the estimate of reliability to reflect the full-length test

Spearman-Brown
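
The Spearman-Brown formula projects reliability when test length changes; with a factor of 2 it corrects a half-length (split-half) estimate up to the full-length value. A minimal sketch:

```python
def spearman_brown(r, factor=2.0):
    """Projected reliability when test length changes by `factor`
    (factor=2 corrects a half-length split-half estimate)."""
    return factor * r / (1 + (factor - 1) * r)

print(spearman_brown(0.70))  # a split-half r of .70 corrects to ≈ .82
```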

an issue when random changes occur over time; can include practice effects

carryover effects

rating-scale format that often uses the categories 1 to 10

category format

generalized method for estimating reliability; useful for tests where there is no single correct answer

coefficient alpha
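
Coefficient alpha generalizes KR-20 to items scored on any scale: k/(k−1) times (1 − sum of item variances / variance of total scores). A sketch using population variances (the helper names are illustrative):

```python
def cronbach_alpha(item_matrix):
    """Coefficient alpha; rows = examinees, cols = items (any scoring scale)."""
    n = len(item_matrix)
    k = len(item_matrix[0])

    def pvar(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(pvar([row[j] for row in item_matrix]) for j in range(k))
    total_var = pvar([sum(row) for row in item_matrix])
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

For 0/1 items this reduces to KR-20.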

indication in which the test and the criterion are administered at the same point in time; comes from assessing the simultaneous relationship between a test and criterion

concurrent validity evidence

a range of likely scores within which the correct value is believed to exist

confidence interval

process used to establish the meaning of a test through a series of studies

construct validity evidence

indication that the content of a test represents the conceptual domain it is designed to cover

content validity evidence

indication that a test measures the same attribute as others that supposedly measure the same thing; occurs when there is a high correlation between two or more tests that purport to assess the same criterion

convergent evidence

formula to estimate what the correlation would have been if the variables had been perfectly reliable

correction for attenuation
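
The correction divides the observed correlation by the square root of the product of the two reliabilities. A sketch with illustrative numbers (the observed r of .30 and reliabilities of .70 and .80 are assumptions, not from this set):

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation if both measures were perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(correct_for_attenuation(0.30, 0.70, 0.80))  # ≈ 0.40
```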

Reliability estimates are _____.

correlations

indication that a test score corresponds to an accurate measure of interest

criterion validity evidence

test item format in which there are two alternatives for each item

dichotomous format

occurs when one value is subtracted from another; most often calculated after converting to a Z score

difference score

how well an item performs in relation to some criterion; a statistical measure of how well an item on a test differentiates among subgroups of test takers

discriminability analysis

indication that a test measures something different from what other available tests measure

discriminant evidence

set of alternatives on a multiple-choice exam that are not correct

distractors

use of a limited number of measurements (a sample) to describe a larger, more complex construct

domain sampling method

