Test Construction EPPP Test questions


In a study examining the effects of relaxation training on test-taking anxiety, a pre-test measure of anxiety is administered to a group of self-identified highly anxious test takers resulting in a split-half reliability coefficient of .75. If the pre-test is administered to a randomly selected group of the same number of people the split-half reliability coefficient will most likely be: A. Greater than .75 B. Less than .75 C. Equal to .75 D. Impossible to predict

Correct Answer is: A A general rule for all correlation coefficients, including reliability coefficients, is that the more heterogeneous the group, i.e., the wider the variability, the higher the coefficient will be. Since a randomly selected group would be more heterogeneous than a group of highly anxious test-takers, the randomly selected group would most likely have a higher reliability coefficient.
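To make the range-restriction point concrete, here is a minimal simulation sketch (not part of the original item; all values are invented): correlating two halves of a test within a restricted, high-anxiety subgroup yields a lower coefficient than within the full, heterogeneous sample.

```python
# Illustrative sketch: range restriction lowers a correlation coefficient.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_anxiety = rng.normal(0, 1, n)            # latent anxiety level
half1 = true_anxiety + rng.normal(0, 0.5, n)  # one half of the pre-test
half2 = true_anxiety + rng.normal(0, 0.5, n)  # the other half

# Full (heterogeneous) sample vs. a restricted, high-anxiety subgroup
full_r = np.corrcoef(half1, half2)[0, 1]
high = true_anxiety > 1.0                     # "self-identified highly anxious"
restricted_r = np.corrcoef(half1[high], half2[high])[0, 1]

print(f"split-half r, full sample:       {full_r:.2f}")        # ~0.8
print(f"split-half r, restricted sample: {restricted_r:.2f}")  # noticeably lower
```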

The primary purpose of rotation in factor analysis is to A. facilitate interpretation of the data. B. improve the mathematical fit of the solution. C. obtain uncorrelated factors. D. improve the predictive validity of the factors

Correct Answer is: A Factor analysis is a statistical procedure that is designed to reduce measurements on a number of variables to fewer, underlying variables. Factor analysis is based on the assumption that variables or measures highly correlated with each other measure the same or a similar underlying construct, or factor. For example, a researcher might administer 250 proposed items on a personality test and use factor analysis to identify latent factors that could account for variability in responses to the items. These factors would then be interpreted based on logical analysis or the researcher's theories. If one of the factors identified by the analysis correlated highly with items that asked about the person's happiness, level of energy, and hopelessness, that factor might be labeled "Depressive Tendencies." In factor analysis, rotation is usually the final statistical step. Its purpose is to facilitate the interpretation of data by identifying variables that load (i.e., correlate) highly on one factor and not others.

When using a rating scale, several psychologists agree on the same diagnosis for one patient. This is a sign that the scale is A. reliable. B. valid. C. reliable and valid. D. neither reliable nor valid.

Correct Answer is: A The rating scale described by the question has good inter-rater reliability, or consistency across raters. However, it may or may not have good validity; that is, it may or may not measure what it purports to measure. The question illustrates that high reliability is a necessary but not a sufficient condition for high validity.

Discriminant and convergent validity are classified as examples of: A. construct validity. B. content validity C. face validity. D. concurrent validity.

Correct Answer is: A There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we'll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.

What are the minimum and maximum values of the standard error of measurement? A. 0 and the standard deviation of test scores B. 0 and 1 C. 1 and the standard deviation of test scores D. -1 and 1

Correct Answer is: A This question is best answered with reference to the formula for the standard error of measurement: subtract the reliability coefficient from 1, take the square root of that value, and multiply the result by the standard deviation of test scores (SEM = SD × √(1 − rxx)). You need to know the minimum and maximum values of the reliability coefficient -- 0 and +1.0, respectively. If the reliability coefficient is +1.0, you will find from the formula that the standard error of measurement is 0, which is its minimum value. And when the reliability coefficient is 0, you find from the formula that the standard error of measurement is equal to the standard deviation of test scores, which is its maximum value.
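As a quick check of the two boundary cases, here is a small sketch of the formula just described (the SD value is arbitrary):

```python
# Sketch of the formula above: SEM = SD * sqrt(1 - r_xx).
import math

def sem(sd_x: float, reliability: float) -> float:
    """Standard error of measurement."""
    return sd_x * math.sqrt(1 - reliability)

sd = 15
print(sem(sd, 1.0))  # 0.0  -> minimum: a perfectly reliable test has no error
print(sem(sd, 0.0))  # 15.0 -> maximum: equals the standard deviation of scores
```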

A common source of criterion contamination is: Select one: A. knowledge of predictor scores by the individual conducting the assessment on the criterion. B. cheating on the criterion test. C. a non-normal distribution of scores on the criterion test. D. a low range of scores on the predictor test.

Correct Answer is: A A criterion measure is one on which a predictor test attempts to predict outcome; it could be termed the "predictee." For example, if scores on a personality test were used to predict job success as measured by supervisor evaluations, the supervisor evaluations would be the criterion measure. Criterion contamination occurs when a factor irrelevant to what is being measured affects scores on the criterion. When the criterion measure is based on subjective ratings, rater knowledge of predictor scores is a common source of criterion contamination. In our example, if supervisors knew employees' results on the personality test, their evaluations might be biased based on their knowledge of these scores.

Which of the following would be used to determine how well an examinee did on a test in terms of a specific standard of performance? A. norm-referenced interpretation B. criterion-referenced interpretation C. domain-referenced interpretation D. objectives-referenced interpretation

Correct Answer is: B There are several ways an examinee's test score can be interpreted. In a criterion-referenced interpretation, which this question describes, an examinee's test performance is interpreted in terms of an external criterion, or standard of performance. In a norm-referenced interpretation, an examinee's test performance is compared to the performance of members of the norm group (other people who have taken the test). Domain-referenced interpretation is used to determine how much of a specific knowledge domain the examinee has mastered. Objectives-referenced interpretation involves interpreting an examinee's performance in terms of achievement of instructional objectives.

Which statement is most correct? A. High reliability assumes high validity. B. High validity assumes high reliability. C. Low validity assumes low reliability. D. Low reliability assumes low validity.

Correct Answer is: B This question is difficult because the language of the response choices is convoluted and imprecise. We don't write questions like this because we're sadistic; it's just that you'll sometimes see this type of language on the exam as well, and we want to prepare you. What you need to do on questions like this is bring to mind what you know about the issue being asked about, and to choose the answer that best applies. Here, you should bring to mind what you know about the relationship between reliability and validity: For a test to have high validity, it must be reliable; however, for a test to have high reliability, it does not necessarily have to be valid. With this in mind, you should see that "high validity assumes high reliability" is the best answer. This means that a precondition of high validity is high reliability. The second best choice states that low reliability assumes low validity. This is a true statement if you interpret the word "assume" to mean "implies" or "predicts." But if you interpret the word "assume" to mean "depends on" or "is preconditioned by," the statement is not correct.

When looking at an item characteristic curve (ICC), which of the following provides information about how well the item discriminates between high and low achievers? Select one: A. the Y-intercept B. the slope of the curve C. the position of the curve (left versus right) D. the position of the curve (top versus bottom)

Correct Answer is: B An item characteristic curve provides up to three pieces of information about a test item: its difficulty (the position of the curve, left versus right); its ability to discriminate between high and low scorers, or in this case achievers (the slope, which is the correct answer); and the probability of answering the item correctly just by guessing (the Y-intercept).

When conducting a factor analysis, an oblique rotation is preferred when: Select one: A. more than two factors have been extracted. B. the underlying traits are believed to be dependent. C. the assumption of homoscedasticity has been violated. D. the number of factors is equal to the number of tests.

Correct Answer is: B In the context of factor analysis, "oblique" means correlated or dependent. ("Orthogonal" means uncorrelated or independent.)

In computing test reliability, to control for practice effects one would use a(n): I. split-half reliability coefficient. II. alternative forms reliability coefficient. III. test-retest reliability coefficient. Select one: A. I and III only B. I and II only C. II and III only D. II only

Correct Answer is: B The clue here is the practice effect. That means that if you give a test, just taking it will give the person practice so that next time, he or she is not a naive person. To control for that, we want to eliminate the situation where the person is administered the same test again. So we do not use test-retest. We can use the two other methods listed. We can use split-half since, here, only one administration is used (the two parts are thought of as two different tests). And, in the alternative forms method, a different test is given the second time, controlling for the effects of taking the same test twice.

The kappa statistic is used to evaluate reliability when data are: Select one: A. interval or ratio (continuous) B. nominal or ordinal (discontinuous) C. metric D. nonlinear

Correct Answer is: B The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data are sometimes referred to as metric data.
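For illustration, here is a hand-rolled sketch of Cohen's kappa for two raters assigning nominal diagnoses; the ratings and diagnostic labels below are invented:

```python
# Sketch of Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement, from each rater's marginal proportions per category
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

r1 = ["MDD", "GAD", "MDD", "PTSD", "MDD", "GAD"]
r2 = ["MDD", "GAD", "GAD", "PTSD", "MDD", "MDD"]
print(round(cohens_kappa(r1, r2), 2))  # 0.45
```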

When constructing an achievement test, which of the following would be useful for comparing total test scores of a sample of examinees to the proportion of examinees who answer each item correctly? Select one: A. classical test theory B. item response theory C. generalizability theory D. item utility theory

Correct Answer is: B The question describes the kind of information that is provided in an item response curve, which is constructed for each item to determine its characteristics when using item response theory as the basis for test development. (Note that there is no such thing as "item utility theory.")

A test developer creates a new test of anxiety sensitivity and correlates it with an existing measure of anxiety sensitivity. The test developer is operating under the assumption that Select one: A. the new test is valid. B. the existing test is valid. C. the new test is reliable. D. the existing test is reliable.

Correct Answer is: B The question is describing an example of obtaining evidence for a test's construct validity. Construct validity refers to the degree to which a test measures a theoretical construct that it purports to measure; anxiety sensitivity is an example of a theoretical construct measured in psychological tests. A high correlation between a new test and an existing test that measures the same construct offers evidence of convergent validity, which is a type of construct validity. Another type is divergent validity, which is the degree to which a test has a low correlation with another test that measures a different construct. Correlating scores on a new test with an existing test to assess the new test's convergent validity requires an assumption that the existing test is valid; i.e., that it actually does measure the construct.

In designing a new test of a psychological construct, you correlate it with an old test the new one will replace. Your assumption in this situation is that: Select one: A. the old test is invalid. B. the old test is valid but out of date. C. the old test is better than the new test. D. the old test and the new test are both culture-fair.

Correct Answer is: B the old test is valid but out of date. This choice is the only one that makes logical sense. In the assessment of the construct validity of a new test, a common practice is to correlate that test with another test that measures the same construct. For this technique to work, the other test must be a valid measure of the construct. So in this situation, it is assumed that the old test is valid, but at the same time, it is being replaced. Of the choices listed, the correct option provides a reason why a valid test would be replaced.

If a predictor test has a validity of 1.0, the standard error of estimate would be A. equal to the standard deviation of the criterion measure. B. 1 C. 0 D. unknown; there is not enough information to answer the question.

Correct Answer is: C A validity coefficient and the standard error of estimate are both measures of the accuracy of a predictor test. A validity coefficient is the correlation between scores on a predictor and a criterion (outcome) measure. A coefficient of 1.0 reflects a perfect correlation; it would mean that one would always be able to perfectly predict, without error, the scores on the outcome measure. The standard error of estimate indicates how much error one can expect in the prediction or estimation process. If a predictor test has perfect validity, there would be no error of estimate; you would always know the exact score on the outcome measure just from the score on the predictor. Therefore, the closer the validity coefficient is to 1.0, the smaller the value of the standard error of estimate, and if the coefficient were 1.0, the standard error of estimate would be 0.
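The same boundary-case logic can be sketched with the standard formula for the standard error of estimate, SEE = SDy × √(1 − r²), where SDy is the criterion's standard deviation (the numbers below are arbitrary):

```python
# Sketch: standard error of estimate as a function of the validity coefficient.
import math

def see(sd_y: float, validity: float) -> float:
    return sd_y * math.sqrt(1 - validity ** 2)

print(see(10, 1.0))  # 0.0  -> perfect validity means error-free prediction
print(see(10, 0.0))  # 10.0 -> no validity: error equals the criterion's SD
```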

The purpose of rotation in factor analysis is to facilitate interpretation of the factors. Rotation: A. alters the factor loadings for each variable but not the eigenvalue for each factor B. alters the eigenvalue for each factor but not the factor loadings for the variables C. alters the factor loadings for each variable and the eigenvalue for each factor D. does not alter the eigenvalue for each factor nor the factor loadings for the variables

Correct Answer is: C In factor analysis, rotating the factors changes the factor loadings for the variables and the eigenvalue for each factor, although the total of the eigenvalues remains the same.

A researcher employs multiple methods of measurement in an attempt to increase reliability by reducing systematic error. This strategy is referred to as: A. calibration B. intraclass correlation (ICC) C. triangulation D. correction for attenuation

Correct Answer is: C Triangulation is the attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.

R-squared is used as an indicator of: A. The number of values that are free to vary in a statistical calculation B. The variability of scores C. How much your ability to predict is improved using the regression line D. The relationship between two variables that have a nonlinear relationship

Correct Answer is: C You might have been able to guess correctly using the process of elimination. If so, note that R-squared tells you how much your ability to predict is improved by using the regression line, compared to not using it. The most possible improvement is 1 and the least is 0. As for the other options: "the number of values that are free to vary in a statistical calculation" is the definition of degrees of freedom; "the variability of scores" is the definition of variance; and "the relationship between two variables that have a nonlinear relationship" describes the coefficient eta.
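As an illustration of "improvement in prediction," this sketch (with made-up data) compares the error of predicting every score from the mean alone against predicting from the regression line:

```python
# Sketch: r-squared = 1 - (residual error with the line) / (error with the mean).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
ss_total = np.sum((y - y.mean()) ** 2)                    # error using the mean alone
ss_residual = np.sum((y - (slope * x + intercept)) ** 2)  # error using the line

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 2))  # 0.73 -> proportional improvement in prediction
```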

A predictor that is highly sensitive for identifying the presence of a disorder would most likely result in: Select one: A. measurement error B. type II error C. a high number of false positives D. a high number of false negatives

Correct Answer is: C A predictor that is highly sensitive will more likely identify the presence of a characteristic; that is, it will result in more positives (true and false). This may be desirable when the risk of not detecting a problem is high. For example, in the detection of cancer, a blood test that results in a high number of false positives is preferable to one that has many false negatives. A positive test result can then be verified by another method, for example, a biopsy. Measurement error is the part of test scores which is due to random factors. Type II error is an error made when an experimenter erroneously retains (fails to reject) the null hypothesis.

Which statement is most true about validity? Select one: A. Validity is never higher than the reliability coefficient. B. Validity is never higher than the square of the reliability coefficient. C. Validity is never higher than the square root of the reliability coefficient. D. Validity is never higher than 1 minus the reliability coefficient.

Correct Answer is: C A test's reliability sets an upper limit on its criterion-related validity. Specifically, a test's validity coefficient can never be higher than the square root of its reliability coefficient. In practice, a validity coefficient will never be that high, but, theoretically, that's the upper limit.
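A quick sketch of that ceiling, with arbitrary reliability values:

```python
# Sketch: the theoretical upper limit on validity is sqrt(reliability).
import math

for reliability in (0.81, 0.64, 0.49):
    print(f"r_xx = {reliability:.2f} -> max validity = {math.sqrt(reliability):.2f}")
# r_xx = 0.81 -> max validity = 0.90
# r_xx = 0.64 -> max validity = 0.80
# r_xx = 0.49 -> max validity = 0.70
```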

Likert scales are most useful for: Select one: A. dichotomizing quantitative data B. quantifying objective data C. quantifying subjective data D. ordering categorical data

Correct Answer is: C Attitudes are subjective phenomena. Likert scales indicate the degree to which a person agrees or disagrees with an attitudinal statement. Using a Likert scale, attitudes are quantified, or represented in terms of ordinal scores.

An examinee obtains a score of 70 on a test that has a mean of 80, a standard deviation of 15, and a standard error of measurement of 5. The 95% confidence interval for the examinee's score is: Select one: A. 50-90 B. 55-85 C. 60-80 D. 65-75

Correct Answer is: C A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. The standard error of measurement indicates how much error an individual test score can be expected to have and is used to construct confidence intervals. To calculate the 68% confidence interval, add and subtract one standard error of measurement to the obtained score. To calculate the 95% confidence interval, add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 70. 70 +/- 10 results in a confidence interval of 60 to 80. In other words, we can be 95% confident that the examinee's true score falls between 60 and 80.
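The interval arithmetic used throughout items like this one can be sketched in a few lines (the helper function below is illustrative, not from the study set):

```python
# Sketch: confidence interval = obtained score +/- z * SEM
# (z = 1 for the 68% interval, z ~ 2 for the 95% interval).
def confidence_interval(obtained: float, sem: float, z: float = 2.0):
    return obtained - z * sem, obtained + z * sem

print(confidence_interval(70, 5, z=2))  # (60, 80) -> the 95% interval
print(confidence_interval(70, 5, z=1))  # (65, 75) -> the 68% interval
```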

In a study assessing the predictive validity of the SAT test to predict college success, it is found the SAT scores have a statistically significant correlation of .47 with the criterion, first year college GPA. A follow-up study separating the data by gender finds that for a given SAT score, the predicted GPA scores are higher for women than for men. This situation is most clearly an example of Select one: A. single group validity. B. differential validity. C. differential prediction. D. adverse impact.

Correct Answer is: C Differential prediction is a bit of a technical term, but in a non-technical way, it can be defined as a case where given scores on a predictor test predict different outcomes for different subgroups. Using the example in the question: if the average predicted GPA for men scoring 500 on the verbal SAT was 2.7, the average predicted GPA for females with the same SAT score was 3.3, and this type of difference is statistically significant across scores on the SAT, then use of the SAT would result in differential prediction based on gender.

Which of the following would be used to measure the internal consistency of a test? Select one: A. kappa coefficient B. test-retest reliability C. split-half reliability D. alternate forms reliability

Correct Answer is: C Internal consistency is one of several types of reliability. As its name implies, it is concerned with the consistency within a test, that is, the correlations among the different test items. Split-half reliability is one of the measures of internal consistency and involves splitting a test in two and correlating the two halves with each other. Other measures of internal (inter-item) consistency are the Kuder-Richardson Formula 20 (for dichotomously scored items) and Cronbach's coefficient alpha (for multiple-scored items). Test-retest reliability is not concerned with internal consistency, but rather, the stability of a test over time, and uses the correlations of scores between different administrations of the same test. Alternative forms reliability is concerned with the equivalence of different versions of a test. And the kappa coefficient is used as a measure of inter-rater reliability, that is, the amount of agreement between two raters.
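For illustration, here is a sketch of a split-half estimate: correlate odd- and even-item half scores, then apply the standard Spearman-Brown correction for full test length (the correction is not mentioned in the answer above but is routinely paired with split-half reliability; the score matrix is invented):

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

items = np.array([           # rows = examinees, columns = dichotomous items
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown: estimate full-length reliability from the half-test r
r_full = (2 * r_half) / (1 + r_half)
print(round(r_full, 2))
```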

If an examinee correctly guesses the answers to a test, the reliability coefficient: Select one: A. is not affected B. stays the same C. decreases D. increases

Correct Answer is: C Guessing is one of the factors that affect the reliability coefficient: correct guesses introduce random error into scores, so they decrease the reliability coefficient. The remaining options ("is not affected," "stays the same," and "increases") are therefore incorrect.

A way to define criterion in regard to determining criterion related validity is that the criterion is: Select one: A. The predictor test B. The validity measure C. The predictee D. The content.

Correct Answer is: C To determine criterion-related validity, scores on a predictor test are correlated with an outside criterion. The criterion is that which is being predicted, or the "predictee."

Determining test-retest reliability would be most appropriate for which of the following types of tests? A. brief B. speed C. state D. trait

Correct Answer is: D As the name implies, test-retest reliability involves administering a test to the same group of examinees at two different times and then correlating the two sets of scores. This would be most appropriate when evaluating a test that purports to measure a stable trait, since it should not be significantly affected by the passage of time between test administrations.

Computer-adaptive testing will yield Select one: A. more accurate results for high scorers on a test. B. more accurate results for low scorers on a test. C. more accurate results for examinees who score in the middle range of a test. D. equally accurate results across all range of scores on a test.

Correct Answer is: D In computerized adaptive testing, the examinee's previous responses are used to tailor the test to his or her ability level. As a result, scores are about equally accurate across all ranges of ability.

Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly? Select one: A. criterion-related validity coefficient B. item discrimination index C. item difficulty index D. item characteristic curve

Correct Answer is: D Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and probability that the item will be guessed correctly.
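One common way ICCs are modeled is the three-parameter logistic (3PL) function from item response theory; the sketch below uses arbitrary parameter values to show how difficulty, discrimination, and guessing map onto the curve:

```python
# Sketch of a 3PL item characteristic curve.
import math

def icc(theta: float, a: float, b: float, c: float) -> float:
    """P(correct | ability theta): a = discrimination (slope), b = difficulty
    (left/right position), c = guessing (lower asymptote / Y-intercept)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

for theta in (-2, 0, 2):  # low, middle, and high ability
    print(f"ability {theta:+d}: P(correct) = {icc(theta, a=1.5, b=0.0, c=0.20):.2f}")
# ability -2: P(correct) = 0.24
# ability +0: P(correct) = 0.60
# ability +2: P(correct) = 0.96
```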

As a measure of test reliability, an internal consistency coefficient would be least appropriate for a Select one: A. vocational aptitude test. B. intelligence test. C. power test. D. speed test.

Correct Answer is: D Tests can be compared to each other in terms of whether they emphasize power or speed. A pure speed test contains relatively easy items and has a strict time limit; it is designed to measure examinees' speed of response. A pure power test supplies enough time for most examinees to finish and contains items of varying difficulty. Power tests are designed to assess examinees' knowledge or ability in whatever content domain is being measured. Many tests measure both power and speed. An internal consistency reliability coefficient measures the correlation of responses to different items within the same test. On a pure speed test, all items answered are likely to be correct. As a result, the correlation between responses is artificially inflated; therefore, for speed tests, other measures of reliability, such as test-retest or alternate forms, are more appropriate.

Kuder-Richardson reliability applies to Select one: A. split-half reliability. B. test-retest stability. C. Likert scales. D. tests with dichotomously scored questions.

Correct Answer is: D The Kuder-Richardson formula is one of several statistical indices of a test's internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).

The factor loading for Test A and Factor II is .80 in a factor matrix. This means that: Select one: A. only 80% of variability in Test A is accounted for by the factor analysis B. only 64% of variability in Test A is accounted for by the factor analysis C. 80% of variability in Test A is accounted for by Factor II D. 64% of variability in Test A is accounted for by Factor II

Correct Answer is: D The correlation coefficient for a test and an identified factor is referred to as a factor loading. To obtain a measure of shared variability, the factor loading is squared. In this example, the factor loading is .80, meaning that 64% (.80 squared) of variability in the test is accounted for by the factor. The other identified factor(s) probably also account for some variability in Test A, which is why "only 64% of variability in Test A is accounted for by the factor analysis" is not the best answer.
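The arithmetic in one line (a trivial sketch):

```python
# Sketch: the variance a factor accounts for in a test is the squared loading.
loading = 0.80
print(loading ** 2)  # 0.64 -> 64% of Test A's variability, per Factor II
```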

In a clinical trial of a new drug, the null hypothesis is the new drug is, on average, no better than the current drug. It is concluded that the two drugs produce the same effect when in fact the new drug is superior. This is: Select one: A. corrected by reducing the power of the test B. corrected by reducing the sample size C. a Type I error D. a Type II error

Correct Answer is: D Type II errors occur when the null hypothesis is not rejected when it is in fact false; Type I errors are often considered more serious as the null hypothesis is wrongly rejected. For example, in the clinical trial of a new drug, this would be concluding that the new drug was better when in fact it was not. Type I and II errors are inversely related: as the probability of a Type I error increases, the probability of a Type II error decreases, and vice versa.

In the multitrait-multimethod matrix, a large heterotrait-monomethod coefficient would indicate: Select one: A. low convergent validity. B. high convergent validity. C. high divergent validity. D. low divergent validity.

Correct Answer is: D Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.

Differential prediction is one of the causes of test unfairness and occurs when: Select one: A. members of one group obtain lower scores on a selection test than members of another group, but the difference in scores is not reflected in their scores on measures of job performance B. a rater's knowledge of ratees' performance on the predictor biases his/her ratings of ratees' performance on the criterion C. a predictor's validity coefficient differs for different groups D. a test has differential validity

Correct Answer is: A As described in the Federal Uniform Guidelines on Employee Selection, differential prediction is a potential cause of test unfairness. Differential prediction occurs when the use of scores on a selection test systematically over- or under-predicts the job performance of members of one group as compared to members of another group. As for the other options: criterion contamination occurs when a rater's knowledge of ratees' performance on the predictor biases his/her ratings of ratees' performance on the criterion. Differential validity, also a possible cause of adverse impact, occurs when a predictor's validity coefficient differs for different groups. And when a test has differential validity, there is a slope bias; slope bias refers to differences in the slope of the regression line.

A company wants its clerical employees to be very efficient, accurate and fast. Examinees are given a perceptual speed test on which they indicate whether two names are exactly identical or slightly different. The reliability of the test would be best assessed by: Select one: A. test-retest B. Cronbach's coefficient alpha C. split-half D. Kuder-Richardson Formula 20

Correct Answer is: A Perceptual speed tests are highly speeded and are comprised of very easy items that every examinee, it is assumed, could answer correctly with unlimited time. The best way to estimate the reliability of speed tests is to administer separately timed forms and correlate these, therefore using a test-retest or alternate forms coefficient would be the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability.

Which of the following descriptive words for tests are most opposite in nature? Select one: A. speed and power B. subjective and aptitude C. norm-referenced and standardized D. maximal and ipsative

Correct Answer is: A Pure speed tests and pure power tests are opposite ends of a continuum. A speed test is one with a strict time limit and easy items that most or all examinees are expected to answer correctly. Speed tests measure examinees' response speed. A power test is one with no or a generous time limit but with items ranging from easy to very difficult (usually ordered from least to most difficult). Power tests measure level of content mastered.

Cluster analysis would most likely be used to Select one: A. construct a "taxonomy" of criminal personality types. B. obtain descriptive information about a particular case. C. test the hypothesis that an independent variable has an effect on a dependent variable. D. test statistical hypotheses when the assumption of independence of observations is violated.

Correct Answer is: A The purpose of cluster analysis is to place objects into categories. More technically, the technique is designed to help one develop a taxonomy, or classification system of variables. The results of a cluster analysis indicate which variables cluster together into categories. The technique is sometimes used to divide a population of individuals into subtypes.

If, in a normally-shaped distribution, the mean is 100 and the standard error of measurement is 10, what would the 68% confidence interval be for an examinee who receives a score of 95? Select one: A. 85 to 105 B. 90 to 100 C. 90 to 110 D. impossible to calculate without the reliability coefficient

Correct Answer is: A The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. To calculate the 68% confidence interval we simply add and subtract one standard error of measurement to the obtained score. "Impossible to calculate without the reliability coefficient" is incorrect because although the reliability coefficient is needed to calculate a standard error of measurement, in this case we are provided with the standard error.

Raising the cutoff score on a predictor test would have the effect of A. increasing true positives B. decreasing false positives C. decreasing true negatives D. decreasing false negatives.

Correct Answer is: B A simple way to answer this question is with reference to a chart such as the one displayed under the topic "Criterion-Related Validity" in the Psychology-Test Construction section of your materials. If you look at this chart, you can see that increasing the predictor cutoff score (i.e., moving the vertical line to the right) decreases the number of false positives as well as true positives (you can also see that the number of both true and false negatives would be increased). You can also think about this question more abstractly by coming up with an example. Imagine, for instance, that a general knowledge test is used as a predictor of job success. If the cutoff score on this test is raised, fewer people will score above this cutoff and, therefore, fewer people will be predicted to be successful. Another way of saying this is that fewer people will come up "positive" on this predictor. This applies to both true positives and false positives.
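The effect can also be demonstrated with a small simulation (all data invented; "positive" here means predicted successful):

```python
# Sketch: raising the predictor cutoff shrinks both true and false positives.
import numpy as np

rng = np.random.default_rng(1)
predictor = rng.normal(50, 10, 1000)
actual = predictor + rng.normal(0, 8, 1000) > 50   # imperfectly related outcome

for cutoff in (45, 55, 65):
    predicted_pos = predictor > cutoff
    tp = np.sum(predicted_pos & actual)
    fp = np.sum(predicted_pos & ~actual)
    print(f"cutoff {cutoff}: true positives = {tp}, false positives = {fp}")
# Both counts fall as the cutoff rises; negatives (true and false) rise.
```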

The item difficulty ("p") index yields information about the difficulty of test items in terms of a(n) _________ scale of measurement. A. nominal B. ordinal C. interval D. ratio

Correct Answer is: B An item difficulty index indicates the percentage of individuals who answer a particular item correctly. For example, if an item has a difficulty index of .80, it means that 80% of test-takers answered the item correctly. Although it appears that the item difficulty index is a ratio scale of measurement, according to Anastasi (1982) it is actually an ordinal scale because it does not necessarily indicate equivalent differences in difficulty.

If, in a normally-shaped distribution, the mean is 100 and the standard error of measurement is 5, what would the 95% confidence interval be for an examinee who receives a score of 90? A. 75-105 B. 80-100 C. 90-100 D. 95-105

Correct Answer is: B The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. To calculate the 95% confidence interval we simply add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 90. 90 +/- 10 results in a confidence interval of 80 to 100. In other words, we can be 95% confident that the examinee's true score falls within 80 and 100.

A person obtains a raw score of 70 on a Math test with a mean of 50 and an SD of 10; a percentile rank of 84 on a History test; and a T-score of 65 on an English test. What is the relative order of each of these scores? Select one: A. History >> Math >> English B. Math >> History >> English C. History >> English >> Math D. Math >> English >> History

Correct Answer is: D Before we can compare different forms of scores, we must transform them into some form of standardized measure. On a Math test with a mean of 50 and an SD of 10, a raw score of 70 falls 2 standard deviations above the mean. Assuming a normal distribution of scores, a percentile rank of 84 on a History test is equivalent to 1 standard deviation above the mean. If you haven't memorized that, you could still figure it out: Remember that 50% of all scores in a normal distribution fall below the mean and 50% fall above the mean. And 68% of scores fall within +/- 1 SD of the mean. If you divide 68% by 2, you get 34% (the percentage of scores that fall between 0 and +1 SD). If you then add that 34% to the 50% that fall below the mean, you get a percentile rank of 84. Thus, the 84th percentile score is equivalent to 1 SD above the mean. Finally, looking at the T-score on the English test: we know that T-scores always have a mean of 50 and an SD of 10, so a T-score of 65 is equivalent to 1½ standard deviations above the mean. Comparing the 3 test scores, we find the highest score was in Math at 2 SDs above the mean, followed by English at 1½ SDs above the mean, and History at 1 SD above the mean.
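The whole comparison reduces to converting each score to a z-score, sketched below:

```python
# Sketch: put all three scores on the z-score scale, then rank them.
from statistics import NormalDist

math_z = (70 - 50) / 10                 # raw score vs. mean 50, SD 10 -> 2.0
history_z = NormalDist().inv_cdf(0.84)  # 84th percentile -> ~1.0
english_z = (65 - 50) / 10              # T-scores have mean 50, SD 10 -> 1.5

for name, z in sorted([("Math", math_z), ("English", english_z),
                       ("History", history_z)], key=lambda p: -p[1]):
    print(f"{name}: z = {z:.2f}")
```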

A negative item discrimination (D) indicates: A. an index equal to zero. B. more high-achieving examinees than low-achieving examinees answered the item correctly. C. an item was answered correctly by the same number of low- and high-achieving students. D. more low-achieving examinees answered the item correctly than high-achieving.

Correct Answer is: D The discrimination index, D, ranges from +1.0 to -1.0 and is the number of people in the upper (high-scoring) group who answered the item correctly minus the number of people in the lower-scoring group who answered the item correctly, divided by the number of people in the larger of the two groups. An item will have a discrimination index equal to zero if everyone gets it correct or incorrect. A negative item discrimination index indicates that the item was answered correctly by more low-achieving students than by high-achieving students. In other words, a poor student may make a guess, select that response, and come up with the correct answer without any real understanding of what is being assessed. Good students (like EPPP candidates), on the other hand, may be suspicious of a question that looks too easy, may read too much into the question, and may end up being less successful than those who guess.
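A sketch of the computation, assuming equal-sized upper and lower groups (the counts below are invented):

```python
# Sketch of the discrimination index D described above.
def discrimination_index(upper_correct: int, lower_correct: int,
                         group_size: int) -> float:
    """D = (correct in upper group - correct in lower group) / group size."""
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(18, 6, 20))   # +0.6 -> item favors high achievers
print(discrimination_index(6, 18, 20))   # -0.6 -> more low achievers got it right
print(discrimination_index(20, 20, 20))  # 0.0 -> everyone answered correctly
```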

The slope of the item response curve, with respect to item response theory, indicates an item's: A. reliability B. validity C. difficulty D. discriminability

Correct Answer is: D The item response curve provides information about an item's difficulty; its ability to discriminate between those who are high and low on the characteristic being measured; and the probability of correctly answering the item by guessing. The position of the curve indicates its difficulty, and the steeper the slope of the curve, the better its ability to discriminate (the correct response) between examinees who are high and low on the characteristic being measured. The item response curve does not indicate reliability or validity (the incorrect options).

If someone achieves a score of 30 on a standardized test with a standard error of measurement of 5 points, what is the approximate probability that the person's true score is between 20 and 40? A. 2/3 B. 4/5 C. 9/10 D. 19/20

Correct Answer is: D The range given in the question encompasses the obtained score plus and minus two standard errors of measurement. That is, the standard error of measurement is 5, and the range is 10 points (2 X 5) below the obtained score to 10 points above the obtained score. One thing you should memorize is that there is about a 95% probability that a person's true score lies within two standard errors of measurement of the obtained score. But even if you knew that, you might have had trouble with this question, since the choices are given as fractions, not percentages. To answer it correctly, you needed to figure out that 19/20 is the fractional equivalent of 95%.

In the multitrait-multimethod matrix, a low heterotrait-monomethod coefficient would indicate: A. low convergent validity. B. low divergent validity. C. high convergent validity. D. high divergent validity.

Correct Answer is: D Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have good divergent validity if it had a low correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is not measuring traits that are unrelated to depression.

If a job selection test has lower validity for Hispanics as compared to White or African-Americans, you could say that ethnicity is acting as a: Select one: A. confounding variable B. criterion contaminator C. discriminant variable D. moderator variable

Correct Answer is: D A moderator variable is any variable which moderates, or influences, the relationship between two other variables. If the validity of a job selection test is different for different ethnic groups (i.e. there is differential validity), then ethnicity would be considered a moderator variable since it is influencing the relationship between the test (predictor) and actual job performance (the criterion). A confounding variable is a variable in a research study which is not of interest to the researcher, but which exerts a systematic effect on the DV. Criterion contamination is the artificial inflation of validity which can occur when raters subjectively score ratees on a criterion measure after they have been informed how the ratees scored on the predictor.

Confidence intervals are used in order to: Select one: A. calculate the test's mean B. calculate the standard deviation C. calculate the standard error of measurement D. estimate true scores from obtained scores

Correct Answer is: D Confidence intervals allow us to determine the range within which an examinee's true score on a test is likely to fall, given his or her obtained score. The standard error of measurement is used to construct confidence intervals, not the other way around.

