Test Construction - Exam Qs

A common source of criterion contamination is: Select one: A.knowledge of predictor scores by the individual conducting the assessment on the criterion B.cheating on the criterion test. C.a non-normal distribution of scores on the criterion test. D.a low range of scores on the predictor test.

Correct Answer is: A A criterion measure is the outcome measure that a predictor test attempts to predict; it could be termed the "predictee." For example, if scores on a personality test were used to predict job success as measured by supervisor evaluations, the supervisor evaluations would be the criterion measure. Criterion contamination occurs when a factor irrelevant to what is being measured affects scores on the criterion. When the criterion measure is based on subjective ratings, rater knowledge of predictor scores is a common source of criterion contamination. In our example, if supervisors knew employees' results on the personality test, their evaluations might be biased by their knowledge of these scores.

A teacher seeks consultation from an educational psychologist in order to pinpoint a child's specific problems in mathematics ability. The psychologist will most likely utilize a: Select one: A.domain-referenced test B.norm-referenced test C.criterion-referenced test D.predictive-referenced test

Correct Answer is: A A domain-referenced test draws items from a very precisely defined content area or domain and can identify specific strengths and weaknesses. For example, a domain-referenced test might be limited to questions assessing the ability to add two three-digit whole numbers rather than multiple areas such as addition, subtraction, multiplication, and division. Most achievement tests in schools are norm-referenced tests, which differentiate students at different levels of achievement but do not identify specific strengths or weaknesses. Criterion-referenced tests are used to interpret test performance based on whether or not students reach an established criterion (e.g., all students will reach a 90% accuracy level).

Discriminant and convergent validity are classified as examples of: Select one: A.construct validity. B.content validity C.face validity. D.concurrent validity.

Correct Answer is: A There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we'll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.

To determine two raters' level of agreement on a test you would use: Select one: A.Kappa coefficient B.Discriminant validity C.Percentage of agreement D.Item response theory

Correct Answer is: A There are a number of ways to estimate interscorer reliability, but the most common involves calculating a correlation coefficient between the scores assigned by two different raters. The Kappa coefficient is a measure of the agreement between two judges who each rate a set of objects using nominal scales.
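
The kappa calculation can be sketched in Python, using the standard formula kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal proportions (the ratings below are hypothetical):

```python
# Cohen's kappa for two raters using nominal categories -- a minimal sketch.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement, from each rater's marginal proportions per category
    m1, m2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    p_e = sum((m1[c] / n) * (m2[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two raters assigning hypothetical diagnostic categories to 10 cases
r1 = ["A", "A", "B", "B", "A", "C", "C", "A", "B", "A"]
r2 = ["A", "A", "B", "A", "A", "C", "B", "A", "B", "A"]
print(round(cohens_kappa(r1, r2), 3))  # 0.661
```

Unlike a raw percentage of agreement (here 80%), kappa corrects for the agreement the two judges would reach by chance alone.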

Rotation is used in factor analysis to: A.get an easier pattern of factor loadings to interpret. B.increase the magnitude of the communalities. C.reduce the magnitude of the communalities. D.reduce the effects of measurement error on the factor loadings.

Correct Answer is: A Factors are rotated to obtain a pattern that's easier to interpret since the pattern of factor loadings in the initial factor matrix is often difficult to interpret. Rotation alters the magnitude of the factor loadings but not the magnitude of the communalities ("increase the magnitude of the communalities" and "reduce the magnitude of the communalities") and does not reduce the effects of measurement error ("reduce the effects of measurement error on the factor loadings").
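
The point that rotation alters the factor loadings but not the communalities can be illustrated with a toy two-factor loading matrix (all values made up). A communality is the sum of a variable's squared loadings, and an orthogonal rotation leaves that sum unchanged:

```python
# Orthogonal rotation of a 2-factor loading matrix -- a minimal sketch.
import math

def rotate(loadings, degrees):
    """Rotate each variable's pair of loadings by the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[a * c - b * s, a * s + b * c] for a, b in loadings]

def communalities(loadings):
    # Sum of squared loadings for each variable, rounded for display
    return [round(sum(x * x for x in row), 4) for row in loadings]

initial = [[0.7, 0.5], [0.6, 0.6], [0.8, 0.2]]  # hypothetical initial matrix
rotated = rotate(initial, 30)

print(communalities(initial))
print(communalities(rotated))  # identical: rotation preserves communalities
```

The individual loadings in `rotated` differ from those in `initial`, which is exactly what makes the rotated pattern easier to interpret, but each row's communality is unchanged.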

Determining test-retest reliability would be most appropriate for which of the following types of tests? Select one: A.brief B.speed C.state D.trait

Correct Answer is: D As the name implies, test-retest reliability involves administering a test to the same group of examinees at two different times and then correlating the two sets of scores. This would be most appropriate when evaluating a test that purports to measure a stable trait, since it should not be significantly affected by the passage of time between test administrations.
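
The procedure amounts to a Pearson correlation between the two sets of scores; a minimal sketch with hypothetical scores for six examinees:

```python
# Test-retest reliability as a Pearson correlation -- a minimal sketch.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

time1 = [12, 15, 18, 22, 25, 30]  # first administration
time2 = [13, 14, 19, 21, 26, 29]  # same examinees, retested later
print(round(pearson_r(time1, time2), 3))  # 0.987
```

For a stable trait, the coefficient should be high, as here; a state measure would be expected to show a much lower test-retest coefficient.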

Limited "floor" would be the biggest problem when a test will be used to Select one: A.distinguish between mildly and moderately intellectually disabled children. B.distinguish between above-average and gifted students. C.distinguish between successful and unsuccessful trainees. D.distinguish between satisfied and dissatisfied customers.

Correct Answer is: A Floor refers to a test's ability to distinguish between examinees at the low end of the distribution, which would be an issue when distinguishing between those with mild versus moderate intellectual disability. A limited floor occurs when the test does not contain enough easy items. Note that "ceiling" would be the concern for tests designed to distinguish between examinees at the high end of the distribution, such as distinguishing between above-average and gifted students.

A measure of relative strength of a score within an individual is referred to as a(n): Select one: A.ipsative score B.normative score C.standard score D.independent variable

Correct Answer is: A Ipsative scores report an examinee's scores using the examinee himself or herself as the frame of reference. They indicate the relative strength of a score within an individual but, unlike normative measures, do not provide the absolute strength of a domain relative to a normative group. An example of an ipsative score is the result of a forced-choice measure.

The primary purpose of rotation in factor analysis is to: A.facilitate interpretation of the data. B.improve the mathematical fit of the solution. C.obtain uncorrelated factors. D.improve the predictive validity of the factors.

Correct Answer is: A Factor analysis is a statistical procedure that is designed to reduce measurements on a number of variables to fewer, underlying variables. Factor analysis is based on the assumption that variables or measures highly correlated with each other measure the same or a similar underlying construct, or factor. For example, a researcher might administer 250 proposed items on a personality test and use factor analysis to identify latent factors that could account for variability in responses to the items. These factors would then be interpreted based on logical analysis or the researcher's theories. If one of the factors identified by the analysis correlated highly with items that asked about the person's happiness, level of energy, and hopelessness, that factor might be labeled "Depressive Tendencies." In factor analysis, rotation is usually the final statistical step. Its purpose is to facilitate the interpretation of data by identifying variables that load (i.e., correlate) highly on one factor and not others.

In a study examining the effects of relaxation training on test-taking anxiety, a pre-test measure of anxiety is administered to a group of self-identified highly anxious test takers resulting in a split-half reliability coefficient of .75. If the pre-test is administered to a randomly selected group of the same number of people the split-half reliability coefficient will most likely be: Select one: A.Greater than .75 B.Less than .75 C.Equal to .75 D.impossible to predict

Correct Answer is: A A general rule for all correlation coefficients, including reliability coefficients, is that the more heterogeneous the group, i.e., the wider the variability, the higher the coefficient will be. Since a randomly selected group would be more heterogeneous than a group of highly anxious test-takers, the randomly selected group would most likely have a higher reliability coefficient.
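
The effect of range restriction on a correlation can be simulated. The sketch below generates scores under assumed parameters (a true correlation of about .70) and compares the coefficient for the full, heterogeneous sample with the coefficient for a restricted, highly anxious subgroup:

```python
# Range restriction lowers a correlation coefficient -- a simulated sketch.
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(42)
anxiety = [random.gauss(0, 1) for _ in range(2000)]
# criterion built to correlate ~.70 with anxiety in the full population
criterion = [0.7 * a + 0.714 * random.gauss(0, 1) for a in anxiety]

r_full = pearson_r(anxiety, criterion)  # heterogeneous (random) group
high = [(a, c) for a, c in zip(anxiety, criterion) if a > 1]  # restricted group
r_restricted = pearson_r([a for a, _ in high], [c for _, c in high])

print(r_full > r_restricted)  # True: wider variability, higher coefficient
```

The same logic applies to reliability coefficients, which is why the randomly selected group in the question would most likely yield a split-half coefficient greater than .75.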

A large monotrait-heteromethod coefficient in a multitrait-multimethod matrix indicates evidence of: A.convergent validity B.concurrent validity C.predictive validity D.discriminant validity

Correct Answer is: A A multitrait-multimethod matrix is a complicated method for assessing convergent and discriminant validity. Convergent validity requires that different ways of measuring the same trait yield the same result. Monotrait-heteromethod coefficients are correlations between two measures that assess the same trait using different methods; therefore, if a test has convergent validity, this correlation should be high. Heterotrait-monomethod and heterotrait-heteromethod coefficients both provide evidence of discriminant validity, and monotrait-monomethod coefficients are reliability coefficients.

A school psychologist is asked to work with a child whose on-task behavior is poor. To monitor the child's on-task behavior, the psychologist is most likely to train the teacher or teacher's assistant in Select one: A.interval recording. B.frequency recording. C.continuous recording. D.duration recording.

Correct Answer is: A All the choices refer to methods of recording behaviors that can be used by observational raters or researchers. In interval recording (the correct answer), the rater observes a subject at given intervals and notes whether or not the subject is engaging in the target behavior during that interval. For instance, a rater might observe a student for 10 seconds every three minutes and record whether or not the student is on-task during those 10 seconds. Interval recording is most useful for behaviors that do not have a fixed beginning or end, such as being on task. Frequency recording involves keeping count of the number of times a behavior occurs; this would not be practical for keeping track of whether or not a person is on task. Continuous recording involves recording all the behaviors of the target subject during each observation session. Although it's possible to keep track of whether a person is on-task using this method, it is not as practical or meaningful for this purpose as interval recording. Finally, duration recording involves recording the elapsed time during which the target behavior or behaviors occur. This would not be practical for a behavior that has no fixed beginning or end.

Differential prediction is one of the causes of test unfairness and occurs when: A.members of one group obtain lower scores on a selection test than members of another group, but the difference in scores is not reflected in their scores on measures of job performance B.a rater's knowledge of ratees' performance on the predictor biases his/her ratings of ratees' performance on the criterion C.a predictor's validity coefficient differs for different groups D.a test has differential validity

Correct Answer is: A As described in the federal Uniform Guidelines on Employee Selection Procedures, differential prediction is a potential cause of test unfairness. Differential prediction occurs when the use of scores on a selection test systematically over- or under-predicts the job performance of members of one group as compared to members of another group. Criterion contamination, not differential prediction, occurs when a rater's knowledge of ratees' performance on the predictor biases his or her ratings of ratees' performance on the criterion. Differential validity, also a possible cause of adverse impact, occurs when a predictor's validity coefficient differs for different groups. When a test has differential validity, there is slope bias; slope bias refers to differences in the slope of the regression line across groups.

A factory requires all job applicants to complete a Biographical Information Blank (BIB) which asks, among other things, for details about the applicant's personal interests and skills. Many of the applicants, upon seeing the test, become very angry and subsequently file a class action suit against the company. The problem with this test seems to be that it lacks: Select one: A.face validity B.content validity C.construct validity D.predictive validity

Correct Answer is: A Biographical Information Blanks (BIB) have actually been found to be highly predictive of job success and only slightly less valid than cognitive ability tests for predicting job performance. However, they often lack face validity since some of the questions do not appear to the applicants to have anything to do with job performance.

Adding more items to a test would most likely: Select one: A.increase the test's reliability B.decrease the test's validity C.have no effect on the test's reliability or validity D.preclude the use of the Spearman-Brown prophecy formula

Correct Answer is: A Lengthening a test, that is, adding more test items, generally results in an increase in the test's reliability. For example, a test consisting of only 3 questions would probably be more reliable if we added 10 more items. The Spearman-Brown formula is specifically used to estimate the reliability of a test if it were lengthened or shortened.
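
The Spearman-Brown prophecy formula is r' = nr / (1 + (n - 1)r), where r is the current reliability and n is the ratio of new length to old length. The sketch below applies it to the example above, a 3-item test (assumed reliability of .50, a made-up figure) lengthened to 13 items:

```python
# Spearman-Brown prophecy formula -- a minimal sketch.
def spearman_brown(r, n):
    """r = current reliability; n = ratio of new test length to old length."""
    return (n * r) / (1 + (n - 1) * r)

# A 3-item test with reliability .50, lengthened to 13 items (n = 13/3):
print(round(spearman_brown(0.50, 13 / 3), 4))  # 0.8125
# Doubling any test with reliability .50:
print(round(spearman_brown(0.50, 2), 4))       # 0.6667
```

The formula works in both directions: an n less than 1 estimates the (lower) reliability of a shortened test.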

All of the following are norm-referenced scores except: Select one: A.pass/fail B.grade-equivalent scores C.T-score D.percentile rank

Correct Answer is: A Norm-referenced scores indicate how well an individual performed on a test compared to others in the norm group. A pass or fail score achieved by one individual does not indicate how many others passed or failed. Pass/fail is a criterion-referenced score, which indicates whether or not an individual knows the exam content but does not measure performance relative to other examinees. The other three options are norm-referenced scores: a grade-equivalent score permits a test user to compare an individual's exam performance to others in different grade levels; a T-score is a type of standard score, a norm-referenced score indicating how a test-taker performed in terms of standard deviation units from the mean score of the norm group; and a percentile rank shows the percent of individuals in the norm group who scored lower.

If you find that your job selection measure yields too many "false positives," what could you do to correct the problem? Select one: A.raise the predictor cutoff score and/or lower the criterion cutoff score B.raise the predictor cutoff score and/or raise the criterion cutoff score C.lower the predictor cutoff score and/or raise the criterion cutoff score D.lower the predictor cutoff score and/or lower the criterion cutoff score

Correct Answer is: A On a job selection test, a "false positive" is someone who is identified by the test as successful but who does not turn out to be successful, as measured by a performance criterion. If you raise the selection test cutoff score, you will reduce false positives, since, by making it harder to "pass" the test, you will be ensuring that the people who do pass are more qualified and therefore more likely to be successful. By lowering the criterion score, what you are in effect doing is making your definition of success more lax. It therefore becomes easier to be considered successful, and many of the people who were false positives will now be considered true positives.
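
Counting the outcomes directly makes the logic concrete. The sketch below uses a small, made-up set of (predictor, criterion) pairs, where a false positive is anyone who passes the predictor cutoff but falls below the criterion cutoff:

```python
# False positives at different predictor cutoffs -- a minimal sketch.
def false_positives(data, predictor_cut, criterion_cut):
    """Count applicants who pass the predictor but fail the criterion."""
    return sum(p >= predictor_cut and c < criterion_cut for p, c in data)

# (predictor score, criterion score) for ten hypothetical applicants
data = [(55, 40), (60, 45), (65, 70), (70, 50), (75, 80),
        (80, 55), (85, 90), (90, 85), (95, 60), (98, 95)]

print(false_positives(data, predictor_cut=60, criterion_cut=70))  # 4
print(false_positives(data, predictor_cut=85, criterion_cut=70))  # 1
```

Raising the predictor cutoff from 60 to 85 cuts the false positives from four to one; lowering the criterion cutoff would have a similar effect by reclassifying some false positives as true positives.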

A company wants its clerical employees to be very efficient, accurate and fast. Examinees are given a perceptual speed test on which they indicate whether two names are exactly identical or slightly different. The reliability of the test would be best assessed by: A.test-retest B.Cronbach's coefficient alpha C.split-half D.Kuder-Richardson Formula 20

Correct Answer is: A Perceptual speed tests are highly speeded and are comprised of very easy items that every examinee, it is assumed, could answer correctly with unlimited time. The best way to estimate the reliability of speed tests is to administer separately timed forms and correlate these, therefore using a test-retest or alternate forms coefficient would be the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability. These are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. However, they are not appropriate for assessing the reliability of speed tests because they tend to produce spuriously high coefficients.

Item analysis is a procedure used to: Select one: A.Determine which items will be retained for the final version of the test B.Refer to the degree to which items differentiate among examinees C.Graph depictions of percentages of people D.Help the IRS with an audit

Correct Answer is: A The procedure used to determine what items will be retained for the final version of a test is the definition of item analysis. The degree to which items discriminate among examinees is the definition of Item Discrimination. A graph that depicts percentages of people is an item characteristic curve.

Cluster analysis would most likely be used to Select one: A.construct a "taxonomy" of criminal personality types. B.obtain descriptive information about a particular case. C.test the hypothesis that an independent variable has an effect on a dependent variable. D.test statistical hypotheses when the assumption of independence of observations is violated.

Correct Answer is: A The purpose of cluster analysis is to place objects into categories. More technically, the technique is designed to help one develop a taxonomy, or classification system of variables. The results of a cluster analysis indicate which variables cluster together into categories. The technique is sometimes used to divide a population of individuals into subtypes.

When using a rating scale, several psychologists agree on the same diagnosis for one patient. This is a sign that the scale is Select one: A.reliable. B.valid. C.reliable and valid. D.neither reliable nor valid.

Correct Answer is: A The rating scale described by the question has good inter-rater reliability, or consistency across raters. However, it may or may not have good validity; that is, it may or may not measure what it purports to measure. The question illustrates that high reliability is a necessary but not a sufficient condition for high validity.

Different regression line slopes in a scatterplot suggests: Select one: A.differential validity B.a lack of factorial validity C.divergent validity D.a lack of convergent validity

Correct Answer is: A The slope of a regression line for a test is directly related to the test's criterion-related validity: the steeper the slope, the greater the validity. A test has differential validity when it has different validity coefficients for different groups, which is what different regression line slopes in a scatterplot suggest. Factorial validity refers to the extent to which a test or test item loads on the factors it is expected to load on in a factor analysis. The extent to which a test does not correlate with measures of an unrelated construct is referred to as divergent validity. Convergent validity refers to the degree to which a test correlates with measures of the same or a similar construct.

A percentile rank is Select one: A.a norm-referenced score, but not a standard score. B.a standard score, but not a norm-referenced score. C.a standard score and a norm-referenced score. D.neither a standard score nor a norm-referenced score.

Correct Answer is: A To answer this question, you have to be able to define and understand three terms: norm-referenced, standard score, and percentile rank. A norm-referenced score is one that is interpreted in terms of a comparison to others who have taken the same test. A standard score is a type of norm-referenced score that is interpreted in terms of how many standard deviation units a score falls above or below the mean. Examples include z-scores and T-scores. A percentile rank indicates the percentage of scores that fall below a given score. For example, a person who achieves a percentile rank of 90 on the SAT scored better than 90% of others who took the test. Since interpretation of percentile ranks involves a comparison between scorers, a percentile rank is a norm-referenced score. However, since it is not interpreted in terms of standard deviation units, it is not a standard score.
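
Computing a percentile rank as described, using a made-up norm group of ten scores:

```python
# Percentile rank: percent of norm-group scores below a given score.
def percentile_rank(score, norm_scores):
    below = sum(s < score for s in norm_scores)
    return 100 * below / len(norm_scores)

norms = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
print(percentile_rank(82, norms))  # 80.0: better than 80% of the norm group
```

Note that the result depends entirely on the other scores in the norm group, which is what makes it norm-referenced, and that nothing in the calculation involves standard deviation units, which is why it is not a standard score.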

All of the following statements regarding item response theory are true, except Select one: A.it cannot be applied in the attempt to develop culture-fair tests. B.it's a useful theory in the development of computer programs designed to create tests tailored to the individual's level of ability. C.one of its assumptions is that test items measure a "latent trait." D.it usually has little practical significance unless one is working with very large samples.

Correct Answer is: A Item response theory is a highly technical mathematical approach to item analysis. Its use is based on a number of complex mathematical assumptions. One of these assumptions, known as invariance of item parameters, holds that the characteristics of items should be the same for all theoretically equivalent groups of subjects chosen from the same population. Thus, any culture-fair test should demonstrate such invariance; i.e., a set of items shouldn't have a different set of characteristics for minority and non-minority subgroups. For this reason, item response theory has been applied to the development of culture-fair tests, so the statement that it cannot be applied to them is not true. The other choices are all true statements about item response theory, and therefore incorrect answers to this question. Item response theory is the theoretical basis of computer adaptive assessment, in which tests tailored to the examinee's ability level are computer generated. One of its assumptions is that test items measure a "latent trait," such as intelligence or general ability. And, finally, research supports the notion that the assumptions of item response theory only hold true for very large samples.

On the MMPI-2, what percentage of the general population the test is intended for can be expected to obtain a T-score between 40 and 60 on the depression scale? A.50 B.68 C.95 D.99

Correct Answer is: B A T-score is a standardized score. Standardization involves converting raw scores into scores that indicate how many standard deviations the values are above or below the mean. A T-score is a standard score with a mean of 50 and a standard deviation of 10. Results of personality inventories such as the MMPI-2 are commonly reported in terms of T-scores. Other standard scores include z-scores, with a mean of 0 and a standard deviation of 1, and IQ scores, with a mean of 100 and a standard deviation of 15. When values are normally distributed in a population, standardization facilitates interpretation of test scores by making it easier to see where a test-taker stands on the variable in relation to others in the population. This is because, due to the properties of a normal distribution, one always knows the percentage of cases that are within standard deviation ranges of the mean. For example, in a normal distribution, 68.26% of scores will fall within one standard deviation of the mean, or in a T-score distribution, between 40 and 60, so 68% is the best answer to this question. Another example: 95.44% of scores fall within two standard deviations of the mean; therefore, 4.56% will have scores 2 standard deviation units or more above or below the mean. By dividing 4.56 in half, we can see that 2.28% of test-takers will score 70 or above on any MMPI scale, and 2.28% will score 30 or below.
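
The 68.26% figure follows from the standard normal distribution; a quick check using the normal CDF (via the error function in the standard library):

```python
# Proportion of a normal distribution within one SD of the mean,
# i.e., T-scores between 40 and 60 (mean 50, SD 10) -- a minimal sketch.
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal at z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

within_one_sd = normal_cdf(1) - normal_cdf(-1)
print(round(100 * within_one_sd, 2))  # 68.27
```

The same function reproduces the other figures quoted above: `normal_cdf(2) - normal_cdf(-2)` gives about 95.44%, and `1 - normal_cdf(2)` gives about 2.28%, the proportion scoring at or above a T-score of 70.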

Raising the cutoff score on a predictor test would have the effect of Select one: A.increasing true positives B.decreasing false positives C.decreasing true negatives D.decreasing false negatives.

Correct Answer is: B A simple way to answer this question is with reference to a chart such as the one displayed under the topic "Criterion-Related Validity" in the Psychology-Test Construction section of your materials. If you look at this chart, you can see that increasing the predictor cutoff score (i.e., moving the vertical line to the right) decreases the number of false positives as well as true positives; you can also see that the number of both true and false negatives would be increased. You can also think about this question more abstractly by coming up with an example. Imagine, for instance, that a general knowledge test is used as a predictor of job success. If the cutoff score on this test is raised, fewer people will score above this cutoff and, therefore, fewer people will be predicted to be successful. Another way of saying this is that fewer people will come up "positive" on this predictor. This applies to both true positives and false positives.

The exam score and the ______________ are necessary to calculate the 68% confidence interval for an examinee's obtained test score. Select one: A.standard deviation B.standard error of measurement C.standard error of estimate D.test's mean

Correct Answer is: B Adding and subtracting one standard error of measurement to and from the examinee's obtained test score yields a 68% confidence interval. The standard error of measurement (calculated from the test's standard deviation and reliability coefficient) is needed to determine a confidence interval around an obtained test score. While the standard deviation is needed to calculate the standard error of measurement, it cannot be used to determine a confidence interval by itself, and the standard error of estimate is used to construct a confidence interval around a predicted criterion score. The standard error of measurement is a statistic indicating how much an obtained test score can be expected to deviate from the "true" test score; it is used to construct a confidence interval indicating where an examinee's true score is likely to fall, given the obtained score. If reliability is perfect (i.e., the reliability coefficient = +1.0), the standard error of measurement is 0.
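
The calculation can be sketched directly, using the standard formula SEM = SD * sqrt(1 - reliability); the SD, reliability, and obtained score below are hypothetical:

```python
# Standard error of measurement and the 68% confidence interval
# around an obtained score -- a minimal sketch.
import math

def sem(sd, reliability):
    """Standard error of measurement from the SD and reliability."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval_68(obtained, sd, reliability):
    """Obtained score plus or minus one SEM."""
    e = sem(sd, reliability)
    return (obtained - e, obtained + e)

# SD = 10 and reliability = .84 give SEM = 10 * sqrt(.16) = 4
print(round(sem(10, 0.84), 2))  # 4.0
lo, hi = confidence_interval_68(100, 10, 0.84)
print(round(lo, 1), round(hi, 1))  # 96.0 104.0
```

Note that with reliability of 1.0 the SEM is 0, and with reliability of 0 the SEM equals the test's standard deviation, matching the limiting cases described above.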

When looking at an item characteristic curve (ICC), which of the following provides information about how well the item discriminates between high and low achievers? Select one: A.the Y-intercept B.the slope of the curve C.the position of the curve (left versus right) D.the position of the curve (top versus bottom)

Correct Answer is: B An item characteristic curve provides up to three pieces of information about a test item: its difficulty (the position of the curve, left versus right); its ability to discriminate between high and low scorers, or in this case achievers (the slope, the correct answer); and the probability of answering the item correctly just by guessing (the Y-intercept).
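Those three pieces of information map directly onto the parameters of the three-parameter logistic (3PL) model from item response theory. A minimal sketch, where the parameter values are illustrative assumptions rather than values from any real item:

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve.
    a = discrimination (slope of the curve),
    b = difficulty (left/right position),
    c = guessing parameter (lower asymptote, i.e., the Y-intercept floor)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b the curve sits halfway between the guessing floor c and 1.0:
# here, 0.2 + 0.8 * 0.5 = 0.6
print(icc_3pl(0.0, a=1.5, b=0.0, c=0.2))
```

A steeper slope (larger a) means the probability of a correct answer rises more sharply around the difficulty point, which is exactly what "discriminates between high and low achievers" means.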

Adding more easy to moderately easy items to a difficult test will: Select one: A.increase the test's floor. B.decrease the test's floor C.alter the test's floor only if there is an equal number of difficult to moderately difficult items. D.have no effect on the test's floor.

Correct Answer is: B As you may have guessed, "floor" refers to the lowest scores on a test (ceiling refers to the highest scores). Adding more easy to moderately easy items would lower, or decrease, the floor, allowing for better discrimination among people at the low end.

An athlete is requested to take a drug screening test used to identify individuals with performance enhancing substances in their systems. Despite the player's actual usage of steroids, the test fails to identify the substances. In the context of decision-making theory, this individual is a: Select one: A.false positive B.false negative C.true positive D.true negative

Correct Answer is: B Based on performance on the predictor and the criterion, individuals may be classified as false positives, true positives, false negatives, or true negatives. False negatives, like the athlete, are not identified as having used substances when, in fact, they have. Conversely, false positives are identified by the drug screening test as having used or having substances present, when they have not. True positives are individuals identified by the screening test as having substances present and they do. True negatives are individuals not identified by the screening test to have substances present and do not.

An eigenvalue is the: Select one: A.proportion of variance attributable to two or more factors B.amount of variance in all the tests accounted for by a factor C.effect of one independent variable, without consideration of the effects of other independent variables. D.strength of the relationship between factors

Correct Answer is: B In a factor analysis or principal components analysis, an eigenvalue indicates the amount of variance in all the tests accounted for by a factor. proportion of variance attributable to two or more factors This choice describes "communality," which is another outcome of a factor analysis. effect of one independent variable, without consideration of the effects of other independent variables. This is the definition of a "main effect."

A test has a reliability coefficient of .90. What percentage of variability among examinees on this test is due to true score differences? Select one: A.1 B.0.9 C.0.81 D.0.5

Correct Answer is: B In most cases, you would square the correlation coefficient to obtain the answer to this question. However, the reliability coefficient is an exception to this rule: it is never squared. Instead, it is interpreted directly. This means that the value of the reliability coefficient itself indicates the proportion of variance in a test that reflects true variance.

A condition necessary for pooled variance is: Select one: A.unequal sample sizes B.equal sample sizes C.unequal covariances D.equal covariances

Correct Answer is: B Pooled variance is the weighted average of the variances of each group, weighted by the number of subjects in each group. Use of a pooled variance assumes that the population variances are approximately the same, even though the sample variances differ. When the population variances are known or can be assumed to be equal, the estimate may be labeled "equal variances assumed," "common variance," or "pooled variance." "Equal variances not assumed," or separate variances, is appropriate for normally distributed individual values when the population variances are known to be unequal or cannot be assumed to be equal.
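The weighting can be sketched in Python; the two groups below are invented data, and the weights are each group's degrees of freedom (n - 1):

```python
from statistics import variance

def pooled_variance(*samples):
    """Weighted average of sample variances, weighted by df (n - 1)."""
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    denominator = sum(len(s) - 1 for s in samples)
    return numerator / denominator

group_a = [1, 2, 3]      # sample variance 1.0, df = 2
group_b = [2, 4, 6, 8]   # sample variance 20/3, df = 3
# pooled = (2*1 + 3*(20/3)) / (2 + 3) = 22/5 = 4.4
print(pooled_variance(group_a, group_b))
```

Because the weights are sample sizes (minus one), unequal group sizes are handled automatically; what the estimate assumes is that both groups share a common population variance.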

Which of the following illustrates the concept of shrinkage? Select one: A.extremely depressed individuals obtain a high score on a depression inventory the first time they take it, but obtain a slightly lower score the second time they take it B.items that have collectively been shown to be a valid way to diagnose a sample of individuals as depressed prove to be less valid when used for a different sample C.the self-esteem of depressed individuals shrinks when they are faced with very difficult tasks D.abilities such as short-term memory and response speed diminish as we get older

Correct Answer is: B Shrinkage can be an issue when a predictor test is developed by testing out a pool of items on a validation ("try-out") sample and then choosing the items that have the highest correlation with the criterion. When the chosen items are administered to a second sample, they usually don't work quite as well -- in other words, the validity coefficient shrinks. This occurs because of chance factors operating in the original validation sample that are not present in the second sample.

The purpose of rotation in factor analysis is to facilitate interpretation of the factors. Rotation: Select one: A.alters the factor loadings for each variable but not the eigenvalue for each factor B.alters the eigenvalue for each factor but not the factor loadings for the variables C.alters the factor loadings for each variable and the eigenvalue for each factor D.does not alter the eigenvalue for each factor nor the factor loadings for the variables

Correct Answer is: C In factor analysis, rotating the factors changes the factor loadings for the variables and eigenvalue for each factor although the total of the eigenvalues remains the same.

In computing test reliability, to control for practice effects one would use a(n):I. split-half reliability coefficient.II. alternative forms reliability coefficient.III. test-retest reliability coefficient. Select one: A.I and III only B.I and II only C.II and III only D.II only

Correct Answer is: B The clue here is the practice effect. That means that if you give a test, just taking it will give the person practice so that next time, he or she is not a naive person. To control for that, we want to eliminate the situation where the person is administered the same test again. So we do not use test-retest. We can use the two other methods listed. We can use split-half since, here, only one administration is used (the two parts are thought of as two different tests). And, in the alternative forms method, a different test is given the second time, controlling for the effects of taking the same test twice.

What value is preferred for the average item difficulty level in order to maximize the size of a test's reliability coefficient? Select one: A.10 B.0.5 C.1 D.0

Correct Answer is: B The item difficulty index ranges from 0 to 1 and indicates the proportion of examinees who answered the item correctly. Items with a moderate difficulty level, typically 0.5, are preferred because they help to maximize the test's reliability.
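The index itself is just a proportion. A sketch with made-up response data (1 = correct, 0 = incorrect):

```python
def item_difficulty(responses):
    """Item difficulty index p: proportion of examinees answering correctly (0 to 1)."""
    return sum(responses) / len(responses)

print(item_difficulty([1, 1, 0, 0]))  # 0.5 -- moderate difficulty, preferred
print(item_difficulty([1, 1, 1, 1]))  # 1.0 -- everyone correct: no variability, no information
```

An item everyone passes (p = 1.0) or everyone fails (p = 0.0) contributes no score variability, which is why moderate values around 0.5 maximize reliability.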

A kappa coefficient of .93 would indicate that the two tests Select one: A.measure what they are supposed to. B.have a high degree of agreement between their raters. C.aren't especially reliable. D.present test items with a high level of difficulty.

Correct Answer is: B The kappa coefficient is used to evaluate inter-rater reliability. A coefficient in the lower .90s indicates high reliability. This option ("measure what they are supposed to") is a layman's definition of the general concept of validity.
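Kappa corrects the raters' observed agreement for the agreement expected by chance. A hand-rolled sketch; the rating data below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    p_chance = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Two raters agree on 3 of 4 cases (p_observed = .75); chance agreement is .5
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))  # 0.5
```

Raw percent agreement alone overstates reliability when one category is common; kappa in the low .90s, as in the question, indicates agreement well beyond chance.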

The appropriate kind of validity for a test depends on the test's purpose. For example, for the psychology licensing exam: Select one: A.construct validity is most important because it measures the hypothetical trait of "competence." B.content validity is most important because it measures knowledge of various content domains in the field of psychology. C.criterion-related validity is most important because it predicts which psychologists will and will not do well as professionals. D.no evidence of validity is required.

Correct Answer is: B The psychology licensing exam is considered a measure of knowledge of various areas in the field of psychology and, therefore, is essentially an achievement-type test. Measures of content knowledge should have adequate content validity.

Which of the following procedures involves identifying the underlying structure in a set of variables? Select one: A.multiple regression B.factor analysis C.canonical correlation D.discriminant analysis

Correct Answer is: B The purpose of factor analysis is to determine the degree to which many tests or variables are measuring fewer, underlying constructs. For example, factor analyses of the WAIS-III have suggested that four factors -- verbal comprehension, perceptual organization, processing speed, and working memory -- explain, to a large degree, scores on the fourteen subtests. Another way of saying this is that a factor analysis helps to identify the underlying structure in a set of variables.

A test developer creates a new test of anxiety sensitivity and correlates it with an existing measure of anxiety sensitivity. The test developer is operating under the assumption that Select one: A.the new test is valid. B.the existing test is valid. C.the new test is reliable. D.the existing test is reliable.

Correct Answer is: B The question is describing an example of obtaining evidence for a test's construct validity. Construct validity refers to the degree to which a test measures a theoretical construct that it purports to measure; anxiety sensitivity is an example of a theoretical construct measured in psychological tests. A high correlation between a new test and an existing test that measures the same construct offers evidence of convergent validity, which is a type of construct validity. Another type is divergent validity, which is the degree to which a test has a low correlation with another test that measures a different construct. Correlating scores on a new test with an existing test to assess the new test's convergent validity requires an assumption that the existing test is valid; i.e., that it actually does measure the construct.

If, in a normally-shaped distribution, the mean is 100 and the standard error of measurement is 5, what would the 95% confidence interval be for an examinee who receives a score of 90? Select one: A.75-105 B.80-100 C.90-100 D.95-105

Correct Answer is: B The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. To calculate the 95% confidence interval we simply add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 90. 90 +/- 10 results in a confidence interval of 80 to 100. In other words, we can be 95% confident that the examinee's true score falls between 80 and 100.

Which of the following would be used to determine how well an examinee did on a test in terms of a specific standard of performance? Select one: A.norm-referenced interpretation B.criterion-referenced interpretation C.domain-referenced interpretation D.objectives-referenced interpretation

Correct Answer is: B There are several ways an examinee's test score can be interpreted. In this question, a criterion-referenced interpretation, an examinee's test performance is interpreted in terms of an external criterion, or standard of performance. norm-referenced interpretation In a norm-referenced interpretation, an examinee's test performance is compared to the performance of members of the norm group (other people who have taken the test). domain-referenced interpretation Domain-referenced interpretation is used to determine how much of a specific knowledge domain the examinee has mastered. objectives-referenced interpretation Objectives-referenced interpretation involves interpreting an examinee's performance in terms of achievement of instructional objectives.

A test with limited ceiling would have a ____________ distribution shape. Select one: A.normal B.flat C.positively skewed D.negatively skewed

Correct Answer is: D A test with limited ceiling has an inadequate number of difficult items resulting in few low scores. Therefore the distribution would be negatively skewed.

Which of the following statements is not true regarding concurrent validity? Select one: A.It is used to establish criterion-related validity. B.It is appropriate for tests designed to assess a person's future status on a criterion. C.It is obtained by collecting predictor and criterion scores at about the same time. D.It indicates the extent to which a test yields the same results as other measures of the same phenomenon.

Correct Answer is: B There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI and measure the concurrent validity of the two tests.

Which statement is most correct? Select one: A.High reliability assumes high validity. B.High validity assumes high reliability. C.Low validity assumes low reliability. D.Low reliability assumes low validity.

Correct Answer is: B This question is difficult because the language of the response choices is convoluted and imprecise. We don't write questions like this because we're sadistic; it's just that you'll sometimes see this type of language on the exam as well, and we want to prepare you. What you need to do on questions like this is bring to mind what you know about the issue being asked about, and to choose the answer that best applies. Here, you should bring to mind what you know about the relationship between reliability and validity: For a test to have high validity, it must be reliable; however, for a test to have high reliability, it does not necessarily have to be valid. With this in mind, you should see that "high validity assumes high reliability" is the best answer. This means that a precondition of high validity is high reliability. The second best choice states that low reliability assumes low validity. This is a true statement if you interpret the word "assume" to mean "implies" or "predicts." But if you interpret the word "assume" to mean "depends on" or "is preconditioned by," the statement is not correct.

Justification for the use of a selection procedure or battery in a new setting without conducting a local validation research study is referred to as: Select one: A.synthetic validity. B.validity generalization. C.transportability. D.meta-analysis

Correct Answer is: B Validity generalization, or generalized evidence of validity, is evidence of validity that generalizes to setting(s) other than the setting(s) in which the original validation evidence was documented and is accumulated through strategies such as synthetic validity/job component validity*, transportability* and meta-analysis* (* incorrect choices). Synthetic validity/job component validity is based on previous demonstration of the validity of inferences from scores on the selection procedure or battery with respect to one or more domains of work (job components). Transportability refers to a strategy for generalizing evidence of validity in which demonstration of important similarities between different work settings is used to infer that validation evidence for a selection procedure accumulated in one work setting generalizes to another work setting.

A predictor that is highly sensitive for identifying the presence of a disorder would most likely result in: Select one: A.measurement error B.type II error C.a high number of false positives D.a high number of false negatives

Correct Answer is: C A predictor that is highly sensitive will more likely identify the presence of a characteristic; that is, it will result in more positives (true and false). This may be desirable when the risk of not detecting a problem is high. For example, in the detection of cancer, a blood test that results in a high number of false positives is preferable to one that has many false negatives. A positive test result can then be verified by another method, for example, a biopsy. Measurement error is the part of test scores which is due to random factors. Type II error is an error made when an experimenter erroneously accepts the null hypothesis.

Which statement is most true about validity? Select one: A.Validity is never higher than the reliability coefficient. B.Validity is never higher than the square of the reliability coefficient. C.Validity is never higher than the square root of the reliability coefficient. D.Validity is never higher than 1 minus the reliability coefficient.

Correct Answer is: C A test's reliability sets an upper limit on its criterion-related validity. Specifically, a test's validity coefficient can never be higher than the square root of its reliability coefficient. In practice, a validity coefficient will never be that high, but, theoretically, that's the upper limit.
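The ceiling is simple to compute; the reliability values below are arbitrary examples:

```python
import math

def max_validity(reliability):
    """Theoretical upper bound on criterion-related validity:
    validity can never exceed sqrt(reliability)."""
    return math.sqrt(reliability)

print(max_validity(0.81))  # sqrt(.81) = .9
print(max_validity(0.25))  # even reliability of .25 permits validity up to .5
```

Note the contrast with the reliability coefficient itself, which (as discussed above) is interpreted directly rather than squared.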

If a predictor test has a validity of 1.0, the standard error of estimate would be Select one: A.equal to the standard deviation of the criterion measure. B.1 C.0 D.unknown; there is not enough information to answer the question.

Correct Answer is: C A validity coefficient and the standard error of estimate are both measures of the accuracy of a predictor test. A validity coefficient is the correlation between scores on a predictor and a criterion (outcome) measure. A coefficient of 1.0 reflects a perfect correlation; it would mean that one would always be able to perfectly predict, without error, the scores on the outcome measure. The standard error of estimate indicates how much error one can expect in the prediction or estimation process. If a predictor test has perfect validity, there would be no error of estimate; you would always know the exact score on the outcome measure just from the score on the predictor. Therefore, the closer the validity coefficient is to 1.0, the smaller the value of the standard error of estimate, and if the coefficient were 1.0, the standard error of estimate would be 0.
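The standard formula is SEE = SD_criterion * sqrt(1 - r^2), where r is the validity coefficient. A quick sketch with invented values:

```python
import math

def standard_error_of_estimate(sd_criterion, validity):
    """SEE = SD_y * sqrt(1 - r_xy**2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

print(standard_error_of_estimate(10, 1.0))  # 0.0 -- perfect validity: no prediction error
print(standard_error_of_estimate(10, 0.0))  # 10.0 -- no validity: error equals the criterion's SD
```

The two endpoints bracket the answer choices: with r = 1.0 the SEE is 0, and with r = 0 the SEE equals the standard deviation of the criterion measure.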

Likert scales are most useful for: A.dichotomizing quantitative data B.quantifying objective data C.quantifying subjective data D.ordering categorical data

Correct Answer is: C Attitudes are subjective phenomena. Likert scales indicate the degree to which a person agrees or disagrees with an attitudinal statement. Using a Likert scale, attitudes are quantified, that is, represented in terms of ordinal scores.

Which of the following item difficulty levels maximizes discrimination among test-takers? Select one: A.0.1 B.0.25 C.0.5 D.0.9

Correct Answer is: C If a test item has an item difficulty level of .50, this means that 50% of examinees answered the item correctly. Therefore, items with this difficulty level are most useful for discriminating between "high scoring" and "low scoring" groups.
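One common way to quantify this is the discrimination index D: the proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group. The response data below are invented (1 = correct, 0 = incorrect):

```python
def proportion_correct(responses):
    return sum(responses) / len(responses)

def discrimination_index(upper_group, lower_group):
    """D = p(upper) - p(lower); larger D means better discrimination."""
    return proportion_correct(upper_group) - proportion_correct(lower_group)

# An item most high scorers get right and most low scorers get wrong
print(discrimination_index([1, 1, 1, 0], [1, 0, 0, 0]))  # 0.75 - 0.25 = 0.5
```

A difficulty level of .50 leaves the most room for the two groups to differ; if nearly everyone (or no one) passes an item, D is forced toward 0.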

Which of the following would be used to measure the internal consistency of a test? Select one: A.kappa coefficient B.test-retest reliability C.split-half reliability D.alternate forms reliability

Correct Answer is: C Internal consistency is one of several types of reliability. As its name implies, it is concerned with the consistency within a test, that is, the correlations among the different test items. Split-half reliability is one of the measures of internal consistency and involves splitting a test in two and correlating the two halves with each other. Other measures of internal (inter-item) consistency are the Kuder-Richardson Formula 20 (for dichotomously scored items) and Cronbach's coefficient alpha (for multiple-scored items). Test-retest reliability is not concerned with internal consistency, but rather, the stability of a test over time, and uses the correlations of scores between different administrations of the same test. Alternative forms reliability is concerned with the equivalence of different versions of a test. And the kappa coefficient is used as a measure of inter-rater reliability, that is, the amount of agreement between two raters.

Which of the following techniques would be most useful for combining test scores when poor performance on one test can be offset by excellent performance on another: Select one: A.multiple baseline B.multiple hurdle C.multiple regression D.multiple cutoff

Correct Answer is: C Multiple regression is the preferred technique for combining test scores in this situation because it is compensatory: a low score on one test can be offset (compensated for) by high scores on other tests. Multiple baseline is not a method for combining test scores. Multiple hurdle and multiple cutoff are noncompensatory techniques.

If an examinee correctly guesses the answers to a test, the reliability coefficient: Select one: A.is not affected B.stays the same C.decreases D.increases

Correct Answer is: C Guessing is one of the factors that affect the reliability coefficient: correct guesses add random error to scores, which decreases the reliability coefficient.

Form A is administered to a group of employees in Spring and then again in Fall. Using this method, what type of reliability is measured? A.split-half B.equivalence C.stability D.internal consistency

Correct Answer is: C Test-retest reliability, or the coefficient of stability, involves administering the same test to the same group on two occasions and then correlating the scores. split-half Split-half reliability is a method of determining internal consistency reliability. equivalence Alternative forms reliability, or coefficient of equivalence, consists of administering two alternate forms of a test to the same group and then correlating the scores. internal consistency Internal consistency reliability utilizes a single test administration and involves obtaining correlations among individual test items.

A way to define the criterion in regard to determining criterion-related validity is that the criterion is: Select one: A.The predictor test B.The validity measure C.The predictee D.The content.

Correct Answer is: C To determine criterion-related validity, scores on a predictor test are correlated with an outside criterion. The criterion is that which is being predicted, or the "predictee."

When establishing a test's reliability, all other things being equal, which is likely to be the lowest magnitude? Select one: A.split-half B.Cronbach's alpha C.alternate forms D.test-retest

Correct Answer is: C You probably remember that the alternate forms coefficient is considered by many to be the best reliability coefficient to use when practical (if you don't, commit this factoid to memory now). Everything else being equal, it is also likely to have a lower magnitude than the other types of reliability coefficients. The reason for this is similar to the reason why it is considered the best one to use. To obtain an alternate forms coefficient, one must administer two forms of the same test to a group of examinees, and correlate scores on the two forms. The two forms of the test are administered at different times and (because they are different forms) contain different items or content. In other words, there are two sources of error (or factors that could lower the coefficient) for the alternate forms coefficient: the time interval and different content (in technical terms, these sources of error are referred to respectively as "time sampling" and "content sampling"). The alternate forms coefficient is considered the best reliability coefficient by many because, for it to be high, the test must demonstrate consistency across both a time interval and different content.

If a job selection test has lower validity for Hispanics as compared to White or African-Americans, you could say that ethnicity is acting as a: Select one: A.confounding variable B.criterion contaminator C.discriminant variable D.moderator variable

Correct Answer is: D A moderator variable is any variable which moderates, or influences, the relationship between two other variables. If the validity of a job selection test is different for different ethnic groups (i.e. there is differential validity), then ethnicity would be considered a moderator variable since it is influencing the relationship between the test (predictor) and actual job performance (the criterion). A confounding variable is a variable in a research study which is not of interest to the researcher, but which exerts a systematic effect on the DV. Criterion contamination is the artificial inflation of validity which can occur when raters subjectively score ratees on a criterion measure after they have been informed how the ratees scored on the predictor.

A percentage score, as opposed to a percentile rank, is based on: Select one: A.Total number of items B.An examinee's score in comparison to other examinees' scores C.That there are one hundred test items D.The number of items answered correctly

Correct Answer is: D A percentage score indicates the number of items answered correctly. A percentile rank compares one examinee's score with all other examinees' scores.

The sensitivity of a screening for a psychological disorder refers to Select one: A.the ratio of correct to incorrect diagnostic decisions its use results in. B.the proportion of correct diagnostic decisions its use results in. C.the proportion of individuals without the disorder it identifies. D.the proportion of individuals with the disorder it identifies.

Correct Answer is: D In any test used to make a "yes/no" decision (e.g., screening tests, medical tests such as pregnancy tests, and job selection tests in some cases), the term "sensitivity" refers to the proportion of correctly identified cases--i.e., the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic. You can also conceptualize sensitivity in terms of true positives and false negatives. A "positive" on a screening test means that the test identified the person as having the condition, while a "negative" is someone classified by the test as not having the condition. The term true and false in this context refer to the accuracy or correctness of test results. Therefore, sensitivity can be defined as the ratio of true positives (people with the condition whom the test correctly detects) to the sum of true positives and false negatives (all the examinees who have the condition).
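In code, sensitivity (and its counterpart, specificity) are simple ratios of those classification counts. The counts below are invented for illustration:

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of people who HAVE the condition that the test correctly flags:
    TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Proportion of people who LACK the condition that the test correctly clears:
    TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)

# 90 of 100 actual cases detected; 10 slip through as false negatives
print(sensitivity(90, 10))  # 0.9
print(specificity(80, 20))  # 0.8
```

This also clarifies the earlier drug-screening item: a highly sensitive test drives false negatives down, typically at the cost of more false positives.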

Researchers are interested in detecting differential item functioning (DIF). Which method would not be used? A.SIBTEST B.Mantel-Haenszel C.Lord's chi-square D.cluster analysis

Correct Answer is: D In the context of item response theory, differential item functioning (DIF), or item bias analysis, refers to a difference in the probability of making a correct or positive response to an item between individuals from different subpopulations who are equal on the latent, or underlying, attribute measured by the test. The SIBTEST (simultaneous item bias test), Mantel-Haenszel, and Lord's chi-square are statistical techniques used to identify DIF. Cluster analysis is a statistical technique used to develop a classification system or taxonomy; this method wouldn't detect item bias or differences.

Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly? Select one: A.criterion-related validity coefficient B.item discrimination index C.item difficulty index D.item characteristic curve

Correct Answer is: D Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and probability that the item will be guessed correctly.

Of the following, the highest rate of suicide occurs among Select one: A.married persons. B.never-married persons. C.widowed persons. D.divorced persons.

Correct Answer is: D Marriage, especially when reinforced with children, appears to lessen the risk of suicide. Among married people, the rate of suicide is about 11 per 100,000. This rate is higher for single, never-married persons (about 22 per 100,000), even higher for widows (24 per 100,000), and higher still for divorced individuals (40 per 100,000).

As a measure of test reliability, an internal consistency coefficient would be least appropriate for a Select one: A.vocational aptitude test. B.intelligence test. C.power test. D.speed test.

Correct Answer is: D Tests can be compared to each other in terms of whether they emphasize power or speed. A pure speed test contains relatively easy items and has a strict time limit; it is designed to measure examinees' speed of response. A pure power test supplies enough time for most examinees to finish and contains items of varying difficulty. Power tests are designed to assess examinees' knowledge or ability in whatever content domain is being measured. Many tests measure both power and speed. An internal consistency reliability coefficient measures the correlation of responses to different items within the same test. On a pure speed test, all items answered are likely to be correct. As a result, the correlation between responses is artificially inflated; therefore, for speed tests, other measures of reliability, such as test-retest or alternate forms, are more appropriate.

Kuder-Richardson reliability applies to Select one: A.split-half reliability. B.test-retest stability. C.Likert scales. D.tests with dichotomously scored questions.

Correct Answer is: D The Kuder-Richardson formula is one of several statistical indices of a test's internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).

The slope of the item response curve, with respect to item response theory, indicates an item's: Select one: A.reliability B.validity C.difficulty D.discriminability

Correct Answer is: D The item response curve provides information about an item's difficulty; ability to discriminate between those who are high and low on the characteristic being measured; and the probability of correctly answering the item by guessing. The position of the curve indicates its difficulty* and the steeper the slope of the item response curve, the better its ability to discriminate (correct response) between examinees who are high and low on the characteristic being measured. The item response curve does not indicate reliability* or validity* (* incorrect options).
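A hedged sketch of a three-parameter logistic item characteristic curve (parameter names are illustrative): the location parameter b sets difficulty, the slope parameter a sets discriminability, and c sets the guessing floor.

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response
    given ability theta, discrimination a, difficulty b, guessing c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# A larger a gives a steeper curve, i.e., sharper separation of examinees
# just above versus just below the item's difficulty b.
print(np.round(icc(theta, a=2.0, b=0.0), 2))  # steep: strong discrimination
print(np.round(icc(theta, a=0.5, b=0.0), 2))  # flat: weak discrimination
```

Both items have the same difficulty (probability passes 0.5 at theta = 0), but the high-a item's probability of success changes much faster around that point, which is exactly what "discriminability" means here.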

In a clinical trial of a new drug, the null hypothesis is the new drug is, on average, no better than the current drug. It is concluded that the two drugs produce the same effect when in fact the new drug is superior. This is: Select one: A.corrected by reducing the power of the test B.corrected by reducing the sample size C.a Type I error D.a Type II error

Correct Answer is: D A Type II error occurs when the null hypothesis is not rejected when it is in fact false. A Type I error, wrongly rejecting a true null hypothesis, is often considered more serious; in the clinical trial of a new drug, a Type I error would be concluding that the new drug was better when in fact it was not. For a fixed sample size, Type I and Type II errors are inversely related: as the probability of a Type I error increases, the probability of a Type II error decreases, and vice versa.
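A small Monte Carlo sketch (the effect size, sample size, and one-sided z test are all assumptions of mine) illustrates the trade-off: when the new drug really is better, lowering alpha raises the Type II error rate beta.

```python
import numpy as np

rng = np.random.default_rng(2)
Z_CRIT = {0.05: 1.645, 0.01: 2.326}  # one-sided normal critical values

def type2_rate(alpha, effect=0.3, n=50, reps=2000):
    """Fraction of simulated trials that fail to reject H0 (a Type II error)
    when the new drug truly outperforms the old one by `effect` SDs."""
    misses = 0
    for _ in range(reps):
        old = rng.normal(0.0, 1.0, n)
        new = rng.normal(effect, 1.0, n)
        z = (new.mean() - old.mean()) / np.sqrt(2.0 / n)  # known unit variances
        if z < Z_CRIT[alpha]:
            misses += 1
    return misses / reps

beta_05, beta_01 = type2_rate(0.05), type2_rate(0.01)
print(f"beta at alpha=.05: {beta_05:.2f}; beta at alpha=.01: {beta_01:.2f}")
```

The stricter alpha = .01 threshold demands stronger evidence before rejecting H0, so more genuinely superior drugs are missed; increasing the sample size (not reducing it) is what lowers both error rates at once.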

In the multitrait-multimethod matrix, a large heterotrait-monomethod coefficient would indicate: Select one: A.low convergent validity. B.high convergent validity. C.high divergent validity. D.low divergent validity.

Correct Answer is: D Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low.

Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.
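The inflation of a heterotrait-monomethod correlation by shared method variance can be sketched with simulated data (the trait and method names are illustrative): depression and anxiety are generated as uncorrelated traits, but the two self-report measures share a response bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
dep, anx = rng.normal(size=n), rng.normal(size=n)  # uncorrelated true traits
self_report_bias = rng.normal(size=n)              # shared method variance

# Four measures: two traits x two methods (self-report vs. clinician rating).
dep_self = dep + 0.8 * self_report_bias + rng.normal(0, 0.5, n)
anx_self = anx + 0.8 * self_report_bias + rng.normal(0, 0.5, n)
dep_clin = dep + rng.normal(0, 0.5, n)
anx_clin = anx + rng.normal(0, 0.5, n)

# Heterotrait-monomethod: different traits, same method (inflated by bias).
r_hm = np.corrcoef(dep_self, anx_self)[0, 1]
# Heterotrait-heteromethod: different traits, different methods (near zero).
r_hh = np.corrcoef(dep_self, anx_clin)[0, 1]
print(f"heterotrait-monomethod r = {r_hm:.2f}, heterotrait-heteromethod r = {r_hh:.2f}")
```

Even though the traits are unrelated by construction, the two self-report scores correlate substantially because both carry the same method variance, which is exactly the pattern a large heterotrait-monomethod coefficient flags as low divergent validity.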

