Modules

Identify three uses of a reliability coefficient.

According to Revelle and Condon (2019), reliability coefficients can be used to estimate expected scores, provide confidence intervals around expected scores, and correct for attenuation.
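
To make the first of these uses concrete, here is a minimal Python sketch (not from the text) of Kelley's formula for estimating an examinee's expected true score from a reliability coefficient; the reliability, group mean, and observed score below are hypothetical values chosen only for illustration.

def estimated_true_score(observed, mean, reliability):
    """Kelley's regressed estimate of the true score."""
    return mean + reliability * (observed - mean)

if __name__ == "__main__":
    rxx = 0.80         # hypothetical reliability coefficient
    group_mean = 50.0  # hypothetical test mean
    x = 65.0           # hypothetical observed score
    print(estimated_true_score(x, group_mean, rxx))  # 62.0, regressed toward the mean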

Why is reliability important in psychology and educational testing?

Consistency in measurement is a prerequisite for meaningful scientific inquiry. Without reliable measurement, conclusions drawn from research would be meaningless.

Messick (1995b) identified six aspects of construct validation. Choose any three of these aspects to discuss how Messick's conceptualization has extended your awareness of the meaning of construct validation.

A number of correct responses are possible for this item. Responses should correctly explain the three aspects of validation selected.

Identify two established measures that could be used (other than those discussed previously) to examine the convergent validity of the affective empathy scale discussed above.

A number of correct responses are possible. The response should provide measures that assess constructs that are theoretically related to the construct of affective empathy.

Identify two established measures that could be used to examine the discriminant validity of the affective empathy measure.

A number of correct responses are possible. The response should provide measures that assess constructs that are theoretically unrelated to the construct of affective empathy.

Consider a test or inventory of your choosing. If you wanted to examine the content validity of this measure, how would you go about choosing experts to provide judgments?

A number of possible acceptable responses may be provided. The choice of experts should be either logical for the given test or inventory, or should be fully explained.

Explain how a researcher could conduct a "study of process" to provide evidence of the construct validity of test scores.

A study of process involves an examination of the thought process by which a respondent derives an answer. One way a researcher could conduct a study of process is to conduct a pilot study in which each participant is asked to verbally report his or her entire thought process in the development of a response to each test item. Using such a methodology, a researcher may detect issues that would reduce the construct validity of the test.

Is correcting for guessing appropriate in college-level courses where most individuals will not be guessing randomly, but rather will almost always be able to eliminate one or more distracter options?

A true random-guessing formula will most likely undercorrect for guessing in this situation. The key to correcting for guessing in classroom situations is that you have to make it crystal clear to the test takers that a correction-for-guessing formula will be applied. However, savvy students will realize that the formula assumes random guessing, so if they can confidently eliminate at least one of the options it is to their advantage to guess. Since the instructor can't control for this, the correction is rarely employed in classroom situations.
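
For reference, the classic correction-for-guessing formula assumed in this discussion (rights minus wrongs divided by the number of options minus one) can be sketched in a few lines of Python; the scores below are hypothetical.

def corrected_score(rights, wrongs, n_options):
    """Classic correction for purely random guessing on multiple-choice items."""
    return rights - wrongs / (n_options - 1)

# A student with 40 right and 10 wrong on 4-option items (omits not counted):
print(corrected_score(40, 10, 4))  # 36.67 -- assumes guessing is truly random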

In your own words, explain the concept of a true score.

A true score is a hypothetical value of a trait for an individual on a particular test or measure. A true score is the average score of the theoretical distribution of raw scores that would be obtained from an infinite number of independent test administrations for the individual.

It was noted that if a test taker can eliminate at least one of the distracters then corrections for guessing underestimate the extent of guessing. Is it possible to overestimate the extent of guessing with corrections formulas? If so, how?

About the only way that this can happen is when either there is more than one correct answer or there is a distracter option that is particularly attractive to test takers, thus more test takers select that particular distracter than any other option, including the keyed answer.

How do you determine the difficulty, discrimination, and pseudo-guessing parameters in IRT? How are they different from CTT-IA?

All three parameters are statistically estimated using complex nonlinear computations. In IRT, models are developed and the fit of those models is assessed. Typically, the number of parameters estimated (1, 2, or 3) defines the model. Thus, one may start with a three-parameter model; if that model does not fit the data very well, a two-parameter model may be estimated, and if that does not work well, a one-parameter model may be estimated. The more parameters that are estimated, the stronger the rationale required for estimating the additional parameters. If you have polytomous items, the models are more complicated.

What are some of the major assumptions of CTT?

Among the most important assumptions of CTT: (a) Error is random. Thus, the average of error scores is zero. (b) True scores are uncorrelated with error scores.
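
A small simulation (not from the text) can illustrate these two assumptions: observed scores are modeled as true score plus error, with error averaging zero and uncorrelated with true scores. The means, standard deviations, and sample size below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(50, 10, size=100_000)   # hypothetical true scores
errors = rng.normal(0, 5, size=100_000)          # random error with mean zero
observed = true_scores + errors                  # X = T + E

print(round(errors.mean(), 3))                        # approximately 0
print(round(np.corrcoef(true_scores, errors)[0, 1], 3))  # approximately 0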

How does use of an MTMM matrix provide evidence of the construct validity of test scores?

An MTMM matrix allows visual inspection of the patterns of relationships between constructs that are theoretically similar and constructs that are theoretically dissimilar. Further, the constructs included in an MTMM matrix are each measured in a number of ways. The correlations between theoretically similar constructs assessed by different methods (the validity diagonals) should exceed the relationships between different traits, whether those traits are measured using different methods or the same method of data collection. High correlations on the validity diagonals provide evidence of convergent validity. If the magnitudes of these correlations exceed the correlations in the heterotrait-heteromethod and heterotrait-monomethod triangles, we also have evidence of discriminant validity.

What factors would you consider to ensure that you have an appropriate criterion?

An appropriate criterion will be reliable, practical, logically related to the test under consideration, and allow for discrimination (i.e., not everyone receives the same score).

What is the difference between an item difficulty index and an item discrimination index?

An item difficulty index (e.g., p-value) gives an indication of what percentage of test takers answered an item correctly. The higher the percentage, the easier the item was for the test takers. The item discrimination index (e.g., the point-biserial correlation between whether test takers answered a particular item correctly and their total test score) provides an index of the degree to which an item distinguishes between high- and low-scoring test takers. A positive item discrimination index indicates that those test takers answering the item correctly were more likely to do well on the test overall; this is a good thing! The higher the value, the better the item discriminates among test takers. A negative item discrimination index would be undesirable, as it would indicate that those test takers who answered an item correctly actually did worse on the test overall.
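
A minimal Python sketch of both indexes for a single item follows; the 0/1 item responses and total test scores are invented solely for illustration.

import numpy as np

item = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])              # 1 = correct, 0 = incorrect
total = np.array([28, 25, 14, 22, 11, 27, 24, 16, 26, 20])   # total test scores

p_value = item.mean()                              # difficulty: proportion answering correctly
point_biserial = np.corrcoef(item, total)[0, 1]    # discrimination: item-total correlation

print(round(p_value, 2), round(point_biserial, 2))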

What are the best ways to reduce response biases? Response styles?

As noted above, it is much more difficult to deal with response styles than with response biases. In either case, however, it is much better to prevent the problem in the first place than to try to "correct" for the response artifact after the fact. Thus, strategies such as reducing social desirability, wording questions properly, and using forced-choice items can help to dramatically reduce response biases.

What is the difference between response biases and response styles?

As noted in the module overview, response biases are measurement artifacts that emerge from the testing context. Thus, rater training and proper instructions can often prevent such biases. However, response styles are not context specific as they represent general ways of responding (e.g., nay-saying) regardless of the context. Thus, it would be more difficult to prevent or reduce response styles.

Based on the data in Figure 11.1, what would have happened if we had used a common regression line to predict suicide risk in all three age groups?

Assuming that the groups are approximately equal in number, it appears that we would underpredict the risk of suicide for the young and the old, while overpredicting the risk of suicide for individuals in middle age. However, if, for example, there were many more middle-aged individuals than young or old individuals, the common regression line would be "pulled" toward the current middle-aged line displayed in Figure 11.1. As a result, the same pattern of over- and underprediction would occur, but it would be more pronounced for the young and the old and less so for the middle-aged respondents.

What would you do if the expected dimensionality of your scale were very different from the results suggested by your factor analysis?

Assuming the sample size was adequate relative to the number of items factor analyzed, it would be appropriate to conclude that the respondents viewed the scale dimensionality differently than the test developer expected. Close inspection of the items composing the factors may reveal a fresh understanding of the construct. Of course, replication of the factor structure in a second sample would add confidence to the newfound dimensionality.

What other personality characteristics, besides risk taking, do you think would be associated with guessing on multiple-choice tests?

Back in Module 12 we discussed the issue of "test-wiseness." Students who are test-wise will be able to guess correctly more often than students who do not possess these skills. They will tend to be more confident in their guessing as well. Personality characteristics such as impulsiveness, openness to experience, or intuitiveness (individuals who like to go with their gut feelings) may also be associated with guessing. Your students can probably come up with many more potential personal characteristics related to the tendency to guess on multiple-choice exams.

Describe the process of back translation. Why is back translation insufficient to guarantee equivalence?

Back translation refers to tests that have been translated by bilingual individuals from English to the target language, and then re-translated back into English by other bilingual individuals. If the re-translated English version of the test is highly similar to the original English version, then the target language test version is considered to be acceptable. Back translation fails to guarantee test equivalence because equivalence refers to more than mere language equivalency. Idiomatic expressions are very difficult to translate into another language. Concepts that are important in one culture may be meaningless in other cultures. Further exploration of equivalence is necessary.

What does it mean to say that item and person parameters are invariant (i.e., locally independent) in IRT models, but not in CTT-IA?

Basically it means that item and person parameters generalize from one population to another. That is, the parameters are not dependent on the sample used to compute the item statistics. Thus, with CTT-IA an item may be identified as difficult or easy depending on the sample used to compute the CTT-IA statistics. With IRT procedures, however, the nature of the sample does not determine the difficulty: regardless of whether low- or high-ability test takers are used, the same item parameters will be estimated. There are some boundary conditions on invariance in IRT, though the parameter estimates are more invariant than in CTT.

Why are we more likely to use estimation rather than statistical significance testing in applied psychological measurement?

Because we are more interested in estimating individuals' true scores on a given construct, we would be most interested in creating confidence intervals (a form of estimation). We do occasionally compare groups (control versus experimental or men versus women) and thus use statistical significance testing, but even in that case, we typically follow up the statistical significance tests with a point estimate of effect size.

Why is common method variance (CMV) a concern in construct validation studies that involve correlation matrices?

CMV is a problem that results in inflated correlations between measures due to use of the same method to collect data for each variable. If the same method is used to collect data on each variable, the resulting correlations between variables may be due to response bias rather than the actual relationships between constructs of interest.

Correlations between what elements of an MTMM matrix would provide the best assessment of CMV?

CMV would be detected by high correlations between theoretically dissimilar traits measured by the same method of data collection. These are referred to as the heterotrait-monomethod triangles in an MTMM.

How is Cohen's kappa different from the other forms of reliability?

Cohen's kappa is different in that, first, it focuses on the raters and not the test takers or the test itself. Second, it requires that the data be dichotomous in nature; that is, the test takers are rated as being in either one category or another. If that is how the data were collected originally, then you are left with little choice but to compute Cohen's kappa to assess inter-rater reliability. The problem comes in when data analysts dichotomize continuous ratings so that they can be assessed using Cohen's kappa. In this latter instance, one is better served by running an intraclass or interclass correlation to assess inter-rater reliability, as it will have more power and precision.
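
The standard kappa computation (observed agreement corrected for chance agreement) can be sketched briefly in Python; the two raters' dichotomous ratings below are hypothetical.

import numpy as np

rater1 = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])   # invented category assignments
rater2 = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 1])

observed_agreement = np.mean(rater1 == rater2)
# Chance agreement based on each rater's marginal proportions
p1, p2 = rater1.mean(), rater2.mean()
expected_agreement = p1 * p2 + (1 - p1) * (1 - p2)

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 2))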

What concerns might you have in using a concurrent or postdictive criterion-related validation design?

Concurrent and postdictive criterion-related validity studies are often based on samples with restricted variance on the variables of interest. This restriction in range can reduce the magnitude of the resulting correlation. Further, the sample used to validate the test may differ substantially from the population to whom the researcher wishes to generalize the results of the study.

Would content validity alone provide sufficient evidence for validity for a(n) (a) employment exam? (b) extraversion inventory? (c) test to determine the need for major surgery? In each case, provide an argument for your reasoning.

Content validation alone would not provide convincing evidence of the accuracy of any of the tests presented in (a), (b), or (c). However, practical limitations and the importance of the testing should be considered in each case. Given monetary and time restrictions, organizations sometimes rely solely on content validation as a means of examining the validity of a selection test. A measure of extraversion intended primarily for research may initially be subjected to content validation, but additional evidence of test validity would eventually be needed. Any test used for making decisions as important as determining whether surgery is the preferred treatment should undergo rigorous validation that goes well beyond content validation alone.

Validity is a unified construct. In what ways does content validity provide validity evidence?

Content validation provides evidence that knowledgeable experts agree that the items on the test are reasonable for the intended purpose of the test. These subject matter experts (SMEs) can make judgments regarding the degree to which the items that compose the test are representative of the content domain. This is only one aspect of test validation, but certainly it is an important aspect.

Would it be more appropriate to adopt a content validation approach to examine a final exam in a personality psychology course or to examine a measure of conscientiousness? Explain.

Content validation relies heavily on the judgment of experts. Further, the content domain must be clearly defined. While subject matter experts can be found for both an academic exam and a measure of personality, the content domain is more clearly and concretely defined for an academic exam. Therefore, a content validation approach is more appropriate for the final exam in a personality psychology course than for a measure of conscientiousness.

What did Cronbach and Meehl (1955) mean by the term "nomological network"?

Cronbach and Meehl (1955) recognized that tests are intended to measure abstract constructs. They posited that we must develop a theory to define our construct of interest. This theory then defines the pattern of expected relationships between our construct and other constructs, our construct and other measures, and between our measure of the construct and other measures. Together, this expected pattern of relationships is referred to as a nomological network.

Which sources of error tend to decrease the reliability of a measure? Which source of error tends to lead to an overestimate of the reliability of a measure?

Most sources of error attenuate the observed reliability of the measure. However, memory effects can inflate an observed reliability coefficient.

How would we scale multiple dimensions at one time?

Multidimensional scaling procedures allow us to scale multiple factors at once. They do not necessarily allow us to scale different dimensions (as discussed in question 3) at one time; instead they allow us to scale individuals (a single dimension) on multiple factors or dimensions.

How do descriptive statistics and standardized scores allow us to interpret a set of test scores? Why?

Descriptive statistics allow us to summarize an entire set of (test) scores. By creating frequency distributions, histograms, and polygons we can display the test data. Point estimates such as means and standard deviations also help us summarize a group of test data. However, standardizing the test scores allows us to compare the test scores from different distributions directly. For example, a T-score of 60 (one standard deviation above the mean) is instantly recognized as being a higher score relative to a z-score of .75 (three quarters of one standard deviation above the mean).
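
A short Python sketch of this standardization step follows; the raw scores are invented, and the T-score convention (mean 50, SD 10) matches the example in the answer.

import numpy as np

raw = np.array([12, 15, 18, 20, 22, 25, 30])   # hypothetical raw test scores
z = (raw - raw.mean()) / raw.std(ddof=1)       # z-scores: mean 0, SD 1
t = 50 + 10 * z                                # T-scores: mean 50, SD 10

print(np.round(z, 2))
print(np.round(t, 1))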

Under what conditions might you choose to use PCA? EFA?

Both principal components analysis (PCA) and exploratory factor analysis (EFA) are used to determine the dimensionality of a data set when there are no a priori expectations regarding the measure's dimensionality. PCA is used to extract the total variance from a set of items, while EFA is more appropriate for identifying the common variance attributable to latent constructs.
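
As a hedged illustration of the distinction, the sketch below runs both procedures on the same (randomly generated, purely illustrative) item data using scikit-learn; it is a sketch of the workflow, not an analysis from the text.

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(1)
items = rng.normal(size=(300, 6))                 # 300 respondents, 6 hypothetical items

pca = PCA(n_components=2).fit(items)              # decomposes total variance
efa = FactorAnalysis(n_components=2).fit(items)   # models common (shared) variance

print(pca.explained_variance_ratio_)              # variance accounted for by each component
print(efa.components_)                            # unrotated factor loadings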

The various criterion-related validity research designs might not be equally appropriate for a given situation. For each of the following criterion-related validity designs, provide an example situation in which that design might be preferred: a) Predictive b) Concurrent c) Postdictive

Examples:

• Predictive - Examining whether GRE verbal scores are predictive of performance in a psychology graduate program. Use of the predictive design would require prospective graduate students to take the GRE verbal. However, students would be selected on the basis of some other factor, such as undergraduate GPA. Once the new graduate students had established a record of performance, GRE verbal scores could be correlated with the measure of graduate student performance. If a good correlation existed between the test and the criterion, GRE verbal scores could be used to predict the graduate performance of future applicants. This procedure would reduce concerns with restriction of range on test scores in comparison to a concurrent design.

• Concurrent - An organization needs to fill 20 job openings in a limited amount of time. Since previous selection tests used by the organization have been deemed to be of questionable legality, the company creates six new selection tests. By administering the new tests to current employees in the job, and then collecting job performance criterion information on these employees, the organization can quickly tell which of the six tests are most highly related to job performance. While restriction in range on test scores is a likely concern, adoption of the concurrent design allows the company to quickly implement a new selection system.

• Postdictive - An instructor may wonder whether a student's intelligence influences the overall letter grade the student receives in fourth grade. Since all students at her school take a standardized intelligence test in the fifth grade, the teacher obtains scores on this test and correlates them with the grades assigned to these same students while they were in fourth grade to determine whether intelligence is a major factor in grade attainment.

Does face validity establish content validity? Explain your answer.

Face validity has generally been regarded as inferior to the broader notion of content validity, since test takers often do not have adequate expertise to judge the relevance of a test. However, research has provided evidence that test takers' perceptions of a test influence subsequent test performance. Therefore, while face validity may not provide expert judgment of the appropriateness of items, it certainly is an important element in the validation process.

What other factors, besides guessing, might contribute to extremely low or high levels of variability in knowledge test scores?

Factors such as the average ability level of the test takers for a given administration (i.e., lots of bright, or not so bright, students that semester), the study habits of the test takers, and contextual factors such as whether a study guide was provided and how well the instructor covered the material may all contribute to extremely high or low levels of variability in knowledge test scores.

How does the criterion-related approach to test validation help provide evidence of the accuracy of the conclusions and inferences drawn from test scores?

For a test to be useful, scores on the test should relate to a criterion of interest. Criterion-related validity examines the degree to which test scores are correlated with criterion scores. If test scores were unrelated to scores on a criterion of interest, then we'd put little credence in the conclusions drawn from test scores.

Under what conditions might we want a very high reliability coefficient?

Higher reliability is necessary when the stakes of a correct decision are higher. Medical decisions, for example, should be based on highly reliable measures.

In conducting an EFA, what would you do if a factor in the rotated factor (or pattern) matrix were composed of items that seem to have nothing in common from a rational or theoretical standpoint?

If the items loading on a factor seemingly have no theoretical relationship to one another, then either (a) the factor analysis could be re-computed specifying fewer factors, or (b) the items could be deleted from the scale.

What are some of the limitations of CTT?

In CTT, error is considered random. Therefore, systematic errors, or biases, are not considered in CTT. True scores are test dependent. Item statistics such as difficulty and discrimination are sample dependent.

Oftentimes in a classroom environment, you might have more students (respondents) than you have items. Does this pose a problem for interpreting your item analysis statistics?

In a strict sense, yes. If you have a small number of subjects per item, and especially when we have more items than subjects, the statistics that result can be very unstable. However, in a classroom situation you will have no other choice but to "use what you have." That said, one needs to remain conscious of the fact that the resulting item statistics may not be very stable and thus should not be given an inordinate amount of deference.

When might it be preferable to use CTT-IA instead of IRT?

In most cases, IRT models will provide stronger and more generalizable estimates of item parameters than CTT-IA procedures. However, when sample sizes are very small and generalizability is not a major concern, simpler CTT-IA procedures may be preferable (or the only real option). For example, in a classroom situation with 20 students, it would be foolish to run an IRT model to estimate item parameters, but CTT-IA might provide some useful information regarding which items are performing relatively well compared to others. Even then, with such a small sample we need to be cautious and recognize that the precision of the item statistics is less than optimal. As noted in the module overview, Ellis and Mead (2002) and Zickar and Broadfoot (2008) provide a very balanced comparison of CTT-IA and IRT estimation procedures.

Is there ever a time when a .25 p-value is good? How about a 1.00 p-value?

In most instances we are looking for difficulty indexes (e.g., p-values) in the moderate range (.40 to .60), because they allow for the highest possible discrimination among test takers (maybe a little higher if we account for guessing). However, there may be instances where we want extremely high or low p-values. For example, at the beginning of a test it is common to have a few easy items (with p-values at or near 1.00) to allow test takers to "warm up" and build confidence. As a result, we may need a few more difficult items to offset these easier items; however, those difficult items should have strong discrimination indexes to justify their use. In addition, if we are working with special populations (e.g., the gifted or the below average), we may need more items with extreme p-values to better distinguish test takers in these groups.

Which sources of error are of primary concern in test-retest reliability? In parallel forms and internal consistency reliability? In inter-rater reliability?

In test-retest reliability, the errors we are primarily concerned with include those impacting a change in the examinees, such as memory effects, true score fluctuation, and guessing. In parallel forms and internal consistency reliability, the error we are most concerned about is the degree to which the test forms are equivalent. In inter-rater reliability, the error source of particular interest is error in scoring.

What might inflate an observed correlation between test scores and criterion scores? Explain.

Inadequate sample size or criterion contamination may inflate an observed correlation. With a small sample, sampling error can produce a spuriously large correlation, and criterion contamination (e.g., criterion ratings influenced by knowledge of predictor scores) artificially inflates the relationship between test and criterion.

What factors might attenuate an observed correlation between test scores and criterion scores? Explain.

Inadequate sample size, unreliability in the predictor or the criterion, and restriction in range of the predictor, the criterion, or both can all contribute to attenuated correlation coefficients.

Can a test that is determined to be biased, still be a fair test? Alternatively, can a test that is determined to be unfair, still be an unbiased test?

It is hard to imagine a test known to be biased being viewed as fair. Predictive bias, by definition, means the same test scores are associated with different criterion scores for different groups. The only ones who would likely view the test as "fair" would be those benefiting from the bias in the test. Alternatively, a test that is viewed as "unfair" can show no indication of predictive bias. Thus, just because a test is labeled as "unfair," it can still be unbiased in the psychometric sense of predictive bias. Of course, there is also the issue of measurement bias, which is discussed in Module 20.

Why is it important to know the level of measurement of our data before we begin the scaling process?

Just as when we attempt to run any descriptive or inferential statistics, we must know the level of measurement of our variables before we map out any proposed analyses in terms of the scaling process. Higher levels of measurement allow us to perform more sophisticated scaling analyses on a set of test scores.

Provide an example of each of the four types of test equivalence identified by Lonner (1990).

Lonner identified four forms of equivalence: content, conceptual, functional, and scalar. Examples of each follow:

• Content equivalence (Are the items relevant to a new group of test takers?): A cognitive ability test that used a large number of sports-related analogies would likely be irrelevant to those uninterested in sports.

• Conceptual equivalence (Across cultures, is the same meaning attached to the terms used in the items?): Any item that includes an idiomatic expression is likely to translate poorly into another language.

• Functional equivalence (Do the behavioral assessments function equivalently across cultures?): Use of the Bem Sex Role Inventory with a group of current undergraduate students would be inappropriate due to the significant cultural changes in sex roles in the years since the publication of the original instrument.

• Scalar equivalence (Do different cultural groups achieve equivalent means and standard deviations on the test?): Very different mean scores can demonstrate a lack of scalar equivalence across two cultural groups.

Why are some authors (e.g., Cortina, 1993; Schmitt, 1996) cautious about the interpretation of coefficient alpha?

Many researchers mistakenly interpret a high alpha reliability coefficient as an indication of the unidimensionality of a test. However, as Cortina and Schmitt show, multidimensional tests can have high coefficient alpha values. Thus, if one wants to determine the dimensionality of a test, one must run a factor analysis (exploratory or confirmatory; see Module 18). A high coefficient alpha is an indication of internal consistency, not necessarily unidimensionality.

Why do you think that intercept bias is much more common than slope bias?

Most of the research seems to indicate that when differential or single-group validity is found, it is the result of small sample sizes for one of the groups and thus less power for that group. Precision analysis would indicate that the confidence intervals for the regression lines of the target and referent groups overlap substantially; thus, the test predicts equally well for both groups, and slope bias is rarely supported. Intercept bias, by contrast, occurs when one group scores systematically lower (or higher) on the test than the other without a corresponding difference on the criterion. Thus, whatever "bias" may be deflating test scores for one group may or may not also be at play in systematically lowering that group's criterion scores.

Why do you think it is so difficult to scale more than one dimension (i.e., people, stimuli, and responses) at once?

Most scaling procedures hold constant two dimensions and then attempt to scale the third dimension. The most prominent of these dimensions is the scaling of individuals. Typically, the items or responses themselves are scaled a single time, but as new individuals enter the mix, they need to be scaled as well. Hence, it is much more common to scale individuals (persons) than items or responses. The psychometric process becomes much more complex, however, when more than one dimension is scaled at one time. This is because of the potential for strong interaction effects between the individuals and the items and/or responses. In addition, much more complex multivariate statistical procedures are needed to model the data.

Is quantifying content validity through the use of the CVI, CVF, or other similar method necessary for establishing content validity? Explain.

No. Although quantifying content validity using these procedures can be useful, many other judgmental methods can also provide essential information regarding content validity. Indeed, many academic tests receive much less formal review by subject matter experts (such as another teacher), but even an informal evaluation of a test can be very useful for providing evidence of the test's validity.

Given that you cannot guess on short-answer essay questions, would they, by default, be more reliable?

Not necessarily. If we focus on inter-rater reliability, we would have some concerns regarding short answer essay questions that we would not have for multiple-choice questions.

How do you know whether to calculate the discrimination index (which contrasts extreme groups), the biserial correlation, or the point-biserial correlation coefficient as your item discrimination statistic?

Not that long ago, contrasting groups was a popular method of determining how discriminating an item was. Unfortunately, a lot of information is lost using this method, as only the bottom and top 27% or 33% of test takers are used to compute the index. Thus, the biserial and point-biserial methods (which use all the test scores) are much more informative, powerful, and precise. When the dichotomy (correct/incorrect) can be considered a "true" dichotomy, the point-biserial correlation works fine. However, in many instances the correct/incorrect dichotomy is an artificial dichotomy. That is, there is a true underlying continuous distribution of knowledge of the material assessed by a given item, but for ease of scoring we record a dichotomous outcome (correct/incorrect) instead of providing partial credit for non-keyed answers. In this latter instance, the biserial correlation is a more appropriate index of discrimination, as it corrects for this artificial dichotomy. However, this correction is not very precise when the dichotomy is extreme (e.g., more than a 90/10 split). We discuss this in more detail in Module 11. Crocker and Algina (1986) also provide an excellent discussion of the distinction on pages 317-320.
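
A hedged Python sketch of the two indexes follows, using invented item responses and total scores. The biserial value is obtained from the point-biserial via the standard adjustment for the artificial dichotomy (dividing by the normal-curve ordinate at the split); this is an illustration of the textbook relationship, not output from the text.

import numpy as np
from scipy.stats import norm

item = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])              # 1 = correct (hypothetical)
total = np.array([28, 25, 14, 22, 11, 27, 24, 16, 26, 15])   # hypothetical total scores

r_pb = np.corrcoef(item, total)[0, 1]      # point-biserial (ordinary Pearson r)

p = item.mean()                            # proportion answering correctly
q = 1 - p
y = norm.pdf(norm.ppf(p))                  # normal-curve ordinate at the p/q split
r_bis = r_pb * np.sqrt(p * q) / y          # biserial correction for the artificial dichotomy

print(round(r_pb, 2), round(r_bis, 2))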

What could an instructor do if a student asserted that a classroom exam was not content valid?

Objections to an instructor regarding this issue are all too often ignored. Rather, instructors should carefully listen to student concerns regarding the face validity of a test, and thoughtfully determine whether or not these concerns are valid. Such feedback can be invaluable for test revisions. A good place to start is to review the material covered for the exam and determine if the exam is representative of the weight placed on the various material in both the book and lecture/seminar portions of the material.

How does one go about narrowing down the seemingly endless list of potential "omitted variables" in moderated regression analysis used to determine test bias?

Omitted variables are somewhat similar to the moderator variables discussed in Module 10 on meta-analysis. Remember, those were determined only after reading the relevant literature and forming hypotheses. A similar approach would be used with the potential omitted variables in moderated regression. While it helps if one is intimately familiar with the area under investigation, one should also be aware of the practical and day-to-day issues faced by test users who implement such tests on a daily basis. Thus, having focus groups with test users serving as subject matter experts (SMEs) for identifying potential omitted variables would also be a good idea.

Under what conditions might you choose to use an orthogonal rotation of factors in an EFA? An oblique rotation?

Orthogonal rotations are preferred when we expect the resulting factors to be unrelated. Oblique rotations are preferred when we suspect the resulting factors will be correlated.

What could a student do if he or she thought a classroom exam was not content valid?

Perhaps this is a question that every student has at one time or another considered, often coming to the conclusion that he or she could do little or nothing. Of course, the first step must be to speak with the instructor who developed the test. A more advanced student may also attempt his or her own "mapping" of the test content against the stated objectives of the test/class to see if the test is consistent with the objectives and/or material covered.

Explain why a thorough understanding of the construct measured is essential to the validation process.

Psychological constructs are not directly measurable. Since psychological constructs are not tangible in the same way as height or weight, we must carefully define exactly what is meant by the construct prior to attempting to observe or measure it. A clear definition of what the construct is - and what it is not - is a necessary prerequisite to measuring the construct.

What is the difference between psychometrics and psychological scaling?

Psychological scaling refers to scaling differences among statements reflecting an attribute. Psychometrics refers to the scaling of differences among subjects on an attribute. Psychological scaling is just one component in the psychometric process. However, other procedures discussed throughout the book, such as test development, scale analysis, and related procedures for developing new and better measurement instruments, also are part of the concept of psychometrics.

How is reliability defined in terms of CTT?

Reliability is the ratio of true score variance to total variance. Another way of conceptualizing reliability is as the proportion of observed score variance that is due to true score variance. In either case, it should be clear that increases in error variance decrease reliability.

Under what conditions might we accept a low reliability coefficient for a psychological measure?

Research generally demands lower reliability estimates than applied uses of measures, since the resulting stakes are often lower. Even so, most researchers consider a reliability of at least .70 to be minimally acceptable.

What unique information do IRFs and IIFs provide for test development and revision?

Similar to Question 5 above, it's the ability to see patterns and trends that is most beneficial. Numerical tables make it extremely difficult to see such patterns and trends.

What other factors (besides a truly biased test or an omitted variable) might be falsely suggesting test bias when in fact the test is not biased?

Similar to the answer for Question 4 above, whatever factors are deflating (or inflating) test scores for one group may also be deflating (or inflating) criterion scores. As a result, we may identify intercept bias that is an artifact of the situation. For example, minority students tend to have lower standardized test scores (the predictor). They also tend to drop out of college at a higher rate (the criterion). However, both may be due to the fact that they are provided less social and financial support, on average, in their family environment than white students.

Why are longer tests generally more reliable than shorter tests? What conditions must be met for this to be true?

Since error is considered random in CTT, the more items composing a test, the more the measurement errors will tend to cancel each other out. If the number of items on a test is doubled, true score variance increases fourfold, whereas error variance is only doubled. Reliability thus increases with longer tests, since the resulting observed score reflects a greater amount of true score variance relative to error variance. For this to hold, the added items must measure the same construct and be of comparable quality to the original items.

What does a 95% confidence interval of the mean tell us? How about a 99% confidence interval for an individual score?

The 95% confidence interval of the mean tells us that we are 95% certain that our interval, as constructed, includes the population parameter (μ). However, we must be diligent and remind students that μ is a constant and not a variable; what changes from study to study is the interval we construct, based on the sample mean, variability, and size, not μ. The 99% confidence interval for an individual score requires that we know the standard error of measurement (which requires knowledge of the sample standard deviation and the reliability of the test). Similar to the confidence interval for the mean, we are attempting to obtain an interval estimate for a constant (not a variable) - in this case, an individual's true score. Thus, our interpretation is very similar: the probability that our interval, as constructed, contains the individual's true score is .99.
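
To make the contrast concrete, here is a brief Python sketch of both intervals; the sample statistics, observed score, and reliability are hypothetical, and the individual-score interval is centered on the observed score as in the usual textbook treatment.

import math

mean, sd, n = 75.0, 10.0, 100   # hypothetical sample statistics
obs_score = 80.0                # one examinee's observed score (hypothetical)
rxx = 0.91                      # hypothetical test reliability

# 95% CI for the mean uses the standard error of the mean
se_mean = sd / math.sqrt(n)
ci_mean = (mean - 1.96 * se_mean, mean + 1.96 * se_mean)

# 99% CI for an individual score uses the standard error of measurement
sem = sd * math.sqrt(1 - rxx)
ci_individual = (obs_score - 2.58 * sem, obs_score + 2.58 * sem)

print(ci_mean)        # interval intended to capture the population mean
print(ci_individual)  # interval intended to capture the examinee's true score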

How much can we shorten an existing measure and still maintain adequate reliability? (See Case Study 5.2.)

The Spearman-Brown Prophecy Formula will allow you to determine this.
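
A minimal sketch of the Spearman-Brown prophecy formula in Python follows; the original reliability and the shortening factor are made-up values used only to show the calculation.

def spearman_brown(r_original, k):
    """Projected reliability when test length is multiplied by a factor of k."""
    return (k * r_original) / (1 + (k - 1) * r_original)

# e.g., cutting a test with reliability .90 in half (k = 0.5):
print(round(spearman_brown(0.90, 0.5), 2))  # about .82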

How could a small organization determine which selection tests might be appropriate for use in selection of new employees?

The concept of synthetic validity argues that the validity of a test could be generalized between similar contexts. Thus, a small organization could determine which valid selection tests were used by large organizations with similar jobs, and use those same selection tests in their own organization. Further, meta-analysis has determined typical criterion-related validity estimates for a wide variety of types of selection tests across a large number of jobs.

In conducting an EFA, describe the procedure you would follow to determine whether items found to load on a factor actually form a meaningful, interpretable subdimension.

The content of each item loading highly on a factor would be closely examined. Some rationale for why the items load on the factor would then be generated. Typically, the item or items with the highest loading provide the clearest indication of what the factor should be labeled. Items that seem rationally unrelated to the others could be deleted.

In situations where individuals are unlikely to omit any of the questions on purpose, is it appropriate to correct for guessing?

This is probably more an issue of whether correcting for guessing in this situation is of any practical use. Since no omits are expected, there will be a direct linear relationship between the original scores and the corrected scores. What comes into play in the classroom situation, however, is that grading is criterion referenced; that is, a student must answer 90% correct or higher to earn an A. Thus, correcting for guessing may affect students' ability to obtain a given grade.

What are the different sources of error that can be assessed with classical test theory reliability analysis?

Three basic sources of error can be assessed using classical reliability theory: changes in examinees' scores over time (stability); content sampling, both within measures (internal consistency) and across measures (equivalence); and inconsistency across raters (inter-rater reliability). One of the major drawbacks of classical reliability theory is that typically only one source of error is assessed with a given index. Modern reliability theory (such as generalizability theory) allows one to assess more than one source of error at a time.

If conducting a correction for restriction in range of the predictor variable in a concurrent criterion-related validity study, to whom does the population refer? How might you best estimate the population (i.e., unrestricted) predictor standard deviation?

The population refers to all potential test takers; for example, it might be the population of applicants for a job. The best way to estimate the population (i.e., unrestricted) predictor standard deviation is to administer the measure to a large sample drawn directly from the intended population.
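
For illustration, the widely used correction for direct restriction in range on the predictor (often attributed to Thorndike's Case 2) is sketched below in Python; the restricted validity coefficient and the two standard deviations are hypothetical values.

import math

def correct_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the unrestricted validity from the restricted-sample correlation."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

# Observed validity of .30 in the restricted (incumbent) sample, where the
# applicant-pool SD is 10 but the incumbent SD is only 6:
print(round(correct_restriction(0.30, 10, 6), 2))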

What factors should be considered when determining whether a requested test accommodation is reasonable?

The primary factors that should be considered are the intent of the test and the nature of the disability. If the disability is directly related to the construct being assessed (such as deafness on a hearing test), an accommodation is inappropriate. If the disability is not related to the construct that the test is intended to measure, then an accommodation should be provided. The ideal accommodation will be minimally disruptive of the typical testing procedure while meeting the needs of the test taker with the disability.

Upon their introduction to factor analysis, many students are likely to agree with Pedhazur and Schmelkin's (1991) assertion that factor analysis is like "a forest in which one can get lost in no time" (p. 590). Understanding, however, might be aided by identifying the elements that you find confusing. List two to three aspects of factor analysis that, if clarified, would help you to better understand this family of procedures.

The intent of this item is to allow students the opportunity to voice their confusion with factor analysis - and to provide the instructor an opportunity to clarify those content areas that students frequently misunderstand.

What are the major advantages of IRT over CTT-IA?

The major advantage of IRT over CTT-IA is the dramatic increase in the amount of information obtained regarding each item, each test taker, and the test as a whole. In addition, this increased level of information is also sample invariant. That is, a major disadvantage of CTT-IA is that the results (e.g., item parameters - difficulty, item discrimination) are highly dependent on the sample used to generate them. Thus, they will likely differ (sometimes dramatically) from one sample to another. On the other hand, IRT parameters do not depend on the sample used to compute them and thus can be generalized from one sample of test takers to another.

What are the advantages and disadvantages of the 1-PL, 2-PL, and 3-PL IRT models?

The major advantage of the 1-Parameter Logistic Model (which estimates only item difficulty) is its simplicity. If in fact the 1-PL model fits the data well, we have parsimony in estimating our model. However, this model may be unrealistic in many situations in that different items often differentially discriminate. Thus a 2-PL model (that estimates both difficulty and discrimination) would be needed. In some instances, a third (guessing) parameter needs to be estimated, as different items are easier to guess than others. However, as the complexity of the model increases, so too must the evidence and justification for it.
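
The nesting of these models can be sketched directly from the 3-PL item response function; fixing c = 0 yields the 2-PL, and additionally fixing a = 1 yields a 1-PL (Rasch-type) model. The parameter values below are illustrative only, and this uses the plain logistic form without the optional 1.7 scaling constant.

import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at ability theta under the 3-PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(irf_3pl(theta, a=1.5, b=0.5, c=0.20), 2))  # 3-PL
print(np.round(irf_3pl(theta, a=1.5, b=0.5), 2))          # 2-PL (c fixed at 0)
print(np.round(irf_3pl(theta, b=0.5), 2))                 # 1-PL (a fixed at 1, c at 0)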

What are the differences among predictive, concurrent, and postdictive criterion-related validation designs?

The three criterion-related designs differ both in the order in which test scores and criterion scores are collected and in how the sample is selected. In predictive criterion-related validity studies, predictor scores are collected prior to criterion scores. Since criterion scores have yet to be determined, random selection is possible in predictive designs. In concurrent criterion-related validity studies, predictor and criterion scores are collected at about the same time; the sample is usually predetermined, so random selection is not possible. Finally, in postdictive criterion-related validity studies, criterion scores are collected before test scores, and once again the sample is predetermined.

What advantages do IRFs (i.e., graphs) have over simply examining the item parameters in table form?

There is a direct analogy here with using descriptive graphs (e.g., frequency tables, bar charts, stem-and-leaf displays, box plots) rather than tabular data. Often a picture is worth a thousand words, and that holds true in statistics as well: it is relatively easy to get lost in reams of printed tables, whereas it is often easier to see general patterns and make comparisons by examining graphs. For example, looking at the three panels in Figure 19.1, the a (discrimination), b (difficulty), and c (pseudo-guessing) parameters are all easily distinguished across the three graphs. These differences are exaggerated in this instance for pedagogical purposes, but in many applications of IRT, similar differences can be found by simply examining item response function plots.

Assuming we did use the same regression line for all three groups, which group would be most likely to raise claims of test bias? Unfairness?

This is a thorny question because of the nature of the criterion variable - suicide risk. Most likely, however, it would be advocacy groups for both the young and the old, as those two groups would be the ones underpredicted. As can be seen in Figure 11.1, we see intercept, but not slope, bias when comparing the young and the old. Comparing the young and the middle-aged, we see slope, but not intercept, bias. When we compare the old and the middle-aged, we see both slope and intercept bias. Therefore, every group could claim some form of predictive bias. As for fairness, that is a socio-political issue; as a result, any group can claim a test is unfair at any time, regardless of whether the test shows signs of predictive bias.

How do you decide on which external criterion to use when computing an item-criterion index?

This is a very context dependent decision. It would depend largely on the reason(s) the test is being administered in the first place. That is, are we administering the test in hopes of predicting a given outcome (e.g., a child's success in school or an employee's likelihood of quitting an organization within a year of being hired)? If so, then that criterion will be most useful for us.

What is the difference between scaling and classification?

Typically the difference is in terms of the level of measurement of the variable to be examined. For example, when we have nominal-level data we can only qualitatively distinguish scores, which allows us to classify individuals into different groups (e.g., introverted versus extroverted). However, when we have at least ordinal-level data we can quantitatively distinguish scores, which allows us to scale a variable such as extroversion. As a result, scaling allows us to determine the amount of a given construct (e.g., extroversion) that test takers demonstrate, not just their broad classification as introverted or extroverted. As an example, the Myers-Briggs Type Indicator (MBTI) classifies individuals as introverted or extroverted (their type). While individuals do obtain a score on their level of introversion versus extroversion, most applications of the MBTI focus on the general classification of introverted versus extroverted. Conversely, most Big Five measures of extroversion, such as the NEO, provide a more detailed score on the introversion/extroversion dimension, which can be compared to a group's mean or other normative data for interpretation purposes.

Which stakeholders in the testing process (see Module 1) are responsible for determining whether test bias actually exists or not?

Ultimately it is the Test User who must determine this issue. The Test Developer may investigate the issue of bias during the test development process; however, a test may show bias in one situation but not another. Therefore, the ultimate responsibility for determining whether the application of a given test is biased rests with the Test User. Society, of course, will ultimately determine the issue of fairness through laws and regulations.

The unified view of test validation regards all aspects of validation as reaching for the same goal. What is the overall goal of test validation?

Validation efforts are intended to provide a "compelling argument that the available evidence justifies the test interpretation and use" (Messick, 1995, p. 744).

Why would you want to understand the dimensionality of a scale?

We must understand the dimensionality of a measure to know what is being assessed by its use. In developing a new rationally derived measure, we'd want to know whether our measure contained the expected dimensionality identified in our definition of the construct.

Although it is empirically possible to correct for attenuation due to unreliability in a predictor, this is a violation of ethics if we intend to use the predictor for applied purposes. Explain why we can ethically correct for unreliability in the criterion but cannot ethically correct for unreliability in a predictor.

When examining the criterion-related validity of a test, we are basically assessing the usefulness of that test for predicting criterion scores. Unfortunately, the obtained correlation will be attenuated if the criterion has less than perfect reliability. Since our focus of attention is on evaluating the usefulness of the test (not the criterion), we can compute the validity between test scores and a perfectly reliable criterion of interest by using the correction for unreliability in the criterion formula. However, we should not correct for unreliability in the predictor (i.e., test) because the unreliability in the test will still be present when we use it for our applied purpose. Since a test with less than perfect reliability will have error when the test is actually used, we must include this amount of error in our evaluation of the test's criterion-related validity.
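
The correction for unreliability in the criterion only can be sketched in a couple of lines of Python; the observed validity and criterion reliability below are invented values used solely to show the calculation.

import math

def correct_criterion_unreliability(r_xy, r_yy):
    """Estimated validity against a perfectly reliable criterion."""
    return r_xy / math.sqrt(r_yy)

# Observed test-criterion correlation of .35 with criterion reliability of .70:
print(round(correct_criterion_unreliability(0.35, 0.70), 2))  # about .42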

What are the advantages of using a scatter plot in addition to the Pearson product moment correlation?

While a scatter plot does give some indication of the strength and direction of a relationship between two variables, those properties of the relationship can also (and more precisely) be obtained from the correlation index directly. However, the real advantage of inspecting the scatter plot before taking the correlation index at face value is that the scatter plot will identify: 1) possible univariate and bivariate outliers, 2) non-linear relationships, 3) potential restriction of range issues, 4) possible sub-population differences (e.g., different relationships for men versus women), and 5) heteroscedasticity along the regression line. Thus, if the scatter plot is not inspected first, then the researcher may erroneously conclude that there is no (or a very small) relationship between two variables when in fact one of the factors above is suppressing the correlation index making it appear smaller than it really is.

Are descriptive statistics or inferential statistics used more in applied psychological measurement?

While both descriptive and inferential statistics are used in applied psychological measurement, more often than not we are most interested in simply describing a set of test scores with frequency tables, scatter plots, and descriptive statistics point estimates, such as means, standard deviations, and skewness.

The content approach to test validation has historically relied heavily on expert judgment, though newer approaches argue that any competent individual might be capable of providing validity evidence if provided a clear definition of the construct of interest. Discuss the degree to which you feel it is necessary to rely on SMEs to provide evidence of content validity.

While human judgment is certainly fallible, it is very useful to have expert judgment of the appropriateness of a test. Such critical inspection of a test is essential to the validation process, particularly when a content approach is used. Nonetheless, newer approaches to content validation which use Likert-type ratings to evaluate the relationship between items and content domain often use non-experts. These approaches make the recruitment of raters easier, and thus result in larger samples of raters. The requirement for such raters is that they are unbiased, and of sufficient intellectual ability to evaluate the correspondence between items and theoretical constructs.

Will your criteria for evaluating your item difficulty and discrimination indexes change as the format of the item changes? (e.g., true-false, three, four, or five-option multiple choice, Likert scaling)

Yes. As the number of options changes, the chance that test takers can correctly guess the answer also changes. For example, test takers have a 50-50 chance of answering a true-false question correctly, but only a 20% chance of answering a five-option multiple-choice question correctly. Of course, this assumes that the test taker is guessing randomly and that the non-keyed options are equally attractive (i.e., none are so easily identified as false that they can be ruled out, effectively reducing the number of options and thus increasing the probability of a correct guess). Random guessing is rare.

Can reliability estimates be used to provide evidence of the construct validity of test scores? Explain.

Yes. Cronbach and Meehl argued that a number of methods could be used to provide evidence of the construct validity of a measure. The assessment of internal consistency reliability would be consistent with their recommendation for studies of the internal structure of a measure.

Will your criteria for evaluating your item difficulty and discrimination indexes change if a test is norm referenced versus criterion referenced?

Yes. The answer to Question 4 above, for example, assumes we are using a norm-referenced standard; that is, we are most interested in distinguishing or differentiating test takers on the construct(s) being assessed. With criterion-referenced standards, however, we wish to see whether each test taker knows a given set of material or possesses a certain level of skill or ability. In this latter instance we will most likely have a much higher average p-value across items. For example, we might require that employees pass a safe-driving test with a score of 75% to be eligible to drive a company car; thus, the average p-value for items is likely to be at or above .75 in this criterion-referenced situation.

List the different types of decisions that you need to make when conducting an EFA.

You need to decide on the extraction method and the rotation method. In addition, you will need to decide how many factors to retain, as well as the threshold for factor loadings that you will consider to signify that an item loads on a particular factor. Finally, you will need to decide on the psychological meaning of each factor.
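
As a hedged sketch of where those decisions show up in practice, the Python snippet below marks each one in comments using scikit-learn's FactorAnalysis; it assumes a recent scikit-learn (0.24 or later) for the rotation argument, and the data are randomly generated for illustration only.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
items = rng.normal(size=(250, 8))         # 250 respondents, 8 hypothetical items

efa = FactorAnalysis(n_components=2,      # decision: number of factors to retain
                     rotation="varimax")  # decision: rotation method
loadings = efa.fit(items).components_.T   # items-by-factors loading matrix (extraction decision)

threshold = 0.40                          # decision: salient-loading cutoff
print(np.abs(loadings) >= threshold)      # which items are treated as loading on each factor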

If you had recently translated a test into a different cultural context, how would you assess each of the four types of equivalence?

• Content equivalence could be assessed by conducting a study that examines both groups of test takers' familiarity with the material presented in the item stems.

• Conceptual equivalence could be assessed by conducting a "think aloud" study in which test takers from both groups are asked to verbally explain what each item is asking.

• An assessment of functional equivalence would require an understanding of both cultures. Perhaps focus groups could be assembled in both groups to assess the equivalence of the meaning of the construct to be examined by the measure.

• Scalar equivalence could be examined by administering the measure to a large sample of members of both groups and then comparing the groups' means and standard deviations of test scores.

