PSYCH ASSESS - CH. 5 (RELIABILITY)

Random error

Sometimes referred to as "noise," this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.

false - An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments

T OR F: An estimate of test-retest reliability may be most appropriate in gauging the validity of tests that employ outcome measures such as reaction time or perceptual judgments

TRUE

T OR F: If we are to come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-retest reliability estimate must extend to a consideration of possible intervening factors between test administrations.

false - In everyday conversation, reliability is a synonym for dependability or consistency.

T OR F: In everyday conversation, validity is a synonym for dependability or consistency.

false - External to the test environment in a global sense, the events of the day may also serve as a source of error.

T OR F: It is unlikely that the events of the day may serve as a source of error.

KR-21

The ____________ formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty.
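Worked example (a minimal sketch, not part of the original set): KR-21 is usually written as r = (k / (k - 1)) * (1 - M(k - M) / (k * s^2)), where k is the number of items, M is the mean total score, and s^2 is the variance of total scores. The function name and the sample numbers are made up for illustration.

```python
def kr21(k, mean, variance):
    """KR-21 estimate from item count k, mean total score, and total-score variance.

    Assumes dichotomous items of roughly equal difficulty.
    """
    if k < 2 or variance <= 0:
        raise ValueError("need at least 2 items and a positive variance")
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))


# Hypothetical 20-item test with mean 12 and variance 16.
example = kr21(k=20, mean=12.0, variance=16.0)
print(round(example, 3))
```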

test-retest measure

The _______________ is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait

decision study

The ________________ is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use

standard error of the difference

The _________________ between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores
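Worked example (a sketch under stated assumptions): the standard error of the difference is commonly computed as the square root of the sum of the two squared standard errors of measurement, which is why it exceeds either one alone. The standard deviation and reliability values below are invented for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def standard_error_of_difference(sem_1, sem_2):
    """SE of the difference = sqrt(SEM1^2 + SEM2^2)."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


# Hypothetical tests on the same scale (sd = 15) with reliabilities .91 and .84.
sem_1 = standard_error_of_measurement(15, 0.91)
sem_2 = standard_error_of_measurement(15, 0.84)
se_diff = standard_error_of_difference(sem_1, sem_2)
print(round(sem_1, 2), round(sem_2, 2), round(se_diff, 2))
```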

reliability coefficient

The __________________ helps the test developer build an adequate measuring instrument, and it helps the test user select a suitable test

standard error of measurement

The __________________ is the tool used to estimate or infer the extent to which an observed score deviates from a true score

Spearman-Brown

The ____________________ formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test

split-half reliability

The computation of a coefficient of ____________ generally entails three steps: Step 1. Divide the test into equivalent halves. Step 2. Calculate a Pearson r between scores on the two halves of the test. Step 3. Adjust the half-test reliability using the Spearman-Brown formula (discussed shortly).
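Worked example (a minimal sketch, assuming an odd-even split and made-up 0/1 item data): the three steps above can be traced in a few lines. All function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_brown(r, n=2.0):
    """Estimated reliability of a test n times as long as the one that produced r."""
    return n * r / (1 + (n - 1) * r)


def split_half_reliability(item_scores):
    # Step 1: divide the test into halves (odd-numbered vs. even-numbered items).
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlate scores on the two halves.
    r_half = pearson_r(odd, even)
    # Step 3: adjust the half-test correlation with Spearman-Brown.
    return r_half, spearman_brown(r_half)


# Hypothetical data: 5 testtakers (rows) by 6 dichotomous items (columns).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 3))
```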

coefficient of equivalence.

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the ______________________

coefficients of generalizability

The influence of particular facets on the test score is represented by ______________. These coefficients are similar to reliability coefficients in the true score model

homogeneous

The more ______________ a test is, the more inter-item consistency it can be expected to have

reliability

The term _________________ refers to the proportion of the total variance attributed to true variance
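Worked example (an illustrative sketch only): in the true score model, total variance is true variance plus error variance, so the proportion described above can be computed directly once both components are assumed known. The variance values are hypothetical; in practice they are estimated, not observed.

```python
def reliability_from_variances(true_variance, error_variance):
    """Proportion of total variance attributable to true variance."""
    total_variance = true_variance + error_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return true_variance / total_variance


# Hypothetical components: 80 units of true variance, 20 of error variance.
r_xx = reliability_from_variances(80.0, 20.0)
print(r_xx)
```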

parallel forms reliability

The term ___________________ refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

alternate forms reliability

The term ____________________ refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.

dichotomous test items

(test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions)

polytomous test items

(test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct).

systematic error

(what type of error) a 12-inch ruler may be found to be, in actuality, a tenth of one inch longer than 12 inches. All of the 12-inch measurements previously taken with that ruler were systematically off by one-tenth of an inch; that is, anything measured to be exactly 12 inches with that ruler was, in reality, 12 and one-tenth inches

reliability coefficient

A ____________________ is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.

Spearman-Brown formula

A __________ could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
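Worked example (a sketch under stated assumptions): solving the Spearman-Brown formula for the length factor n gives n = r_target(1 - r_current) / (r_current(1 - r_target)). The 10-item test and the reliability values are hypothetical, and the result holds only if the added items are equivalent to the originals.

```python
import math


def spearman_brown(r, n):
    """Estimated reliability of a test n times as long as the current one."""
    return n * r / (1 + (n - 1) * r)


def length_factor_needed(r_current, r_target):
    """Factor by which the test must be lengthened to reach r_target."""
    return r_target * (1 - r_current) / (r_current * (1 - r_target))


# Hypothetical 10-item test with reliability .60; the goal is .80.
factor = length_factor_needed(0.60, 0.80)
items_needed = math.ceil(10 * factor)
print(round(factor, 2), items_needed)
```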

domain

A _____________ of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test

criterion-referenced test

A ________________ is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective

heterogeneous

A __________________ (or nonhomogeneous) test is composed of items that measure more than one trait.

generalizability study

A __________________ examines how generalizable scores from a particular test are if the test is administered in different situations

dynamic characteristic

A __________________ is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences

universe score

According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the __________________

decision study

After the generalizability study is done, Cronbach et al. (1972) recommended that test developers do a ______________, which involves the application of information from the generalizability study

homogeneity; homogeneous

An index of inter-item consistency, in turn, is useful in assessing the _____________ of the test. Tests are said to be _____________ if they contain items that measure a single trait

reliability

And whereas in everyday conversation ______________ always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent

odd-even reliability

Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as ______________________

criterion-referenced tests

Unlike norm-referenced tests, _______________ tend to contain material that has been mastered in hierarchical fashion

reliability

Broadly speaking, in the language of psychometrics ___________________ refers to consistency in measurement.

speed test

By contrast, a ______________ generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.

Spearman-Brown

By determining the reliability of one half of a test, a test developer can use the _______________ formula to estimate the reliability of a whole test

reliability coefficient

By employing the ___________________ in the formula for the standard error of measurement, the test user now has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score
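Worked example (a minimal sketch): the formula referred to above is usually written as SEM = SD * sqrt(1 - r), where SD is the standard deviation of test scores and r is the reliability coefficient. The numbers are invented for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)


# Hypothetical test with sd = 10 and a reliability coefficient of .84.
sem = standard_error_of_measurement(10, 0.84)
print(round(sem, 2))
```

Higher reliability shrinks the SEM: with r = 1 the SEM is 0, and with r = 0 it equals the standard deviation.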

true variance; error variance.

Variance from true differences is ___________, and variance from irrelevant, random sources is ___________

inter-scorer reliability

Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, ________________ is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure

standard error of measurement

We may define the _____________________ as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests

standard error of the difference

Comparisons between scores are made using the ________________________, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.

power test

When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a _______________

coefficient of stability

When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the ______________________

generalizability theory

Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), _____________ is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation

Random error

Examples of random error that could conceivably affect test scores range from unanticipated events happening in the immediate vicinity of the test environment to unanticipated physical events happening within the testtaker

confidence interval

Further, the standard error of measurement is useful in establishing what is called a ________________: a range or band of test scores that is likely to contain the true score.
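Worked example (a sketch, assuming normally distributed measurement error): a band around an observed score can be built as score plus or minus z times the SEM, with z = 1.96 for roughly 95% confidence. The score and SEM values are hypothetical.

```python
def confidence_interval(observed_score, sem, z=1.96):
    """Band of scores likely to contain the true score (z = 1.96 for about 95%)."""
    margin = z * sem
    return observed_score - margin, observed_score + margin


# Hypothetical observed score of 100 on a test whose SEM is 4.
low, high = confidence_interval(100, 4.0)
print(round(low, 2), round(high, 2))
```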

homogeneity

___________ (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, ____________ is the extent to which items in a scale are unifactorial.

domain sampling theory

Proponents of ____________ seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.

heterogeneity

_____________ describes the degree to which a test measures different factors

alternate-forms reliability

In _______________, a reliability estimate is based on the correlation between the two total scores on the two forms.

domain sampling theory

In _______________, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample

split-half reliability

In ________________, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman-Brown formula to obtain a reliability estimate of the whole test

coefficient alpha

In contrast to KR-20, which is appropriately used only on tests with dichotomous items, ______________ is appropriate for use on tests containing nondichotomous items
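Worked example (a minimal sketch with made-up Likert-type data): coefficient alpha is usually computed as (k / (k - 1)) * (1 - sum of item variances / variance of total scores). Population variances are used throughout here; that choice, the function name, and the data are assumptions for illustration.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of rows (testtakers) by columns (items)."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least 2 items")
    item_variances = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 testtakers by 4 items on a 1-to-5 scale.
likert = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]
alpha = coefficient_alpha(likert)
print(round(alpha, 3))
```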

measurement error

In general, the term _______________ refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured

parallel forms

In theory, the means of scores obtained on _______________ correlate equally with the true score. More practically, scores obtained on _____________ correlate equally with other measures.

static characteristic

When what is measured is a trait, state, or ability presumed to be relatively unchanging (a ______________), obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate

split-half reliability

It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).

systematic or random

Measurement error, much like error in general, can be categorized as being either _____________ or ____________

Pearson

A _____________ r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating "perfect dissimilarity."

mini-parallel-forms

a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "____________________," with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.

transient error

a source of error attributable to variations in the testtaker's feelings, moods, or mental state over time.

domain sampling theory

This theory is better known today in one of its many modified forms as generalizability theory.

Examiner-related variables

are potential sources of error variance. The examiner's physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here.

Alternate forms

are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation "parallel," ________________ of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty

item sampling or content sampling

One source of variance during test construction is __________ or _____________, terms that refer to variation among items within a test as well as to variation among items between tests

testtaker variables

Other potential sources of error variance during test administration are ____________________. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance

assessment

Potential sources of nonsystematic error in such an ____________ situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting.

Average proportional distance

Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach's alpha), the APD is a measure that focuses on the degree of difference that exists between item scores

homogeneous

Recall that a test is said to be ___________ in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be __________ in items.

test-retest reliability

Recall that a _____________ estimate is based on the correlation between the total scores on two administrations of the same test

decision study

In the ______________, developers examine the usefulness of test scores in helping the test user make decisions.

internal consistency estimate of reliability (or estimate of inter-item consistency)

An estimate of reliability derived from an evaluation of the internal consistency of the test items

sampling error

Example: the extent to which the population of voters in the study actually was representative of voters in the election

false - An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people

t or f: An estimate of the reliability of a test cannot be obtained without developing an alternate form of the test and without having to administer the test twice to the same people

alternate-forms or parallel-forms

if you have ever wondered whether the two forms of the test were really equivalent, you have wondered about the _____________ or _______________ reliability of the test.

generalizability theory

instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score.

Rasch model

is a reference to an IRT model with very specific assumptions about the underlying distribution.

Random error

is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.

Test-retest reliability

is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
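Worked example (a minimal sketch with invented scores): the estimate described above is a Pearson r computed over pairs of scores from the same people at two administrations. Names and data are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical scores for the same five people on two administrations.
time_1 = [10, 12, 14, 16, 18]
time_2 = [11, 13, 13, 17, 19]
test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 3))
```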

standard error of measurement

is an index of the extent to which one individual's scores vary over tests presumed to be parallel. In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained.

split-half reliability

is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.

Inter-scorer reliability

is often used when coding nonverbal behavior. For example, a researcher who wishes to quantify some aspect of nonverbal behavior, such as depressed mood, would start by composing a checklist of behaviors that constitute depressed mood

KR-20 (Kuder-Richardson formula 20)

is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
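Worked example (a sketch with made-up right/wrong data): KR-20 is usually written as (k / (k - 1)) * (1 - sum of p*q / variance of total scores), where p is the proportion passing an item and q = 1 - p. The population variance is used here as an assumption of the sketch.

```python
from statistics import pvariance


def kr20(item_scores):
    """KR-20 for rows (testtakers) by columns (items scored 0 or 1)."""
    n = len(item_scores)
    k = len(item_scores[0])
    total_variance = pvariance([sum(row) for row in item_scores])
    if k < 2 or total_variance == 0:
        raise ValueError("need at least 2 items and nonzero total variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n  # proportion answering item j correctly
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: 5 testtakers by 6 dichotomous items.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
kr20_estimate = kr20(scores)
print(round(kr20_estimate, 3))
```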

item response theory

it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it
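Worked example (an assumption-laden sketch, not taken from the original set): one common way to write such a probability is a logistic function of the gap between the person's trait level (theta) and an item's difficulty, scaled by the item's discrimination. The parameter names and values are illustrative; with discrimination fixed at 1 this takes the form usually associated with the Rasch model.

```python
import math


def item_response_probability(theta, difficulty, discrimination=1.0):
    """Logistic probability of a keyed response given trait level theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical item of average difficulty (0.0) answered at three trait levels.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(item_response_probability(theta, 0.0), 3))
```

A more discriminating item separates people above and below its difficulty more sharply, which is what a larger `discrimination` value does in this sketch.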

The standard error of measurement

it provides an estimate of the amount of error inherent in an observed score or measurement.

Inter-rater consistency

may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy

coefficient alpha

may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula

Parallel forms

of a test exist when, for each form of the test, the means and the variances of observed test scores are equal.

Latent-trait theories

propose models that describe how the latent trait influences performance on each test item. Unlike test scores or true scores, latent traits theoretically can take on values from −∞ to +∞ [negative infinity to positive infinity]

The standard error of measurement

provides a measure of the precision of an observed test score

systematic error

refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured

error

refers to the component of the observed test score that does not have to do with the testtaker's ability

Inter-item consistency

refers to the degree of correlation among all the items on a scale. A measure of _________________ is calculated from a single administration of a single form of a test.

discrimination

signifies the degree to which an item differentiates among people with higher or lower levels of the trait

variance

A statistic useful in describing sources of test score variability is the _____________ (σ²), the standard deviation squared

true

t or f: Sources of error variance that occur during test administration may influence the testtaker's attention or motivation.

false - Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory

t or f: Of the three types of estimates of reliability, measures of inter-scorer reliability are perhaps the most compatible with domain sampling theory.

true

t or f: A calculated APD of .25 is suggestive of problems with the internal consistency of the test.

false - A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring

t or f: A coefficient of inter-rater reliability, for example, provides information about accuracy as a result of test scoring

true

t or f: A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another

true

t or f: Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.

true

t or f: Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test.

true

t or f: Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit

true

t or f: Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.

false - Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened

t or f: Because the reliability of a test is affected by its length, a formula is burdensome for estimating the reliability of a test that has been shortened or lengthened

true

t or f: Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests

true

t or f: Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability

false - Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test

t or f: Coefficient alpha is widely used as a measure of reliability, in part because it requires multiple administrations of the test

false - Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next

t or f: Error related to any of the number of possible variables operative in a testing situation cannot contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next

true

t or f: If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula.

true

t or f: If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.

true

t or f: If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method

false - If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability.

t or f: If the reliability of the original test is relatively low, then it may be practical to increase the number of items to reach an acceptable level of reliability.

false - If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.

t or f: If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be higher.

false - In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences

t or f: In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have increased the error variance caused by scorer differences

false - In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores

t or f: In practice, the standard error of measurement is most frequently used in the observation of individual test scores

false - In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses

t or f: In some tests of personality, examiners are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses

false - Internal consistency estimates of reliability, such as that obtained by use of the Spearman-Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests

t or f: Internal consistency estimates of reliability, such as that obtained by use of the Spearman-Brown formula, are most appropriate for measuring the reliability of heterogeneous tests and speed tests

true

t or f: It is even conceivable that significant changes in the testtaker's body weight could be a source of error variance.

true

t or f: Manuals for individual intelligence tests tend to be very explicit about scoring criteria, lest examinees' measured intelligence vary as a function of who is doing the testing and scoring.

false - Measures of reliability are estimates, and estimates are subject to error

t or f: Measures of reliability are absolute

false - Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations.

t or f: Reduction in test size for the purpose of reducing test administration time is an illegal practice in certain situations.

true

t or f: Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.

true

t or f: Subjectivity in scoring can even enter into behavioral assessment.

false - Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.

t or f: Test heterogeneity is desirable because it allows relatively straightforward test-score interpretation.

false - Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested.

t or f: Testtakers with the same score on a heterogeneous test probably have similar abilities in the area tested.

false - The amount of error in a specific test score is embodied in the standard error of measurement. But scores can change from one testing to the next for reasons other than error.

t or f: The amount of error in a specific test score is embodied in the standard error of measurement. But scores remain as is from one testing to the next for reasons other than error.

false - The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations)

t or f: The element of objectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations)

true

t or f: The general "rule of thumb" for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value of .25 to .2 is in the acceptable range

false - The greater the proportion of the total variance attributed to true variance, the more reliable the test

t or f: The greater the proportion of the total variance attributed to true variance, the less reliable the test is

false - The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha.

t or f: The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient of beta

false - The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower

t or f: The shorter the time that passes, the greater the likelihood that the reliability coefficient will be lower

true

t or f: The standard error of measurement can be used to set the confidence interval for a particular score or to determine whether a score is significantly different from a criterion

false - Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1

t or f: Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 2

false - When it comes to calculating split-half reliability coefficients, there's more than one way to split a test, but there are some ways you should never split a test

t or f: When it comes to calculating split-half reliability coefficients, there's more than one way to split a test, and all splitting methods are valid

false - Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar

t or f: Where test items are highly heterogeneous, KR-20 and split-half reliability estimates will be similar

false - A challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

t or f: A challenge in test development is to minimize the proportion of the total variance that is true variance and to maximize the proportion of the total variance that is error variance.

false - A test may be reliable in one context and unreliable in another.

t or f: A test is always reliable in all contexts

false

t or f: alternate forms and parallel forms are similar in nature

true

t or f: coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are

true

t or f: if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

false - One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time

t or f: one way of estimating the reliability of a measuring instrument is by using different instruments to measure the same thing at two points in time

false - Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items

t or f: reliability does not increase as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items

false - scorers and scoring systems are potential sources of error variance

t or f: scorers and scoring systems are unlikely sources of error variance

true

t or f: the reliability of the instrument might be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules

true

t or f: traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted.

classical test theory

The theory holding that a score on an ability test is presumed to reflect not only the testtaker's true score on the ability being measured but also error

(1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency

there are basically three approaches to the estimation of reliability:

average proportional distance method

we define the _______________________ as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
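Worked example (a hedged sketch): the card does not spell out the computation, so this assumes the commonly described three-step procedure: take the absolute difference between scores for every pair of items, average those differences, then divide by the number of response options minus one. The function name and data are hypothetical.

```python
from itertools import combinations


def average_proportional_distance(item_scores, n_options):
    """APD under the assumed three-step procedure described in the lead-in."""
    k = len(item_scores[0])
    # Step 1: absolute differences between scores for every pair of items, per testtaker.
    diffs = [
        abs(row[i] - row[j])
        for row in item_scores
        for i, j in combinations(range(k), 2)
    ]
    # Step 2: average the differences.
    mean_diff = sum(diffs) / len(diffs)
    # Step 3: divide by the number of response options minus one.
    return mean_diff / (n_options - 1)


# Hypothetical responses: 3 testtakers by 3 items on a 5-option scale.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 4],
]
apd = average_proportional_distance(responses, n_options=5)
print(round(apd, 3))
```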

true score

we will define _______________ as a value that according to classical test theory genuinely reflects an individual's ability (or trait) level as measured by a particular test.

facets

The universe is described in terms of its ____________, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration

