Test Construction


According to CTT, how does one obtain a test score (X)?

-composed of 2 components: a true score component (T) and an error component (E): X = T + E

The slope (steepness) of an item characteristic curve indicates the item's:

discrimination

For an achievement test item that has an item discrimination index (D) of +1.0, you would expect:

high achievers to be more likely than low achievers to answer the item correctly

Consensual observer drift tends to:

produce an overestimate of a test's inter-rater reliability.

You administer a test to a group of examinees on April 1st and then re-administer the same test to the same group of examinees on May 1st. When you correlate the two sets of scores, you will have obtained a coefficient of:

stability. -Test-retest reliability indicates the stability of scores over time, and the test-retest reliability coefficient is also known as the coefficient of stability.

What is a test?

-systematic method of measuring a sample of behavior

A test has a standard deviation of 12, a mean of 60, a reliability coefficient of .91, and a validity coefficient of .60. The test's standard error of measurement is equal to:

3.6 -To calculate the standard error of measurement, you need to know the standard deviation of the test scores and the test's reliability coefficient. The standard deviation of the test scores is 12 and the reliability coefficient is .91. -To calculate the standard error of measurement, you multiply the standard deviation times the square root of one minus the reliability coefficient: 1 minus .91 is .09; the square root of .09 is .3; .3 times 12 is 3.6.

A student receives a score of 450 on a college aptitude test that has a mean of 500 and standard error of measurement of 50. The 68% confidence interval for the student's score is:

400 to 500 -The standard error of measurement is used to construct a confidence interval around an obtained test score -To construct the 68% confidence interval, one standard error of measurement is added to and subtracted from the obtained score. -Since the student obtained a score of 450 on the test, the 68% confidence interval for the score is 400 to 500.

For a newly developed test of cognitive flexibility, coefficient alpha is .55. Which of the following would be useful for increasing the size of this coefficient?

Adding more items that are similar in terms of content and quality

Does a reliability coefficient provide info about what is actually being measured by a test?

No. -it only indicates whether the attribute measured by the test is being assessed in a consistent, precise way -unlike other correlation coefficients, it is NEVER squared; it is interpreted directly

Can a test's reliability be measured directly?

No. It must be estimated -there are several ways to estimate it -each involves assessing the consistency of an examinee's scores over time, across different content samples, or across different scorers, and each is based on the assumption that variability that is consistent is true score variability, while variability that is inconsistent is measurement error

What is formula for standard error of measurement?

SEmeas = SDx (standard deviation of test scores) multiplied by the square root of (1 - rxx), where rxx is the reliability coefficient -the magnitude of the standard error of measurement is affected by the standard deviation of test scores and the test's reliability coefficient -the lower the test SD and the higher the reliability coefficient, the smaller the standard error of measurement ex: when the reliability coefficient equals 1.0, the standard error equals 0, but when the reliability coefficient equals 0, the standard error is equal to the SD of the test scores
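A minimal Python sketch of this formula; the function name is mine, and the example values simply restate the cards above:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD of the test scores
    multiplied by the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

print(sem(12, 0.91))  # ~3.6, as in the worked example above
print(sem(12, 1.0))   # 0.0  -> perfect reliability means no measurement error
print(sem(12, 0.0))   # 12.0 -> zero reliability means SEM equals the SD
```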

What is validity?

-a test's accuracy

What is the multitrait-multimethod matrix?

-used to systematically organize the data collected when assessing a test's convergent and discriminant validity -it is a table of correlation coefficients -provides info about the degree of association between two or more traits that have each been assessed using 2 or more methods -when the correlations between different methods measuring the same trait are larger than the correlations between the same and different methods measuring different traits, the matrix provides evidence of the test's convergent and discriminant validity

What is the kappa statistic (k)?

-a chance-corrected measure of inter-rater agreement -used when scores or ratings represent a nominal or ordinal scale of measurement

When is a test valid?

-when it measures what it is intended to measure

What are factor loadings?

-correlation coefficients that indicate the degree of association between each test and each factor -one way to interpret a factor loading is to square it to determine the amount of variability in test scores that is accounted for by the factor

How do you select the method of estimating reliability?

-depends on the nature of the test -each method entails different procedures and is affected by different sources of error -for many tests, more than one method should be used

Which of the following methods for evaluating reliability is most appropriate for speed tests?

Coefficient of equivalence -internal consistency methods produce spuriously high coefficients for speed tests, so alternate forms reliability (the coefficient of equivalence) is preferred

To assess the internal consistency reliability of a test that contains 50 items that are each scored as either "correct" or "incorrect," you would use which of the following?

KR-20 -The Kuder-Richardson formula (KR-20) is a measure of internal consistency reliability that can be used when test items are scored dichotomously (correct or incorrect).
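A hedged Python sketch of the KR-20 computation; the data matrix is hypothetical, and sample variance (ddof=1) is assumed:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 for dichotomously scored (0/1) items.
    `scores` is an (examinees x items) array."""
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1 - p
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical data: 5 examinees x 4 items (1 = correct, 0 = incorrect)
data = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 0, 0, 0]])
print(round(kr20(data), 2))  # ~0.91 for this toy sample
```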

To estimate the effects of lengthening a 50-item test to 100 items on the test's reliability, you would use which of the following?

Spearman-Brown formula

What is alternate (equivalent, parallel) forms reliability?

-2 equivalent forms of the test are administered to the same group of examinees and the 2 sets of scores are correlated -it indicates the consistency of responding to different item samples (the two test forms) and, when the forms are administered at different times, the consistency of responding over time -it is also called the coefficient of equivalence and stability when a relatively long period of time separates the administration of the two forms

Which framework is best?

-neither is best for all purposes -classical test theory (CTT) has a longer history than item response theory (IRT), and its theoretical assumptions are weaker because they can be easily met by most test data -CTT focuses more on test-level than item-level info and is most useful when the size of the sample being used to develop the test is small -the underlying assumptions of IRT are more stringent -IRT focuses on item-level info and requires a larger sample

What is content validity?

-a test has it if it adequately samples the content or behavior domain that it is designed to measure -if test items are not a good sample, the results of testing will be misleading -it is sometimes used to establish the validity of personality, aptitude, and attitude tests, but it is most associated with achievement-type tests that measure knowledge of one or more content domains and with tests designed to assess a well-defined behavior domain -adequate content validity is important for a statistics test and for a work sample test

What is a construct?

-abstract characteristic that cannot be observed directly but must be inferred by observing its effects -intelligence, mechanical aptitude, self-esteem, neuroticism

What is the coefficient alpha?

-administering the test once to a single group of examinees -the test is not split, but a special formula is used to determine the average degree of inter-item consistency -it is the average reliability that would be obtained from all possible splits of the test -tends to be conservative and can be considered the lower boundary of a test's reliability -when test items are scored dichotomously, a variation called the Kuder-Richardson Formula 20 (KR-20) is used
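A short Python sketch of coefficient alpha; the score matrix is hypothetical and sample variance is assumed. For dichotomously scored items this reduces to the KR-20 shown earlier:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 5-point ratings from 6 examinees on 3 items
ratings = np.array([[4, 5, 4],
                    [2, 3, 2],
                    [5, 5, 4],
                    [1, 2, 1],
                    [3, 3, 3],
                    [4, 4, 5]])
print(round(cronbach_alpha(ratings), 2))  # ~0.96 for this toy sample
```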

When is the coefficient of concordance used?

-also known as Kendall's coefficient of concordance -used to assess inter-rater reliability when there are three or more raters and ratings are reported as ranks

When is the test-retest reliability method used?

-appropriate for determining the reliability of tests designed to measure attributes that are relatively stable over time and that are not affected by repeated measurement ex: a test of aptitude, which is a stable characteristic, but not a test of mood, since mood fluctuates over time, or a test of creativity, which might be affected by previous exposure

How does guessing impact the reliability?

-as the probability of correctly guessing answers increases, the reliability coefficient decreases -a true/false test will have a lower reliability coefficient than a 4-alternative multiple-choice test, which in turn will have a lower reliability coefficient than a free recall test

What is information obtained in factor analysis used for?

-assessing a test's construct validity -a test is considered to have construct (factorial) validity when it has high correlations with the factors that it would be expected to correlate with and low correlations with factors it would not be expected to correlate with -when employed as a method for assessing construct validity, factor analysis is another way to obtain info about convergent and discriminant validity

What is percent agreement?

-calculated by dividing the number of items or observations on which raters are in agreement by the total number of items or observations ex: if 2 raters assign the same ratings to 40 of 50 behavioral observations, the percent agreement is 80% -easy to calculate and interpret, but it can lead to erroneous conclusions because it does not take into account the level of agreement that would have occurred by chance alone -this is a particular problem for behavioral observation scales that require raters to record the frequency of a specific behavior -the degree of chance agreement is high whenever the behavior has a high rate of occurrence, and percent agreement will then provide an inflated estimate of the measure's reliability
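A tiny Python sketch of percent agreement; the rater lists are hypothetical, chosen only to reproduce the 40-of-50 example above:

```python
def percent_agreement(ratings_a, ratings_b):
    """Proportion of observations on which two raters assign the same rating."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Two raters agree on 40 of 50 behavioral observations -> 80%
rater_a = [1] * 50
rater_b = [1] * 40 + [0] * 10
print(percent_agreement(rater_a, rater_b))  # 0.8
```

Note that this statistic makes no correction for chance agreement; the kappa statistic (sketched further below) does.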

What are the 2 measurement frameworks most commonly used to construct and evaluate tests?

-classical test theory and item response theory

What are monotrait-heteromethod coefficients (same trait-different methods)?

-coefficients that indicate the correlation between different measures of the same trait -when these coefficients are large, they provide evidence of convergent validity

What are heterotrait-monomethod coefficients (different traits-same method)?

-coefficients that show the correlation between different traits that have been measured by the same method -when these coefficients are small, they provide an indication of discriminant validity

What is communality?

-common variance, or the amount of variability in test scores that is due to the factors that the test shares in common with the other tests -a test's communality indicates the total amount of variability in test scores that is accounted for by all the identified factors

What is factor analysis?

-conducted to identify the minimum number of common factors required to account for the intercorrelations among a set of tests, subtests, or items

What is an item characteristic curve (ICC)?

-constructed for each item by plotting the proportion of examinees in the sample who answered it correctly against either the total test score, performance on an external criterion, or a mathematically derived estimate of the latent ability or trait being measured by the item -the curve provides info on the relationship between an examinee's level on the ability or trait and the probability that he or she will respond to the item correctly -various IRT models produce ICCs that provide info on either 1, 2, or 3 parameters

Content validity vs. face validity

-content validity: systematic evaluation of a test by experts who determine whether or not test items adequately sample the relevant domain -face validity: whether or not a test looks like it measures what it is intended to measure -face validity is not an actual type of validity, but it is a desirable feature for tests -if a test lacks face validity, examinees might not be motivated to respond to items in an honest way -a high degree of face validity does not mean a test has content validity

What are heterotrait-heteromethod coefficients (different traits-different methods)?

-the correlations between different traits measured by different methods -provide evidence of discriminant validity when small

What is a reliability coefficient?

-a correlation coefficient that ranges from 0.0 to +1.0 -when a test's reliability coefficient is 0.0, this means that all variability in obtained test scores is due to measurement error -when a test's reliability coefficient is +1.0, this indicates that all variability in scores reflects true score variability -symbolized with the letter r and a subscript that contains two of the same letters or numbers (rxx) -the subscript indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure

How do you interpret the reliability coefficient?

-directly, as the proportion of variability in a set of test scores that is attributable to true score variability ex: a reliability coefficient of .84 indicates that 84% of variability in test scores is due to true score differences, while the remaining 16% is due to measurement error -for most tests, .80 or larger is acceptable

What do you have to consider when interpreting test reliability?

-the effects on scores achieved by a group of examinees -the score obtained by a single examinee

What is the test-retest reliability method?

-estimates reliability by administering the same test to the same group of examinees on 2 different occasions and then correlating the two sets of scores -the reliability coefficient indicates the degree of stability of examinees' scores over time -also known as the coefficient of stability -the primary sources of measurement error for test-retest reliability are any random factors related to the time that passes between the two administrations -these time sampling factors include random fluctuations in examinees over time and random variations in the testing situation

What is item discrimination?

-the extent to which a test item discriminates between examinees who obtain high vs. low scores on the entire test -measured with the item discrimination index (D) -requires identifying the examinees in the sample who obtained the highest and lowest scores on the test (often the upper 27% and lower 27%) and, for each item, subtracting the percent of examinees in the lower-scoring group (L) from the percent of examinees in the upper-scoring group (U) who answered the item correctly: D = U - L -the item discrimination index ranges from -1.0 to +1.0 -if all examinees in the upper group and none in the lower group answer the item correctly, D is equal to +1.0 -if none of the examinees in the upper group and all examinees in the lower group answer the item correctly, D = -1.0 -an item with a discrimination index of .35 or higher is considered acceptable -items with moderate difficulty levels (around .50) have the greatest potential for maximum discrimination
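A small Python sketch of D = U - L; the group sizes and counts are hypothetical:

```python
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D = U - L: proportion answering correctly in the upper-scoring group
    minus the proportion answering correctly in the lower-scoring group."""
    return upper_correct / upper_n - lower_correct / lower_n

# Everyone in the upper group and no one in the lower group answers correctly
print(discrimination_index(27, 27, 0, 27))   # +1.0
# The reverse pattern
print(discrimination_index(0, 27, 27, 27))   # -1.0
# A more typical item
print(discrimination_index(22, 27, 12, 27))  # ~0.37, acceptable (>= .35)
```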

What are sources of error for inter-rater reliability?

-factors related to the raters, such as lack of motivation and rater biases, and characteristics of the measuring device -inter-rater reliability will be low when rating categories are not exhaustive and/or not mutually exclusive -may be affected by consensual observer drift: when 2 observers working together influence each other's ratings so that they both assign ratings in a similar way -consensual drift tends to artificially inflate inter-rater reliability -the reliability of ratings can be improved in several ways, but the best way is to provide training for raters and periodic retraining

What is a confidence interval?

-helps a test user estimate the range within which an examinee's true score is likely to fall given his or her obtained score

What is convergent validity?

-high correlations with measures of the same or related traits provide evidence of this

How is an item's level of difficulty indicated on the curve?

-indicated by the ability level at which 50% of examinees in the sample provided correct answers -the difficulty level for the item depicted here is 0, which corresponds to an average ability level and indicates the item is of moderate difficulty (a person with average ability has a 50/50 shot of answering it right)

How is the probability of guessing indicated on the curve?

-indicated by the point at which the ICC intercepts the vertical axis -the curve here indicates that there is a low probability of guessing correctly for this item, since only a small proportion of examinees with very low ability answered it correctly

How is the reliability coefficient interpreted?

-interpreted directly as the proportion of variability in obtained test scores that reflects true score variability ex: a reliability coefficient of .84 indicates that 84% of variability in scores is due to true differences among examinees, while the remaining 16% is due to measurement error

What is the primary source of measurement error in the alternate forms reliability?

-content sampling, or error introduced by an interaction between examinees' knowledge and the different content assessed by the items included in the two forms -when administration of the two forms is separated by a period of time, time sampling factors also contribute to error

What is inter-rater reliability?

-of concern whenever test scores depend on raters' judgment -a test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate inter-rater reliability -assessed by calculating a correlation coefficient or the percent agreement for the scores or ratings assigned by two or more raters

How is the standard error interpreted?

-it can be interpreted in terms of the areas under the normal curve -with regard to confidence intervals, this means that a 68% confidence interval is constructed by adding and subtracting one standard error to/from an examinee's obtained score, a 95% confidence interval is constructed by adding and subtracting two standard errors, and a 99% confidence interval is constructed by adding and subtracting three standard errors
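A brief Python sketch of these intervals, reusing the aptitude-test example from an earlier card (score 450, SEM 50); the multiplier per confidence level follows the rule of thumb stated above:

```python
def confidence_interval(obtained: float, sem: float, n_errors: float):
    """Interval built by adding and subtracting a multiple of the SEM."""
    return obtained - n_errors * sem, obtained + n_errors * sem

print(confidence_interval(450, 50, 1))  # (400, 500): 68% interval
print(confidence_interval(450, 50, 2))  # (350, 550): 95% interval
print(confidence_interval(450, 50, 3))  # (300, 600): 99% interval
```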

What are the limitations of CTT?

-item and test parameters are sample-dependent -item difficulty and item discrimination indices, the reliability coefficient, and other measures derived from CTT are likely to vary from sample to sample -it is difficult to equate scores obtained on different tests that have been developed on the basis of CTT ex: a total score of 50 on an English test does not necessarily mean the same thing as a score of 50 on a math test or on a different English test

How is the item's ability to discriminate indicated on the curve?

-the item's ability to discriminate between high and low achievers is indicated by the slope of the curve -the steeper the curve, the greater the discrimination -the item here has good discrimination: it indicates that examinees with low ability (below 0) are more likely to answer incorrectly, while those above 0 are more likely to answer correctly

Which correlation coefficients are used to assess inter-rater reliability?

-kappa statistic -coefficient of concordance

What affects the magnitude of the reliability coefficient?

-the length of the test, the range of scores, and the probability that the correct response to items can be selected by guessing

What affects the optimal difficulty level of an item?

-the likelihood that the examinee can choose the correct answer by guessing, with the preferred level being halfway between 1.0 and the level of success expected by chance alone ex: for true/false items, the probability of obtaining a correct answer by chance alone is .50 -if the goal is to choose a certain number of examinees, the optimal difficulty level corresponds to the proportion of examinees to be selected ex: for a grad admission test, if only 15% of applicants are to be admitted, items will be chosen so that the average difficulty level for all items included in the test will be .15
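A one-function Python sketch of the "halfway between 1.0 and chance" rule; the multiple-choice value is an extra illustration rather than a value from the card:

```python
def optimal_difficulty(chance_level: float) -> float:
    """Preferred item difficulty: halfway between 1.0 and the
    probability of answering correctly by chance alone."""
    return (1.0 + chance_level) / 2

print(optimal_difficulty(0.50))  # 0.75 for true/false items
print(optimal_difficulty(0.25))  # 0.625 for 4-alternative multiple-choice items
print(optimal_difficulty(0.0))   # 0.50 when guessing is not a factor
```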

What is discriminant validity?

-low correlations with measures of unrelated traits provides evidence for this

What is the error component (E)?

-measurement error

How does one establish a test's construct validity?

-there is no single way -it requires a systematic accumulation of evidence showing that the test actually measures the construct it was designed to measure
1) assessing the test's internal consistency: do scores on individual test items correlate highly with the total score? are all test items measuring the same construct?
2) studying group differences: do scores on the test accurately distinguish between people who are known to have different levels of the construct?
3) conducting research to test hypotheses about the construct: do test scores change following an experimental manipulation in the direction predicted by the theory underlying the construct?
4) assessing the test's convergent and discriminant validity: does the test have high correlations with measures of the same and related traits and low correlations with measures of unrelated traits?
5) assessing the test's factorial validity: does the test have the factorial composition it would be expected to have?

When can you use alternate forms reliability?

-not appropriate to use when the attribute measured by the test is likely to fluctuate over time and the forms will be administered at different times, or when scores are likely to be affected by repeated measurement -although some say it is the most rigorous method for estimating reliability, it is not often assessed due to the difficulty of developing forms that are truly equivalent

What is item difficulty?

-the percentage of people who answer the item correctly (p) -p = (total number of examinees passing the item) / (total number of examinees) -the value of p ranges from 0 to 1.0, with larger values indicating an easier item -when p is 0, none of the examinees in the sample answered the item correctly -when p is equal to 1.0, all examinees answered it correctly

What is the Spearman Brown prophecy formula?

-provides an estimate of what the reliability coefficient would have been based on the full length of the test

What is reliability?

-provides us with an estimate of the proportion of variability in examinees' obtained scores that is due to true differences among examinees on the attribute measured by the test -when a test is reliable, it provides dependable, consistent results -the term consistency is a synonym for reliability

What is measurement error?

-random error -due to factors that are irrelevant to what is being measured by the test -has an unpredictable effect on an examinee's test score ex: the score you obtain on a licensing exam (X) is likely due to both the knowledge you have about topics addressed by exam items (T) and the effects of random factors (E), such as the way test items are written, any alterations in anxiety, attention, or motivation while taking the test, and the accuracy of your educated guesses

What is a true score component (T)?

-reflects the examinee's status with regard to the attribute that is measured by the test

How do you interpret the standard error of measurement?

-the reliability coefficient is good for estimating true score variability but not so good for interpreting an individual score -when a person receives a score of 80 on a 100-item test that has a reliability coefficient of .84, we can only conclude, since the test isn't perfectly reliable, that the obtained score might or might not be the true score

What is monotrait-monomethod coefficients (same trait-same method)?

-reliability coefficients -indicate the correlation between a measure and itself -although not directly relevant to a test's convergent and discriminant validity, they should be large in order for the matrix to provide useful info

How does range of scores impact reliability?

-since the reliability coefficient is a correlation coefficient, it is maximized when the range of scores is unrestricted -the range is directly affected by the degree of similarity of examinees with regard to the attribute measured by the test -when examinees are heterogeneous, the range of scores is maximized -the range is also affected by the difficulty level of test items -when all items are either very difficult or very easy, all examinees will obtain either low or high scores, resulting in a restricted range -the best strategy is to choose items so that the average difficulty level is in the mid-range (.50)

Do you need to demonstrate all validities?

-for some tests only one type of validity is needed, but for others more than one type is required ex: if a math achievement test will be used to assess the classroom learning of 8th grade students, establishing the test's content validity would be sufficient -if the same test will be used to predict the performance of 8th grade students in an advanced high school math class, content and criterion-related validity would both need to be established

Standard Error of the Difference Between Two Scores?

-a difference score is sometimes calculated to compare performance on two different tests, an examinee's performance on the same test when it is administered on two different occasions, or the performance of two different examinees on the same test -each test score has some degree of measurement error, and a difference score contains measurement error from both test scores, so it must be interpreted with caution -the standard error of the difference is used to help determine whether a difference score is significant and is calculated by adding the square of the SEM of the first score to the square of the SEM of the second score and taking the square root of the result -interpreted in terms of areas under the normal curve
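A minimal Python sketch of the calculation; the two SEM values are hypothetical:

```python
import math

def se_difference(sem_1: float, sem_2: float) -> float:
    """Standard error of the difference between two scores:
    the square root of (SEM1 squared + SEM2 squared)."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)

print(se_difference(3.0, 4.0))  # 5.0 for hypothetical SEMs of 3 and 4
```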

What does test construction consist of?

-specifying the purpose of the test -generating test items -administering the items to a sample of people for the purpose of item analysis -evaluating the test's reliability and validity -establishing norms

Why is content sampling a source of error?

-split-half: error resulting from differences between the content of the 2 halves (the items on one half may be a better fit for a given examinee) -coefficient alpha: content sampling refers to differences between individual test items rather than between test halves ex: a test is heterogeneous with regard to content domain when its items measure several different domains of knowledge or behavior -the greater the heterogeneity of the content domain, the lower the inter-item correlations and the lower the magnitude of coefficient alpha -coefficient alpha could be expected to be smaller for a 200-item test that contains items assessing knowledge of many different areas of psychology than for a 200-item test on test construction only

What are the 2 measures of reliability used to measure the internal consistency of a test?

-split-half reliability and coefficient alpha both involve administering the test once to a single group of examinees, and both yield a reliability coefficient that is known as the coefficient of internal consistency

How do you calculate a confidence interval?

-using the standard error of measurement (SEM), an index of the amount of error that can be expected in obtained scores due to the unreliability of the test -when raw scores have been converted to percentile ranks, the confidence interval is referred to as a percentile band

How do you determine the split-half reliability of a test?

-the test is split into equal halves so that each examinee has two scores, and the scores on the two halves are then correlated -tests can be split in several ways, but the most common way is to divide the test on the basis of odd- vs. even-numbered items -a problem with this method is that it produces a reliability coefficient based on test scores derived from one-half of the entire length of the test -if a test contains 30 items, each score is based on 15 items -because reliability tends to decrease as the length of a test decreases, the split-half coefficient usually underestimates a test's true reliability -the split-half reliability coefficient is therefore corrected using the Spearman-Brown prophecy formula
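A hedged Python sketch of the odd/even split plus the Spearman-Brown correction; the score matrix is assumed to be laid out as examinees x items:

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    """Correlate odd- vs. even-item half scores, then apply the
    Spearman-Brown correction for the full test length."""
    odd_half = items[:, 0::2].sum(axis=1)
    even_half = items[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_half / (1 + r_half)   # corrected (full-length) estimate
```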

What are advantages of the Item Response Theory (IRT)?

-the item characteristics (parameters) derived from IRT are considered to be sample invariant, i.e., they are the same across different samples -because test scores are reported in terms of an examinee's level on the trait being measured (rather than in terms of total test score), it is possible to equate scores from different sets of items and from different tests -its use makes it easier to develop computer-adaptive tests, in which the administration of subsequent items is based on the examinee's performance on previous items

How does test length impact the reliability coefficient?

-the larger the sample of the attribute being measured by a test, the smaller the relative effects of measurement error and the more likely the sample will provide dependable, consistent info -the longer the test, the larger the test's reliability coefficient

When should you use a measure of internal consistency reliability?

-useful when the test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test -not appropriate for assessing the reliability of speed tests because these methods tend to produce spuriously high coefficients for speed tests -for speed tests, alternate forms reliability is best

Is content validity built into a test?

-usually, it is built in as the test is constructed through a systematic, logical, and qualitative process -clearly identifying the content or behavior domain to be sampled -writing or selecting items that represent that domain -once a test has been developed, the establishment of content validity relies primarily on the judgment of subject matter experts -when experts agree that test items are an adequate and representative sample of the target domain, this provides evidence of content validity -methods for quantifying the judgments of subject matter experts are available ex: the content validity ratio (CVR), which is based on the number of subject matter experts who rated an item as essential to defining the content domain of interest relative to the total number of raters

What is construct validity?

-when a test is found to measure the hypothetical trait (construct) it is intended to measure, it has this

What is an example of Standard Error of Measurement?

-you administer an interpersonal assertiveness test to a sales applicant who receives a score of 80 -since the test's reliability is less than 1.0, you know that this score might be an imprecise estimate of the applicant's true score and decide to use the standard error of measurement to construct a 95% confidence interval -the test's reliability is .84 and its SD is 10, so the standard error of measurement is equal to 4 -you construct a 95% confidence interval by adding and subtracting two standard errors from the obtained score: 80 ± 2(4.0) = 72 to 88

What can the Spearman-Brown prophecy formula be used for in terms of the length of tests?

-you can use it to estimate the effects of lengthening or shortening a test on its reliability coefficient ex: if a 100-item test has a reliability coefficient of .84, the Spearman-Brown formula could be used to estimate the effects of increasing the number of items to 150 or reducing it to 50 -a problem is that it does not always yield an accurate estimate of reliability -it tends to overestimate a test's true reliability -this is most likely the case when the added items do not measure the same content domain as the original items and/or are more susceptible to measurement error
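A minimal Python sketch of the general Spearman-Brown formula, reproducing the 100-item example above (the printed outputs are approximate):

```python
def spearman_brown(r_original: float, length_factor: float) -> float:
    """Estimated reliability after lengthening (or shortening) a test by
    length_factor = new number of items / original number of items."""
    return (length_factor * r_original) / (1 + (length_factor - 1) * r_original)

print(round(spearman_brown(0.84, 1.5), 2))  # ~0.89 when 100 items grow to 150
print(round(spearman_brown(0.84, 0.5), 2))  # ~0.72 when 100 items shrink to 50
```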

The optimal item difficulty index (p) for items included in a true/false test is:

.75. -One factor that affects the optimal difficulty level of an item is the likelihood that an examinee can choose the correct answer by guessing, with the preferred level being halfway between 100% and the level of success expected by chance alone. -For true or false items, the probability of obtaining a correct answer by chance alone is .50. -Therefore, the optimal difficulty level for true or false items is .75, which is halfway between 1.0 and .50.

The item difficulty index (p) ranges in value from:

0 to +1.0 -The item difficulty index (p) indicates the proportion of examinees in the tryout sample who answered the item correctly. -The item difficulty index ranges in value from 0 to +1.0, with 0 indicating that none of the examinees answered the item correctly and +1.0 indicating that all examinees answered the item correctly.

What are the steps to factor analysis?

1) administer several tests to a group of examinees -some tests should measure the construct of interest and some should not
2) correlate scores on each test with scores on every other test to obtain a correlation (R) matrix -high correlations between tests: measuring the same construct -low correlations between tests: measuring different constructs -the pattern of correlations among the tests indicates how many factors should be extracted in the analysis
3) using one of several factor analytic techniques, convert the correlation matrix to a factor matrix -the factor matrix contains correlation coefficients (factor loadings), which indicate the degree of association between each test and each factor
4) simplify the interpretation of the factors by rotating them -the pattern of factor loadings in the original factor matrix is usually difficult to interpret, so the factors are rotated to obtain a clearer pattern of correlations -rotation can produce either orthogonal or oblique factors
5) interpret and name the factors in the rotated factor matrix -appropriate labels for the factors are determined by examining the tests that do and do not correlate with each factor -when factor analysis is being conducted to assess a test's construct validity, validity is demonstrated when the test has high correlations with the factors it would be expected to correlate with and low correlations with factors it is not expected to correlate with
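A rough Python sketch of steps 2 and 3 on simulated data; principal-components extraction is used here only as a stand-in for the factor-analytic techniques the card mentions, and no rotation is applied:

```python
import numpy as np

# Hypothetical scores for 200 examinees on 4 tests (examinees x tests)
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))

# Step 2: correlation (R) matrix among the tests
R = np.corrcoef(scores, rowvar=False)

# Step 3: extract factors from R (principal-components extraction shown here)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]                      # largest factors first
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

# Factor loadings are correlations between each test and each factor;
# squaring a loading gives the variance in that test explained by the factor
print(np.round(loadings[:, :2], 2))
```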

What are intended uses for tests?

1) test will be used to obtain info about an examinee's familiarity with a particular content or behavior: content validity 2) test will be used to determine the extent to which an examinee possesses a particular hypothetical trait: construct validity 3) test will be used to estimate or predict an examinee's standing or performance on an external criterion: criterion related validity

A researcher correlates scores on two alternate forms of an achievement test and obtains a correlation coefficient of .80. This means that ___% of observed test score variability reflects true score variability.

80 -A reliability coefficient is interpreted directly as a measure of true score variability - a correlation coefficient of .80 would equate to 80%.

Is construct validity the most theory laden of the methods of test validation?

Yes. -the developer of a test designed to measure a construct begins with a theory about the nature of the construct, which then guides the developer in selecting items and in choosing methods for establishing validity ex: if the developer of a creativity test believes that creativity is unrelated to intelligence, that creativity is an innate characteristic that cannot be learned, and that creative people can be expected to generate more alternative solutions to certain problems than non-creative people, she would want to determine the correlation between scores on the creativity test and a measure of intelligence, see if a course in creativity affects test scores, and find out if test scores distinguish between groups

For many tests, are items with a moderate level of difficulty retained (p close to .50)?

Yes. -useful because it increases test score variability -helps ensure scores will be normally distributed -provides maximum discrimination between examinees -helps maximize the test's reliability

The kappa statistic for a test is .95. This means that the test has:

adequate inter-rater reliability. -The kappa statistic (coefficient) is a measure of inter-rater reliability. -The reliability coefficient ranges in value from 0 to +1.0. Therefore, a kappa statistic of .95 indicates a high degree of inter-rater reliability.
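A small Python sketch of Cohen's kappa for two raters; the rating lists are hypothetical:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement for two raters: (po - pe) / (1 - pe)."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n    # observed agreement
    pe = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)  # chance agreement
             for c in categories)
    return (po - pe) / (1 - pe)

a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
b = ["yes", "yes", "no", "no", "no", "yes", "yes", "yes"]
print(round(cohens_kappa(a, b), 2))  # ~0.47 for these hypothetical ratings
```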

To determine a test's internal consistency reliability by calculating coefficient alpha, you would:

administer the test to a single sample of examinees one time -Determining internal consistency reliability with coefficient alpha involves administering the test once to a single sample of examinees and using a formula to determine the average degree of inter-item consistency.

A problem with using percent agreement as a measure of inter-rater reliability is that it doesn't take into account the effects of:

chance agreement among raters.

What are the 2 methods of item analysis for CTT?

item difficulty and item discrimination

Content appropriateness, taxonomic level, and extraneous abilities are factors that are considered when evaluating:

the relevance of test items. -relevance refers to the extent to which test items contribute to achieving the goal of testing. -Content appropriateness, taxonomic level, and extraneous abilities are three factors that may be considered when determining the relevance of test items.

According to classical test theory, total variability in obtained test scores is composed of:

true score variability plus random error.

