Test and Measurements Unit 4 Quiz

What are the three approaches to the estimation of reliability?

1) test-retest 2) alternate or parallel forms 3) internal or inter-item consistency

A reliability estimate of a speed test should be based on performance from two independent testing periods using what?

1) test-retest reliability 2) alternate-forms reliability 3) split-half reliability from two separately timed half tests. * If a split-half procedure is used, the obtained reliability coefficient is for a half test and should be adjusted upward using the Spearman-Brown formula

Norms

Behavior or performance that is usual, average, normal, standard, expected, or typical. Normative data: the test performance data of a group of test takers, designed for use as a reference when interpreting, or otherwise placing in context, individual test scores.

Norm-Referenced Test

Compares the results of one person to others to see where they stand in relation to the mean and SD; typically uses percentiles. Useful primarily when we need to compare individuals with one another or with a reference group in order to evaluate differences between them on the characteristic the test measures. Ex: career and personality inventories

Grade norms

Derived by locating the performance of test takers within the norms of the students at each grade level - and fractions of grade levels - in the standardization sample. Can be misleading

Criterion-Referenced Test

Describes the specific types of skills, tasks, or knowledge that the test taker can demonstrate (the knowledge we want the person to have). Contrasts with a norm-referenced test. Mostly applied in educational settings, with cutoff scores set to determine pass/fail. Ex: driver's license test, college exam

random error

Is caused by any factors that randomly affect measurement of the variable across the sample. It does not have any consistent effects across the entire sample. It adds variability to the data but does not affect average performance for the group. Examples that may cause this are stress, guessing, external distractions, subjective scoring, etc.

systematic error

Is caused by any factors that systematically affect measurement of the variable across the sample. ________ ________ tend to be consistently either positive or negative; because of this, systematic error is sometimes considered to be bias in measurement. It does affect the group average.

Characteristics of Errors

Mean error of measurement = 0; true scores and error are uncorrelated (r_te = 0); the standard deviation of the error is greater than zero (SD_e > 0)

Internal reliability

Reflects the consistency of the test items by measuring item homogeneity/heterogeneity: do the items overlap and test the same measure? The longer the test, the higher its ________ ________, but only if all the questions measure the same trait. Common methods for determining internal reliability: the split-half method and internal consistency methods (Cronbach's coefficient alpha; Kuder-Richardson Formula 20)

What is split-half reliability?

Correlating pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice. - a reliability estimate is based on the correlation between scores on the two halves of the test and is then adjusted using the Spearman-Brown formula
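
As a concrete illustration (not from the study set itself), the sketch below estimates split-half reliability from a single administration and applies the Spearman-Brown correction; the array name and the odd/even split are assumptions.

```python
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Split-half reliability with the Spearman-Brown correction.

    `scores` is an (examinees x items) array from one test administration.
    """
    odd_total = scores[:, 0::2].sum(axis=1)    # total score on odd-numbered items
    even_total = scores[:, 1::2].sum(axis=1)   # total score on even-numbered items
    r_half = np.corrcoef(odd_total, even_total)[0, 1]  # half-test correlation
    return 2 * r_half / (1 + r_half)           # Spearman-Brown adjustment

# Hypothetical right/wrong responses: 5 examinees x 4 items
responses = np.array([[1, 1, 1, 0],
                      [1, 0, 1, 0],
                      [0, 0, 1, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 0]])
print(split_half_reliability(responses))   # 0.72
```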

universe

details of the particular test situation; described in terms of facets

equating alternate forms (same examinees)

Distribute both forms of the test randomly to a large representative sample of examinees. Generate the descriptive statistics (mean, SD) on both forms of the test. Equate a raw score from one test to the scale score of the second test (using z scores).
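
A minimal sketch of the z-score step described above (the function name and numbers are hypothetical, not from the study set):

```python
def equate_z(raw_a: float, mean_a: float, sd_a: float,
             mean_b: float, sd_b: float) -> float:
    """Linear (z-score) equating of a Form A raw score onto the Form B scale."""
    z = (raw_a - mean_a) / sd_a    # standardize the Form A score
    return mean_b + z * sd_b       # express it on the Form B scale

# A raw score of 42 on Form A (mean 40, SD 5) maps onto a Form B scale (mean 50, SD 8)
print(equate_z(42, 40, 5, 50, 8))  # 53.2
```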

generalizability study

examines how generalizable scores from a particular test are if the test is administered in different situations; examines how much of an impact different facets of the universe have on the test score

item response theory

focuses on the extent to which individual test items are useful in evaluating individuals presumed to possess various amounts of a particular trait or ability

alternate forms

forms that are different versions of a test that have been constructed so as to be parallel, although they do not meet the requirements for the legitimate designation of parallel forms

equating alternate forms (different examinees)

give a new sample of examinees a selection of items from the old test together with the new test. The use of an anchor test allows us to separate differences among examinees from differences between the test forms.

What is inflation of range?

if the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher

restriction of range or restriction of variance

if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower

inflation of range or inflation of variance

if the variance of either variable in a correlational analysis is inflated by the sampling procedure used, then the resulting correlation coefficient tends to be higher

What is restriction of range?

if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower

facets

includes things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration

What is a static characteristic?

a trait such as intelligence: the obtained measurement would not be expected to vary significantly as a function of time, so either the test-retest or the alternate-forms method would be appropriate

What is methodological error?

interviewers may not have been trained properly, the wording may have been ambiguous, or the items may have somehow been biased

item sampling (or content sampling)

one source of variance during test construction; refers to variation among items within a test as well as to variation among items between tests

What are examiner-related variables?

potential sources of error variance. The examiner's physical appearance and demeanor are some factors for consideration here. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions.

What are testtaker variables?

pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medications. A test taker may make a mistake in entering a test response

inter-item consistency

refers to the degree of correlation among all the items on a scale; calculated from a single administration of a single form of a test

coefficient of stability

refers to the estimate of test-retest reliability when the interval between testings is greater than 6 months

coefficients of generalizability

represents the influence of particular facets on the test score

coefficient of inter-scorer reliability

simplest way of determining the degree of consistency among scorers in the scoring of a test

Kuder Richardson formula 20 (KR-20)

stat of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong; if test items are more heterogeneous, this will yield lower reliability estimates than the split-half method
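
A minimal sketch of the KR-20 computation under its standard formula, assuming `items` is a 0/1 (wrong/right) examinee-by-item matrix; the names are hypothetical:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson Formula 20 for dichotomously scored (0/1) items."""
    k = items.shape[1]                        # number of items
    p = items.mean(axis=0)                    # proportion answering each item correctly
    item_error = (p * (1 - p)).sum()          # sum of item variances (p * q)
    total_var = items.sum(axis=1).var()       # variance of total test scores (N denominator)
    return (k / (k - 1)) * (1 - item_error / total_var)
```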

variance

stat useful in describing sources of test score variability; equals the standard deviation squared

What are the sources of error variance?

test construction, administration, scoring, and/or interpretation

criterion-referenced test

test designed to provide an indication of where a testtaker stands with respect to some criterion such as an educational or a vocational objective; tends to contain material that has been mastered in hierarchical fashion

homogeneity

tests that contain items that measure a single trait; the degree to which a test measures a factor; the extent to which items in a scale are unifactorial

What is inter-scorer reliability?

the degree of agreement or consistency between two or more scorers with regard to a particular measure. * inter-scorer reliability is often used when coding nonverbal behavior

inter-scorer reliability

the degree of agreement or consistency between two or more scorers

What is inter-item consistency?

the degree of correlation among all items on a scale. Calculated from a single administration of a single form of a test - useful in assessing homogeneity of the test.

What is the coefficient of equivalence?

the degree of the relationship between various forms of a test; can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability

coefficient of equivalence

the degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability

heterogeneity

the degree to which a test measures different factors

What is sampling error?

the extent to which the sample (e.g., the voters polled in a study) actually was representative of the population (e.g., voters in the election). Researchers may have gotten the factors right but may not have included enough people in their sample to support the conclusions they drew.

coefficient alpha

the mean of all possible split-half correlations, corrected by the Spearman-Brown formula

What is the coefficient alpha?

the preferred statistic for obtaining an estimate of internal consistency reliability. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.
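
For reference, the standard form of coefficient alpha, where k is the number of items, sigma_i^2 the variance of item i, and sigma_X^2 the variance of total test scores:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)
```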

reliability

the proportion of the total variance attributed to true variance

What is reliability?

the proportion of the total variance attributed to true variance. The greater the proportion of total variance attributed to true variance, the more reliable the test.

reliability coefficient

the proportion of variance in test scores that is due to, or accounted for by, variability in true scores: rxx = S_true (squared) / S_observed score (squared)
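
The classical test theory decomposition behind this ratio, in standard notation:

```latex
\sigma_X^{2} = \sigma_T^{2} + \sigma_E^{2},
\qquad
r_{xx} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = 1 - \frac{\sigma_E^{2}}{\sigma_X^{2}}
```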

What is the coefficient of inter-scorer reliability?

the simplest way to determine the degree of consistency among scorers in the scoring of a test

What is the Kuder-Richardson formula 20?

the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong. * If the test is heterogeneous, KR-20 will yield lower reliability estimates than the split-half method

What is classical test theory (CTT)?

the true score (or classical) model of measurement. It is the most widely used and accepted model in the psychometric literature today.

parallel forms

these forms of a test exist when, for each form of the test, the means and variances of observed test scores are equal

The standard error of measurement (standard error of a score)

tool used to estimate or infer the extent to which an observed score deviates from a true score; the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests

static characteristic

trait, state, or ability presumed to be relatively unchanging

split half method

Method used to measure internal reliability. Administered to the same applicants in a single administration. The test is split in half and each half is correlated with the other. Splitting the test in two is as if we changed one long test into two shorter tests (thereby decreasing reliability, since shorter tests are less reliable). To correct for this, the Spearman-Brown prophecy formula is used to adjust the correlation.

error variance

variance from irrelevant random sources

What is error variance?

variance from irrelevant, random sources

What is true variance?

variance from true differences

true variance

variance from true differences

power test

when the time limit is long enough to allow testtakers to attempt all items, and some items are so difficult that no testtaker is able to obtain a perfect score

true score theory

when one seeks to estimate the portion of a test score that is attributable to error

True Score Theory

• When you take a test, the developer is interested in the differences among people. We don't want every person to get the same score. • We want a measure that will be as accurate as possible. • Interested in the variance among all the participants.

speed test

contains items of a uniform level of difficulty (uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly

What is parallel form reliability?

an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal

What is alternate forms reliability?

an estimate of the extent to which these different forms of the same test have been affected by item sampling or other error - typically designed to be equivalent with respect to variables such as content and level of difficulty - a reliability estimate is based on the correlation between the total scores on the two forms

internal consistency estimate of reliability (estimate of inter-item consistency)

an estimate of the reliability of a test that is developed without obtaining an alternate form of the test and without having to administer the test twice to the same people

split-half reliability

an estimate that is obtained by correlating pairs of scores obtained from equivalent halves of a single test administered once; a useful measure of reliability when it is impractical to assess reliability with two tests

generalizability theory

an extension of true score theory wherein the concept of a universe score replaces that of a true score; based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation

What is a reliability coefficient?

an index of reliability, a proportion that indicates the ratio between the true score variance and the total variance.

reliability coefficient

an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance

What is odd-even reliability?

assign odd-numbered items to one half of the test and even-numbered items to the other half.

What is a criterion referenced test?

a test designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or vocational objective. * Should contain material that has been mastered in a hierarchical fashion.

What does the Spearman-Brown formula allow?

a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by a number of items. - could also be used to determine the number of items needed to attain a desired level of reliability
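
The general form of the formula, in standard notation: r_xy is the obtained reliability (e.g., the half-test correlation) and n is the factor by which the test length changes (n = 2 in the split-half case):

```latex
r_{SB} = \frac{n\, r_{xy}}{1 + (n - 1)\, r_{xy}}
```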

What is a power test?

a test where some items are so difficult that no test taker is able to obtain a perfect score

dynamic characteristic

a trait, state, or ability presumed to be ever changing as a function of situational and cognitive experiences

What is a dynamic characteristic?

a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences

What is measurement error?

all of the factors associated with the process of measuring some variable, other than the variable being measured. Broken into systematic or random

Spearman-Brown Formula

allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test

What is test-retest reliability?

an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test - appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time

odd-even reliability

an estimate of split-half reliability obtained by assigning odd-numbered items to one half of the test and even-numbered items to the other half

alternate forms reliability

Reflects the equivalence of two alternate forms of a test. Administration is performed on the same applicants; the forms measure the same attribute and are similar. The forms can be administered simultaneously or at two different testing periods. Scores from the first form are then correlated with scores on the second form. Forms should be equivalent in terms of content, response process, and statistical characteristics. Counterbalancing (within one testing period, giving half the group Form A first and the other half Form B first) is used to guard against order effects. This type of test is difficult to develop: it requires two tests that measure the same thing but differ in content.

test-retest reliability

Reflects the stability of a test over time. Administered to the same applicants, who are given the same test during two different testing periods; scores at time one are correlated with scores at time two. An appropriate re-test interval is between 2 weeks and 6 months. This is good for tests that we do not expect to have practice effects and tests in which the measured trait will not change. Errors tend to be related to the administrator or the individual (possibly reflecting random errors). Good: tests of achievement. Bad to use: tests of traits that change developmentally.

Age norms

Relate a level of test performance to the age of the people who have taken the test. Mainly used in child development (e.g., growth charts) and most IQ tests (e.g., the Stanford-Binet IQ test).

What does error refer to?

The component of the observed test score that does not have to do with the testtaker's ability.

reliability

The consistency or stability of a measure of behavior; the extent to which a score from a test is consistent and free from errors of measurement. Reliability coefficients above 0.8 are very good; if they are less than 0.6, the measure is considered unusable and should be removed.

Kuder-Richardson Formula 20

Used for test with dichotomous items (yes-no; true-false). Used to measure internal reliability. Administration uses same applicants, same test, and the average intercorrelation among test items.

Inter-rater (interobserver) reliability

Used when human judgment of performance is involved in the scoring process. Refers to the degree of agreement between 2 or more raters, assessed by correlating the judges' ratings with one another.
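
A minimal sketch of the simplest version of such a coefficient, a Pearson correlation between two raters' scores for the same examinees (the ratings below are made up for illustration):

```python
import numpy as np

rater_a = np.array([4, 3, 5, 2, 4, 5])   # rater A's scores for six examinees
rater_b = np.array([4, 2, 5, 3, 4, 4])   # rater B's scores for the same examinees
r_interscorer = np.corrcoef(rater_a, rater_b)[0, 1]   # coefficient of inter-scorer reliability
print(round(r_interscorer, 2))           # 0.77
```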

cronbach's coefficient alpha

Used with ratio or interval data. Used to measure internal reliability. Administration uses same applicants, same test, and the average intercorrelation among test items.

What is coefficient of stability?

When the interval between testing is greater than six months, it is the estimate of test-retest reliability

What is the formula for error?

X = T + E (X = observed score, T = true score, E = error)

Spearman-Brown prophecy

a formula that takes the correlation achieved by the split-half method and adjusts it upward to give the corrected (full-length) correlation. Accounts for the change in length of the questionnaire.

What is the average proportional distance?

a measure that focuses on the degree of difference that exists between item scores. - The APD index is not connected to the number of items on a measure

standard error of measurement (SEM)

a number representing how far an observed score is likely to be from the true score. Gives an indication of how much error affects the score; we want a low value. _____ ______ ____ _______ = standard deviation of the error of measurement (also known as SEest or SEE)

confidence interval

a range or band of test scores that is likely to contain the true score
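
A worked illustration using the standard SEM and confidence-interval formulas with hypothetical values (test SD = 15, r_xx = .91, observed score = 100), none of which come from the study set:

```latex
SEM = SD\sqrt{1 - r_{xx}} = 15\sqrt{1 - .91} = 4.5,
\qquad
95\%\ \text{CI} \approx 100 \pm 1.96 \times 4.5 = [91.2,\ 108.8]
```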

What is random error?

a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. ex: noise

What is systematic error?

a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. (Once a systematic error becomes known, it becomes predictable.)

standard error of the difference

a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant
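
The standard formula, where SEM_1 and SEM_2 are the standard errors of measurement of the two scores being compared; the second equality assumes both tests are on the same scale with standard deviation sigma:

```latex
\sigma_{\text{diff}} = \sqrt{SEM_1^{2} + SEM_2^{2}} = \sigma\sqrt{2 - r_{11} - r_{22}}
```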

What is a heterogeneous test?

a test composed of items that measure more than one trait

What is a speed test?

a test containing items of uniform level of difficulty so that, when given generous time limits, all test takers should be able to complete all the test items correctly

What is a homogeneous test?

a test containing items that measure a single trait.

