EDUC 768 Measurement Theory and Test Construction


proficiency scales

Answers that can be categorized as right/wrong (1/0), regardless of how deep the scale goes in terms of answer choices

dichotomous scales

Answers to an item that are restricted to two options, such as yes/no or true/false

Standardized Item Total Correlation

Standardization of the scores before computing the item-total correlation. Better to use standardized correlations because you can compare across different item sets with different combinations of items.

matching item

item where multiple rows of scale anchors are provided and the test taker is prompted to match anchors

Item Logits

logits of item difficulty scores used to order items on a proficiency test normalized with mean=0, sd=1

Logit

the inverse of the sigmoidal "logistic" function or logistic transform used in mathematics, especially in statistics. When the function's variable represents a probability p, the logit function gives the log-odds, or the logarithm of the odds p/(1 − p)

Light's G

???

Cohen's Kappa

- A means to assess simple agreement on a nominal scale - calculated as (observed agreement - chance agreement)/(1 - chance agreement) - ranges from -1.00 to 1.00 - asymptotic (converges to a limit over the distribution) - weighted version: when nominal categories show a hierarchy, applies weights so disagreements are penalized by their distance
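A minimal sketch of the unweighted computation from a rater-by-rater contingency table (numpy assumed; the counts are hypothetical):

import numpy as np

def cohens_kappa(table):
    # table: square matrix of counts, rater A categories in rows, rater B in columns
    table = np.asarray(table, dtype=float)
    n = table.sum()
    observed = np.trace(table) / n  # observed agreement
    expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2  # chance agreement
    return (observed - expected) / (1 - expected)

# hypothetical example: two raters classifying 100 cases as pass/fail
print(cohens_kappa([[40, 10], [5, 45]]))  # about 0.70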

Mantel-Haenszel Test

- Chi-squared type of test - correlation between rows and columns in a contingency table - used on proficiency tests - used for testing differential item functioning
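A sketch of the Mantel-Haenszel chi-square computed by hand over score-level strata; the 2x2 tables (group x correct/incorrect) are hypothetical:

import numpy as np

def mantel_haenszel_chi2(tables):
    # tables: list of 2x2 count arrays [[a, b], [c, d]], one per score stratum
    a_sum = e_sum = v_sum = 0.0
    for t in tables:
        (a, b), (c, d) = np.asarray(t, dtype=float)
        n = a + b + c + d
        a_sum += a
        e_sum += (a + b) * (a + c) / n  # expected a under row/column independence
        v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n**2 * (n - 1))
    # continuity-corrected statistic, referred to chi-square with df = 1
    return (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum

print(mantel_haenszel_chi2([[[30, 10], [25, 15]], [[20, 5], [10, 15]]]))  # ~6.6, df = 1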

Standard Error of Estimated True Scores

- Error of the observed score discounted by the reliability coefficient - SEt = rxx*SEx - or SEt = rxx*(SDx*(sqrt(1-rxx)))
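A quick numeric check of the two equivalent forms (the reliability and SD values are hypothetical):

# hypothetical values: reliability rxx = 0.85, observed-score SD = 12
rxx, sd_x = 0.85, 12.0
se_x = sd_x * (1 - rxx) ** 0.5  # standard error of measurement
se_t = rxx * se_x               # standard error of estimated true scores
print(se_x, se_t)               # about 4.65 and 3.95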

True Score Equating

- Getting different forms on the same scale (not necessarily equal): the tests must be equated so they measure the same thing and produce similar scores that can be compared - done by creating parallel forms and distributing them randomly to equal distributions of groups - Two methods: equipercentile method and linear conversion

Bimodal Distribution

- Results from the combination of two dissimilar samples - an equal number of examinees will result in both distributions having the same shape - an unequal number of examinees will result in the expected value of one being higher than the other

Kurtosis

- The measure of discrimination in a distribution; gives an indication of item variability (how "peaked" the curve is) - Leptokurtic: not enough discrimination on the edges, too much in the middle (Betak > 0) - Platykurtic: spread of discrimination too thin, but might not be a bad thing (Betak < 0)

Item Response Behavior

- The relationship between the probability of passing an item and the score achieved on the test 1) non-linear curves (ogives) 2) curves do not cross one another --> item difficulty is consistent across all scores obtained

Floor Effect in test scores

- When the distribution of test scores does not slope down to the left of the mean - will not measure the ability of low performers

Cronbach's Alpha

- a measure of heterogeneity in variance --> whether or not the test is correlating properly with the construct - components: length of the test, variance of scores, variance of a particular component/item - a measure of the internal consistency of an exam - and a lower-bound estimate of reliability
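A minimal sketch of the textbook formula alpha = (k/(k-1))(1 - sum of item variances / variance of total scores), with numpy assumed and a hypothetical examinee-by-item score matrix:

import numpy as np

def cronbach_alpha(items):
    # items: examinees x items matrix of item scores
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# hypothetical 0/1 responses from five examinees to four items
print(cronbach_alpha([[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]))  # 0.80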

item discrimination

- a measure of how well an item discerns between high achievers and low achievers - calculated as item difficulty among the upper echelon (test takers who scored in the upper quartile or third for that test) less the item difficulty of the lowest echelon (test takers who scored in the lower quartile or third for that test) - ranges from -1 (distractor) to 1 (good item) - ideal range is from 0.2 to 0.8 - remember that discrimination happens in the tails of the distribution of scores, so in order to reduce non-discrimination in the mass, include more items in that area
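A sketch of the upper-lower index; the quartile split (rather than thirds) is an assumption, numpy assumed, data hypothetical:

import numpy as np

def discrimination_index(total_scores, item_responses):
    # total_scores: total test scores; item_responses: 0/1 answers to one item
    scores = np.asarray(total_scores, dtype=float)
    item = np.asarray(item_responses, dtype=float)
    lo, hi = np.quantile(scores, [0.25, 0.75])
    p_upper = item[scores >= hi].mean()  # item difficulty in the top quartile
    p_lower = item[scores <= lo].mean()  # item difficulty in the bottom quartile
    return p_upper - p_lower             # -1 (distractor) up to +1 (good item)

scores = [90, 85, 80, 70, 60, 55, 50, 40]
item = [1, 1, 1, 1, 0, 1, 0, 0]
print(discrimination_index(scores, item))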

Pearson product moment correlation

- a measure of the linear correlation between two continuous variables X and Y - knowledgeable of the rank and directionality - the coefficient is blind to the level of the parameter

Correction for Attenuation

- a measure to adjust for measurement error when comparing two tests - rtxty = (rxy)/(sqrt(rxx*ryy)) where: rxy is the observed correlation between the two tests; rxx, ryy are the intraclass (reliability) correlations for each test
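The formula is a one-liner; a sketch with hypothetical values:

def correct_for_attenuation(r_xy, r_xx, r_yy):
    # disattenuated correlation between the true scores of the two tests
    return r_xy / (r_xx * r_yy) ** 0.5

# hypothetical: observed r = .42, reliabilities .80 and .70
print(correct_for_attenuation(0.42, 0.80, 0.70))  # about 0.56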

Vertical Equating

- an attempt to equate scores on two tests which are intentionally designed to be different in difficulty and intended for groups with different ranges of abilities, but which measure the same general area of knowledge or general domain of skills - linking items must not have DIF - uses the Rasch model of logistic probabilities (similar to the 1PL model) - after tests are scaled, compare mean scores from one test to another; so long as the difference is significant, the students gained/lost abilities accordingly - the reliability of the combined tests will be larger than each individual test

Kendall's Coefficient of Concordance (W)

- an index of interrater reliability for ordinal data; the coefficient can be corrected for ties within raters - for m raters rating k subjects in rank order from 1 to k - ranges from 0 (no agreement) to 1.00 (perfect agreement)

What is reliability in terms of general testing theory?

- consistency in what is being measured in terms of dimensionality and standardization

What is standardization in terms of general testing theory?

- controlled conditions for the presentation of the items - order, environment

Horizontal Equating

- equating scores between two tests which are similar in content and difficulty and intended for the same population of examinees - linking items must not have DIF - uses the Rasch model of logistic probabilities (similar to the 1PL model) - done with equivalent-groups equating, aka when the mean and SD of the two groups are the same and when randomization was used in setting up the groups - forms are equivalent when the means and SDs are close, and when the Pearson correlation between the Rasch logit and item difficulty is closer to -1

What is generalizability in terms of general testing theory?

- performance/mastery of the sample domain that is applicable to the remainder of the domain - the subset of the domain items used represents the larger domain - longitudinally generalizable

Item Total Correlations

- performed to check whether any item in the test is inconsistent with the averaged behavior of the whole test, and thus can be discarded if it does not adhere; the analysis purifies the measure by eliminating 'garbage' items prior to determining the factors that represent the construct - should be in the 0.2-0.8 range - the discrimination index, by comparison, is limited, because it does not embrace the full amount of information (it just considers upper and lower performers, not the middle performers)
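A sketch of the corrected (item-remainder) form, correlating each item with the sum of the remaining items; numpy assumed, response matrix hypothetical:

import numpy as np

def item_remainder_correlations(items):
    # items: examinees x items matrix; returns one correlation per item
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

print(item_remainder_correlations([[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]))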

What is domain sampling in terms of general testing theory?

- sampling within the domain of the dimension - based on the theory of the dimension (suicide, depression) - not the sampling of the people --> sampling of the domain (aka depression --> suicide, lethargy, etc)

U-shaped distribution

- super discriminate in the middle, no discrimination at the ends - not a lot of easy and hard questions - test is measuring losers vs winners => cut score tests (national merit exam)

Intraclass correlation

- the agreement across tests/teachers for continuous and interval variables - known as r(icc) - is knowledgeable of the rank, directionality, and level of the tests - used in assessing rater reliability - will yield a higher value than Pearson due to the level of agreement

Item Difficulty

- the proportion of correct responses for a given item, calculated as the number of total correct responses over the total number of test takers - constricted variance in the items reduces the amount of information you can glean from the analysis - ranges from 0.00 (hard) to 1.00 (easy) - maximizing difficulty = 0.50 - proper range => 0.3 to 0.7 - correction for guessing => p+(p/m), where m = number of anchors for that item; this makes the assumption that the probability of getting an answer correct is equal across all answer choices (a naive assumption)
We need to recognize that the choices on a multiple choice test have their own base rates. We do not really use the adjustment for guessing anymore, because we would now need a much more complicated algorithm to adjust for real base rates; but the theory behind the idea of guessing is still discussed and important to consider

Item standard deviation

- the spread around the expected score for that item - items should have a relatively healthy and large SD - the lower the SD, the more issues with other measures the item will have - the best way to identify & symptomize the worst items

Standard Error of Measurement

- the sum of all errors between the regression line (expected Y) and the observed scores - calculated as the standard deviation times the square root of one minus the reliability coefficient - the half-width of the measurement level (one side of the confidence interval)
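A numeric sketch of the formula and the confidence band it implies (SD and reliability values hypothetical):

# hypothetical test: SD = 15, reliability = 0.91
sd_x, rxx = 15.0, 0.91
sem = sd_x * (1 - rxx) ** 0.5  # SD * sqrt(1 - reliability)
x = 110                        # an observed score
print(x - sem, x + sem)        # roughly 68% confidence band: 105.5 to 114.5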

What is validity in terms of general testing theory?

- what you are measuring is what was purported to be measured - domain sampling, generalizability, reliability

Ceiling Effect in test scores

- when the distribution of test scores does not slope down to the right of the mean - will not measure the ability of high performers

IRT Distribution

-This is a function of the trait level (theta) predicting the Standard Error of Measurement. - For FIXED ITEM EXAMS, is a u-shaped curve where error is the lowest when Theta = 0, and increases the more Theta is further from 0 in either direction - For CAT exams, the SEM remains constant regardless of the Theta level. - In either case, the longer the test, the lower the SEM

Spearman-Brown Prophecy

-a formula relating psychometric reliability to test length, used by psychometricians to predict the reliability of a test after changing the test length -the finding is that as the length of a test increases so does its reliability, but as length increases, diminishing returns in reliability set in - can be re-engineered to estimate the test length needed to reach a target reliability
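A sketch of both directions of the formula (the .70 starting reliability and .90 target are hypothetical):

def spearman_brown(r, k):
    # predicted reliability when the test is lengthened by a factor of k
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r, r_target):
    # the "re-engineered" direction: how much longer the test must be
    return r_target * (1 - r) / (r * (1 - r_target))

print(spearman_brown(0.70, 2))           # doubling a .70 test -> about .82
print(length_factor_needed(0.70, 0.90))  # about 3.9 times as long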

What is dimensionality in terms of general testing theory?

-limits on what is being measured -discernible -homogeneity in the variance of the stems with Cronbach's alpha and omega

aspects of general testing theory

1) Dimensionality - this principle sets limits on what is being measured. It is assumed that the test reflects some dimension; it is typically either a unidimensional or a multidimensional test. Need to know ahead of time what those dimensions are.
2) Domain sampling - level of proficiency or level of attribution is measured by a relatively small, representative number of tasks.
3) Standardization - need to make sure there are controlled situations to gather data.
4) Generalizability - generalize level of attribution or level of proficiency of summary scores across to the rest of the domain (ex: the domain of mathematics, etc.), and also generalize to other times.
5) Reliability - consistency of measurements. Depends on homogeneity (sameness) of dimensionality and the accuracy of the standardization.
6) Validity - does the test measure what it is supposed to be measuring? Validity is dependent on domain sampling and reliability

Item Response Theory Assumptions

1) Item response behavior is a function of a latent trait (theta), the subject parameter (where Beta is the item difficulty, the item parameter)
2) Appropriate dimensionality (theta): unidimensionality - one latent trait at a time; the only thing that ties the items of a test together is theta
3) Local independence: the probability of any item is independent of any other item (e.g., Beta1 = 0.8, Beta2 = 0.7 ==> P(B1 ^ B2) = 0.56 ==> independence)
4) The probability of passing an item is determined exclusively by the elements (components) in the model

Items to compare between 1PL, 2PL, and 3PL Models

1) Maximum information approximations for theta
2) Essential statistics (mean, SD, empirical reliability, reliability index, average information, approximate max information and at what theta level)
3) The result of the EM algorithm and Newton cycles (-2 Log Likelihood, number of cycles needed to converge)
4) Chi-squared tests of differences in -2 Log Likelihood between models (where the lower the value, the better-fitting the model, so long as the difference is significant)
5) Comparing the test information function between each model

What are the assumptions used in true score theory?

1) Observed score = True score + Error (X = T + E); additivity assumption: E = X - T
2) The population mean of the observed scores is the true score (E(X) = T); true scores are based on empirical reality
3) No correlation between true scores and error (rho(te) = 0); anything else would amount to contamination; the error must be random and unsystematic; violation: lack of fairness in testing
4) Cross-test errors are not correlated (rho(e1e2) = 0), e.g., math vs science test
5) All in all, errors should be correlated with nothing
6) If two tests have observed scores X and X' that meet assumptions 1-5, and if, for every population of examinees, T = T' and the standard deviations are equal, then the tests are called parallel tests; this also implies that the true mean scores of the tests are equal
7) If two tests have observed scores X1 and X2 that satisfy assumptions 1-5, and if, for every population of examinees, T1 = T2 + c, where c is a constant, then the tests are called essentially tau-equivalent, and scores across tests will be related

What are the assumptions of the chisquared goodness of fit test?

1) The sample values are independent. 2) The sample values are grouped in C categories, and the count of the number of sample values occurring in each category is recorded. 3) The hypothesized distribution is specified in advance, so that the number of observations that should appear in each category, assuming the hypothesized distribution is the correct one, can be calculated without reference to the sample values. 4) The sample must be drawn randomly from the population.

Polyserial Correlation

A correlation between a continuous variable and an ordinal variable. Makes an adjustment based on the ordinal nature of measurement rather than dichotomous/continuous. Will be higher than Pearson, and will make it easier to find statistical significance. Important for reporting the Item Total Correlation: if the items are ordinal, we want the items to perform at their best and adjust for the fact that they are ordinal (so we use the polyserial adjustment)

Confirmatory Factor Analysis

A different way of testing unidimensionality for polytomous items. Hypothesis: each item has a latent dimension/salient component (common) and error. Care about fit statistics: the covariance matrix being produced

2PL Model

A direct assessment of the probability of getting an item correct given the ability of the test taker (theta), the item difficulty (beta), and the slope of the item aka discrimination (alpha)

3PL Model

A direct assessment of the probability of passing an item given the test taker's ability (theta), the item difficulty (beta), the item discrimination (alpha), and guessing chance (gamma). For dichotomous items where there are more than two alternative responses (A, B, C...). The chance of guessing lifts the lower asymptote, so that the y-intercept of the curve is not at zero
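A sketch of the 3PL item response function; setting gamma = 0 gives the 2PL, and additionally fixing alpha = 1 gives the 1PL/Rasch case (parameter values are hypothetical, and the 1.7 scaling constant some texts include is omitted):

import math

def p_correct(theta, beta, alpha=1.0, gamma=0.0):
    # 3PL: gamma + (1 - gamma) * logistic(alpha * (theta - beta))
    return gamma + (1 - gamma) / (1 + math.exp(-alpha * (theta - beta)))

# an examinee of average ability on an average-difficulty item with guessing
print(p_correct(0.0, 0.0, alpha=1.2, gamma=0.25))  # 0.625, not 0.5: guessing lifts the curve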

Differential Item Functioning

A measure of internal bias for each item. TST: if the proportion of individuals getting the item correct is NOT the same for each population group being considered. IRT: if an item has a different item response function for one group than for another, it is clear that the item is biased. DIF may be expected, albeit not due to bias (ADHD testing). It does matter when using linking items to equate cross-test equivalence. Analysis method: conduct a 1PL test for multiple groups (removing items with negative correlations in either group) against a 1PL model without groups, comparing the -2 Log Likelihood across models. If the difference is significant, then DIF is evident in the test. Additional analysis is needed to identify individual items (using a two-sided t-test). ANY ITEM THAT HAS DIF CANNOT BE USED IN EQUATING STUDIES

Kendall's Tau

A measure of nonlinear dependence between two random variables. Especially used for dependent observations made by the same person. For use on ordinal data

1PL Model

A model for assessing the odds of passing a test question, driven by the ability of the test taker (Theta) and the difficulty of the question (Beta). A direct model. Just a really awfully long way of proving that Rasch's Log Odds Model works. Odds ratio for passing an item = (true positives/false positives)/(false negatives/true negatives). Items analyzed in a 1PL model will have the same slope (alpha) and loading (alpha/sqrt(1+alpha^2)). An assumption of the 1PL model is that all who get the same score are assumed to have the same level of ability

Item Characteristic Curve (ICC)

A monotonically increasing curve that indicates the proportion of success given the total test score. y-axis: probability of passing the item (goes from 0 to a perfect 1). x-axis: trait level, or test score; goes from 0 to the max possible score. The steeper the slope, the more discriminating the item. A good comparative point is where the curves cross the 0.5 success threshold

bipolar adjective scale

A scale between two polar adjectives (for example: "Adequate-Inadequate", "Good-Evil" or "Valuable-Worthless")

semantic differential scale

A type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept

Maximizing coefficient Alpha

After running the analysis, remove items that have an ICC coefficient < 0.2, which leaves items that are more highly correlated. Be careful not to simply delete items that are just too easy or too hard - that suppresses alpha

Item Slopes

Also known as the Alpha. Slopes are directly related to the reliability of the test: "If the slopes don't make it, the test won't make it". Standards: <0.38 = uni-dimensional loading is in a bad zone (not sufficiently related to the construct); ~0.50 = still pretty bad; ~0.70 = decent; >1.00 = GOOD

What is the point-biserial correlation coefficient?

An index of association between one truly dichotomous variable and one continuous variable - is a Pearson coefficient - aka score on a test vs right/wrong on an item

What is the Phi Coefficient

An index of association between two truly dichotomous variables, which makes it a Pearson coefficient - assumes the expression of a variable bears a one-to-one correspondence with the underlying trait distribution - aka 0/1 item 1 vs 0/1 item 2.

A measure of the degree of association between two binary variables, similar to the correlation coefficient in its interpretation. Two binary variables are considered positively associated if most of the data falls along the diagonal cells (i.e., a and d are larger than b and c). In contrast, two binary variables are considered negatively associated if most of the data falls off the diagonal.
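A sketch from the 2x2 cell counts named in the entry (a, d on the diagonal; the counts are hypothetical):

def phi_coefficient(a, b, c, d):
    # phi = (ad - bc) / sqrt(product of the row and column marginals)
    return (a * d - b * c) / ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5

# hypothetical counts with most of the data on the diagonal
print(phi_coefficient(40, 10, 5, 45))  # about 0.70: positive association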

integrated items

An item with one stem but two different answers - an item whose anchors can be answered in multiple ways (have you been to the hospital in a) the last 21 days? b) the last 90 days?), also known as dual-purpose items. Trying to get two different types of responses: how things truly are, and how you would like things to be (reality vs. ideal). Motivation happens when you see a difference between how you see things and how you would like them to be.

Likert Scale

An ordinal scale that is a gradation from one point to another

Item Difficulty and IRT

As the examinee's ability level increases in IRT, the likelihood of answering more difficult problems increases in comparison to others at lower levels. Regardless of the examinee's ability level, the likelihood of getting an answer correct will diminish as the questions get harder; ability level does, however, determine how much that probability diminishes

Grade Equivalence

Based on medians. It is a form of a percentile score, a rank score. There are attempts to make them interpretable to teachers, but they can give very misleading information (same with age-equivalent scores). Interpolations... they are percentiles (they are dangerous, especially when discerning true meaning)

benevolence scale

Benevolence reflects the sense that the trustee wants to "do good" to the truster, with "doing good" including concepts such as being caring and open.

What is the biserial coefficient?

Biserial correlation is almost the same as point biserial correlation, but one of the variables is dichotomous ordinal data with an underlying continuity. For example, depression level can be measured on a continuous scale but classified dichotomously as high/low. NOT a Pearson correlation. The formula is: rb = [(Y1 - Y0) * (pq/Y)] / σy, where: Y0 = mean score for data pairs with x = 0, Y1 = mean score for data pairs with x = 1, q = proportion of data pairs with x = 0, p = proportion of data pairs with x = 1, σy = population standard deviation, and Y = the height of the standard normal distribution at z, where P(z' < z) = q and P(z' > z) = p.
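A sketch of that formula with scipy's normal distribution supplying the ordinate Y (the data are hypothetical):

import numpy as np
from scipy.stats import norm

def biserial(y, x):
    # y: continuous scores; x: 0/1 flags from an artificially dichotomized trait
    y, x = np.asarray(y, dtype=float), np.asarray(x)
    p = x.mean()                      # proportion with x = 1
    q = 1 - p
    ordinate = norm.pdf(norm.ppf(q))  # height of N(0,1) at the cut point z, P(z' < z) = q
    return (y[x == 1].mean() - y[x == 0].mean()) * p * q / (ordinate * y.std())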

Blom Conversion Method

Blom's formula (1958) responds to the curvilinear relationship between a score's rank in a sample and its normal deviate. Because "Blom conjectured that α always lies in the interval (0.33, 0.50)," explained Harter, "he suggested the use of α = 3/8 as a compromise value"

Item Astray Index

Calculated as the item difficulty minus the quantity (number of answer choices less 1) over the squared number of answer choices: p - ((n-1)/n^2)

Reliability

Correlation coefficients thrive on variability, but do not thrive on homogeneity. A correlation between the scores on a test and the scores on a criterion. A problem arises when we try to validate a test but the test is not being used on the "centrality" of the population - only on the outliers/extremes of the distribution

What is the tetrachoric correlation coefficient?

Correlation of two artificially dichotomized variables, both assumed to have a bivariate normal trait (aka at least interval scale) underlying each variable (tall/short or fat/skinny). Is NOT a Pearson coefficient

Reliability of a linear combination

Created by Nunnally in 1978. A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results (e.g. a linear combination of x and y would be any expression of the form ax + by, where a and b are constants). The reliability of a combination of internal consistencies for multiple scales: the reliability coefficient looks at the correlation of at least two linear combinations. For a multidimensional test, you would want a reliability coefficient of at least 0.7. If you have multiple tests with reliability of over 0.7, you would have an overall combination reliability close to 0.9

Precision weighting

Creating assessment weights that are not consistent across all items in order to favor one construct over another

EM Algorithm

Cycling statistics in Bilog that tracks disparities between the model and data, against a criterion of added model benefit

Newton cycles

Cycling statistics in Bilog that tracks disparities in the standard error of the model and the data, against a criterion of added model benefit These cycles specify the standard error

Kuder-Richardson 20

For dichotomously scored items with a range of difficulty: multiple choice, short answer, fill in the blank. An alternative to Cronbach's Alpha

Factor Loading

Factor loadings are part of the outcome from factor analysis, which serves as a data reduction method designed to explain the correlations between observed variables using a smaller number of factors Factor loading comes from alpha conversion Factor loadings can also be viewed as standardized regression coefficients, or regression weights. Because an observed variable is a linear combination of latent common factors plus a unique component, such a structure is analogous to a multiple linear regression model where each observed variable is a response and common factors are predictors. From this perspective, factor loadings are viewed as standardized regression coefficients when all observed variables and common factors are standardized to have unit variance. Stated differently, factor loadings can be thought of as an optimal set of regression weights that predicts an observed variable using latent common factors.

Partial Credit Model

Similar to the Graded Response Model; can be used for a Likert Scale: get some points for doing part of the item correctly, but the credit is scored on a scale. Assumptions: ordinality; assumes that as more points are awarded in partial credit, it is more difficult to get more points (not an assumption for the GRM). Is a Direct Model

Item Biserial Correlations

First measure of determining a good vs bad item Starting values for the EM algorithm Look for: 1) outliers, 2) <0 Remove items that are bad

Unit Weighting

Giving multiple assessments equal weight when computing a candidate's overall score

Linear Conversion Method of Equivalent Groups Equating

If the within-forms distributions of pi (item difficulty) are approximately normal and well-centered, the true scores will be normally distributed when you standardize each with the same mean and standard deviation. Raw scores sharing the same standard score are said to be equated because they share the same percentile and are linearly related

emergent construct

If, however, it is believed that the construct is the result (i.e. the combined effect) of its indicators, the construct and its indicators are referred to as an emergent construct and formative indicators, respectively.

What is the biserial correlation coefficient?

Index of association between an artificially dichotomized variable and a continuous variable. Assumed that both traits are continuous and on the interval scale of measurement NOT a Pearson coefficient aka: score on a test vs fat/skinny

Fisher's Information

Information is conceptualized as the opposite of error (standard error). Calculated as 1/(SE^2). Want information to be as high as possible, which means the SE is as low as possible. Want to select items that have higher information... closer to the center of the distribution, etc. (better items)
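The card gives information as 1/SE^2; under a 2PL model the item-level version works out to alpha^2 * P * (1 - P), which the sketch below uses (parameters hypothetical, 1.7 scaling constant omitted):

import math

def item_information(theta, beta, alpha):
    # 2PL item information; peaks where P = 0.5, i.e., near theta = beta
    p = 1 / (1 + math.exp(-alpha * (theta - beta)))
    return alpha ** 2 * p * (1 - p)

info = item_information(0.0, 0.0, 1.5)
print(info, 1 / info ** 0.5)  # information and the SE it implies (SE = 1/sqrt(info))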

True Test Score Theory Bias

Internal: not relating the test to another criterion - it is about properties internal to the test. Ways to measure internal test bias: comparing alpha measures between two salient subgroups. There is a danger in just looking at the overall reliability rather than looking at reliability per salient subgroup (see if there is reduced reliability in particular groups). Don't want overall reliability to be at 0.7, because it probably means some subgroup drops below (want reliability to be a bit higher, so there is some cushion room for all subgroups to still fall above 0.7). Differential Item Functioning (Mantel-Haenszel test).

External: measured by the intercept and slope test bias method (only one method). The test predicts differently for different groups. Daubert rule - sets up standards for introducing evidence (fingerprints, DNA vs. lie detector); intercept-slope bias is good evidence under this rule. Prediction bias

Trait Level and Standard Error of Measurement

Known as theta, this is the target construct in IRT: an individual's ability to perform within that construct. In fixed exams, theta is at its lowest standard error of measurement at level 0 (medium ability), but the error increases as theta moves left (theta < 0) for low performers and right (theta > 0) for high performers. In adaptive exams, the standard error of measurement remains level regardless of the theta trait level

Skewness

Looks at symmetry of curve If score distribution is shifted left or right If there is a long tail on the right side, it is POSITIVELY skewed (more scores on the low side). For instance, a psychopathology test... most scores are low If there is a long tail on the left side, it is NEGATIVELY skewed (more scores on the high side). For instance, a mastery test would have most scores on the high side

malevolence scale

Malevolence reflects the sense that the trustee does not want to "do good" to the truster, with "doing good" including concepts such as being caring and open

What are the considerations for mastery/clinical tests?

Mastery/clinical tests are not expected to have a centering notion like other tests. Therefore, may need to adjust the rules around the expected range of allowable item difficulty and discrimination Mastery tests (driver's test) - expected left/negative skew (most examinees get correct answers) Clinical tests (mental illness test) - expected right/positive skew (most examinees get incorrect answers)

Non-Equivalent Groups Design

Non-random assignment of people to groups (example: primary languages). Can still use equivalent groups equating methods, but just can't call it equal groups

Area transformation

Not linear transformations. They first transform all raw scores into percentiles, and then assign a standard score for each percentile point. They attempt to normalize the scores, and are also known as "normalized standard scores"

Bilog Outputs

PH1 - classical statistics and item analysis PH2 - calibration PH3 - scoring PAR - parameters (alphas, betas) SCO - scores

Graded Response Model

Probability of falling in or about a given response category conditional on a given trait level (Theta). Looks at a set of response categories and thinks about the area that separates 0 (never) from 1 (rarely), and 1 (rarely) from 2 (sometimes). Produces operating characteristic curves. Threshold level of P(1) = 0.50. Similar to the 2PL Model. This model operates in a kind of dichotomous fashion (just 1 vs. 0); it's just chopping the scale up into different dichotomies. Is an Indirect Model

Test Information Curve

Produces an overlay of the test information curve. The left-hand axis of the plot is the measurement error; the right-hand axis is the total test information. The horizontal axis is the theta ability level. What is important is where the two curves intersect: it indicates the practical utility of the test - beyond (to the right of) that theta value is the zone where you would really not want to make any discrimination about theta values. This roughly happens at 2 SD of theta above the mean (cannot really make any fine discernments)... not useful past 2 SD of the theta ability level

Equipercentile Method of Equivalent Groups Equating

Randomize administration of the forms (A/B) and approximately the same distributions of true scores will result if the forms have the same length and distributions of pi (item difficulty) and di (item discrimination) or rxt (item-total correlation) Raw scores sharing the same standard score are said to be equated because they share the same percentile even if the raw scores are not normally distributed

How validity relates to reliability

Reliability sets the upper bound of validity: Validity ≤ Reliability^(1/2)

Standard Error of Discrepancy Between a Test Score and the Mean of a Set of Test Scores

SEm(x-y)=SEy(Ntests-(1/Ntests))^(1/2) Significance = (Mean of test score - test score)/SEm(x-y)

Standard Error of the Discrepancy Between Two Test Scores

SEx-y = sqrt(SEx^2 + SEy^2). This gets you the pooled standard error across tests. Significance = difference in test scores / SEx-y
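A numeric sketch of the pooled SE and the significance ratio (the scores and SEMs are hypothetical):

# hypothetical: two scores on the same scale, each with its own SEM
se_x, se_y = 4.5, 5.0
se_diff = (se_x**2 + se_y**2) ** 0.5  # pooled standard error of the discrepancy
x, y = 112, 100
print((x - y) / se_diff)              # about 1.78; compare to a z critical value such as 1.96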

What is the tetrachoric coefficient?

Tetrachoric correlation is used to measure rater agreement for binary data; binary data is data with two possible answers - usually right or wrong. The tetrachoric correlation estimates what the correlation would be if measured on a continuous scale. It is used for a variety of reasons, including analysis of scores in Item Response Theory (IRT) and converting comorbidity statistics to correlation coefficients. This type of correlation has the advantage that it's not affected by the number of rating levels or the marginal proportions for rating levels. The term "tetrachoric correlation" comes from the tetrachoric series, a numerical method used before the advent of computers. While it's more common to estimate correlations with methods like maximum likelihood estimation, there is a basic formula you can use.

Gaussian Curve

The Gaussian distribution is a continuous function which approximates the exact binomial distribution of events. The Gaussian distribution shown is normalized so that integrating over all values of x gives a probability of 1

Shrout-Fleiss Reliability

The Intraclass correlation is used as a measure of association when studying the reliability of raters

Shapiro-Wilk Test for Normality

The Shapiro-Wilk test for normality is one of three general normality tests designed to detect all departures from normality, and it is comparable in power to the other two tests. A conjoint test of kurtosis and skew (and it is an inferential statistic). The test rejects the hypothesis of normality when the p-value is less than or equal to 0.05. Failing the normality test allows you to state with 95% confidence that the data does not fit the normal distribution; passing the normality test only allows you to state that no significant departure from normality was found. The Shapiro-Wilk test is not as affected by ties as the Anderson-Darling test, but is still affected. The Skewness-Kurtosis All test is not affected by ties and is thus the default test. This is restricted to samples between 9 and 50 subjects

Item Anchor

The answer options to a test item

cognitive competence

The cognitive processes that comprise (i) creative thinking, which includes various creative thinking styles, such as legislative, global, and local thinking styles; and (ii) critical thinking, which includes reasoning, making inferences, self-reflection, and coordination of multiple views.

Item Remainder Coefficient

The correlation between the scores on the item and the score of the sum of the *remaining* items

Stem Anchors

The explanation to the scale answers (Most Often, Sometimes, Never) that are used across one or multiple items

What is the point-biserial coefficient

The point biserial correlation coefficient, rpbi, is a special case of Pearson's correlation coefficient. It measures the relationship between two variables: one continuous variable (must be ratio scale or interval scale) and one naturally binary variable.* Many different situations call for analyzing a link between a binary variable and a continuous variable. For example: does Drug A or Drug B improve depression? Are women or men likely to earn more as nurses?

Cautions: *If you intentionally force data to become binary so that you can run point biserial correlation, perhaps by splitting continuous ratio variables into two segments, it will make your results less reliable. There are exceptions to this rule of thumb. For example, you could separate test scores or GPAs into pass/fail, creating a logical binary variable. An example of unnaturally forcing a scale into a binary variable: saying that people under 5'9" are "short" and over 5'9" are "tall." One assumption for this test is that the observations are independent; therefore, the point biserial shouldn't be used to analyze experimental results - use linear regression with dummy variables instead.
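scipy provides this directly as scipy.stats.pointbiserialr; a sketch with hypothetical item-versus-total data:

import numpy as np
from scipy.stats import pointbiserialr

# hypothetical data: 0/1 responses to one item vs total test scores
item = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
total = np.array([55, 60, 72, 58, 80, 75, 77, 62, 85, 70])
r, p_value = pointbiserialr(item, total)
print(r, p_value)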

True Score Theory (TST)

The theory that observed scores are a function of true scores and error. That the standard error of measurement is constant and parallel to the true score

Linear Transformation

They maintain the distribution of the original standard scores (same exact distribution shape), will not change shape Always start with z-score calculation, and if you want to rescale, just do a linear transformation using a scalar multiplier and a new location

Notes on Item Analysis

- Total invariance is DEADLY (it means everyone has the same score); with no variability it is technically not a variable, just a number - total invariance needs to be removed
- Partial invariance needs to be hunted for (constricted variance). Correlational statistics is fully dependent on variance; as variance constricts, it undermines the ability of the correlation coefficient to produce large values. An item with a constricted variance will lose the ability to correlate with other items and will not contribute information to the rest of the scale
- Need to also check if an item has missing data. Contingency items have a tendency to miss data (examinees will either skip or choose not applicable). Missing data is bad business
- Invalid/poorly written items will induce people to skip an item (they do not understand the item); this counts as missing data (and it is not missing at random) and needs to be fixed
- People may make conscious attempts to check for missing data immediately and try to make immediate adjustments to obtain it - contact examinees who may have produced missing data (or the teachers who collected the data)
- Be wary of data that is missing, because you may have a validity issue: there are systematic reasons why people are not answering an item, and this will influence validity

Kuder-Richardson 21

Used for dichotomously scored items that are all about the same difficulty. A cruder estimate of Cronbach's Alpha - blind to the range of item variances/splits, because it reports only the average split of item variances

Longitudinal Transition

When the pre and post tests are on the same scale and can therefore be compared more accurately

multiple stems

When the question stem encompasses multiple constructs ("student is loud and annoys others"), which could lead to confusion when answering a question

Compound weighting in stem

When there are adjectives in the stem that emphasize the construct ("constantly moving in class"), which could sow confusion into the answer choices.

Using "I don't know" in a tests scale

Would advise against it, as it creates missing data and indicates that you are asking questions to people that are invalid

Spearman's Rho

a non-parametric test used to measure the strength of association between two variables, where the value r = 1 means a perfect positive correlation and the value r = -1 means a perfect negative correlation Requirements: Scale of measurement must be ordinal (or interval, ratio) Data must be in the form of matched pairs The association must be monotonic (i.e., variables increase in value together, or one increases while the other decreases)

Log Odds Model

a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between (a) the respondent's abilities, attitudes, or personality traits and (b) the item difficulty. Essentially, the probability of an examinee getting a question correct is a function of their ability (theta) less the item difficulty (beta)

omnibus scale

a scale that encompasses multiple subscales

super stem

an item question stem that encompasses multiple items

conditional item

an item that requires a criterion in order to answer it ("if you are Indian, do you like naan?")

contingency item

an item whose answer prompts additional items that are related to it

Steps for calculating the logit

example with item difficulty of 68.2% 1) Calculate the odds ratio: 0.682/(1-0.682) = 2.144 2) Calculate the natural log odds ratio: ln(2.144) = 0.763 3) Rescale the log odds ratio by the logistic scaling constant: 0.763/1.7 = 0.449 4) Re-sign the rescaled log odds ratio to create the logit: -1(0.449) = -0.449
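A sketch of the four steps as a function (using the 1.7 logistic-to-normal scaling constant that the card's arithmetic implies):

import math

def item_logit(p):
    odds = p / (1 - p)         # 1) odds ratio
    log_odds = math.log(odds)  # 2) natural log of the odds
    rescaled = log_odds / 1.7  # 3) rescale by the logistic scaling constant
    return -rescaled           # 4) re-sign: easy items get negative logits

print(item_logit(0.682))  # about -0.449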

Attribution Tests

includes personality, vocational, voting propensity, polling propensity, etc. - all kinds of measures that the social sciences are interested in when determining reactions from people to different stimuli. Not trying to measure how well someone can do something, but rather how one attributes things to other things: attributing personality characteristics to a person, behavior styles, etc. Attribute tests can be observational, self-reported, etc. In an attribute test, you wouldn't consider the idea of difficulty, discrimination, etc.

Proficiency Tests

measuring some sort of skill (some have lesser values, some have higher values); it takes a certain amount of skill to move through a proficiency measure. Usually, easier items are administered earlier and harder items later. Ex: intelligence, achievement, running abilities, strength tests. We are interested in the usefulness of an item in detecting who would be getting the higher scores and who would be getting the lower scores. Difficulty factors: found in factor analysis. Also introduces the concept of guessing (an adjustment for guessing is considered for proficiency tests). *Ultimately, proficiency tests are used to predict how well an individual would do at a later task*

Item Response Theory (IRT)

a model of designing, analyzing and scoring tests - the standard error of measurement is not parallel to true scores, as measurement error is not a constant - the line of perfect reliability (LPR) is not linear - raw scores are a function of the trait level (theta)

sub stem

question stems that are nested under an item super stem

testlet

sub-stems that share a common super stem, creates dependency, more homogeneously correlated. also makes reliability correlation calculations complicated as it violates the assumption of independent items

Item Stem

the beginning part of the item that presents the item as a problem to be solved, a question asked of the respondent, or an incomplete statement to be completed, as well as any other relevant information

estimated true score

the method in true score theory by which one can estimate the true score of an individual based on their observed score: rxx*(X - Mx) + Mx, where: rxx = the reliability coefficient, stated as a value between 0 and 1; X = the observed score; Mx = the mean score of the test. The less reliable the test, the more regression toward the mean one will see in the estimated true scores
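A sketch of the formula (the values are hypothetical):

def estimated_true_score(x, rxx, mean_x):
    # regresses the observed score toward the test mean as reliability drops
    return rxx * (x - mean_x) + mean_x

print(estimated_true_score(130, 0.90, 100))  # 127.0: pulled 3 points toward the mean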

valence direction

the negativity or positivity that is evident in the wording of the stem helps to weed out people who are lazy in response better to have an imbalance weighted in one direction

mnemonics

the study and development of systems for improving and assisting the memory; symbols without meaning reduce respondent bias

Latent construct

theoretical in nature; they cannot be observed directly and, therefore, cannot be measured directly either. When a bunch of items pick up the same thing (a latent construct), we begin to hone in on the construct. A single item does not measure depression, but a multitude of items that all indicate depression may measure the depression construct When individuals take a latent construct test, they tend to be pretty naive when determining what construct the item is actually trying to measure (such as in a personality test). When people believe they understand a construct that the items are trying to measure, they may be deceitful in their answers

asymmetrical anchors

when the anchors create a skewed distribution as their central point is not in the middle of anchor choices

rescaling z-scores

when you want to adjust the data to a different mean and standard deviation: 1) calculate the z-score on the basis of the raw data 2) multiply the raw z-score by the new standard deviation and add the new mean value
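A sketch of the two steps with numpy, rescaling to a hypothetical target scale (T-scores: mean 50, SD 10):

import numpy as np

raw = np.array([12, 15, 9, 20, 14], dtype=float)
z = (raw - raw.mean()) / raw.std(ddof=1)  # step 1: z-scores from the raw data
rescaled = z * 10 + 50                    # step 2: new SD = 10, new mean = 50
print(rescaled)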

