Preliminary Exam-Specialty


Furr, R. Michael; Bacharach, Verne R. (2008). Psychometrics: An Introduction

Chapter 1: Intro to Psychometrics
- Criterion-referenced test = takers either exceed the cutoff or don't; norm-referenced test = purely comparative between takers. These distinctions blur in reality.
- Speeded tests (timed, graded on amount completed) vs. power tests (untimed, graded on accuracy)
- Malingering = reverse social desirability responding

Chapter 2: Scaling
Three properties of every number:
- Identity (ability to categorize/reflect sameness vs. differentness; no numerical value)
- Order (relative amount of an attribute possessed by someone; 2 is more than 1)
- Quantity (exact amount of an attribute possessed; highest amount of info)
Always keep in mind whether your scale's 0 is absolute or relative (to decide how to manipulate scores for analysis)
Scaling = the particular way numbers or symbols are linked to observations to create a measure

Chapter 3: Differences, Consistency, and Test Score Meaning
Recap of mean, SD, r, z-scores, etc.

Chapter 4: Test Dimensionality and Factor Analysis (pg. 108 in pdf)
Questions to ask: how many dimensions? Are they correlated?
Three types of tests: unidimensional, uncorrelated factors (no total score calculated), and correlated factors (total score calculated)
For multidimensional tests, the psychometric properties of each factor (subscale) should be evaluated (reliability, etc.)
Rest of chapter focuses on EFA (CFA is Chapter 12)
Factor extraction in EFA: use principal axis factoring over principal components analysis when you want to identify latent constructs (Fabrigar et al., 1999)

Chapter 5: Reliability (Conceptual Basis)
Reliability is the extent to which variability on an instrument is due to the underlying construct being measured (according to CTT); can also be thought of as lack of error (random or systematic); THE RATIO OF TRUE SCORE VARIANCE TO TOTAL VARIANCE
Can also be thought of as the strength of the correlation between observed and true scores, OR the lack of correlation between error scores and observed scores
Variability among observed scores will always be higher than that among true scores because observed scores also include error variance
Standard error of measurement (SEM) is the average size of an error score (see the numeric sketch after this chapter)
Four models of test comparisons (for calculating reliability; the relationship between two administrations of a test, or the two halves of a test, determines what kind of reliability is calculated). All must meet the following CTT assumptions:
- True scores are perfectly correlated with each other and uncorrelated with error
- Error scores are random
- Unidimensionality
- Linear relationship between true scores on test A and test B
More assumptions from Sam: the expected value of X is T; the expected value of error is zero
The models:
1. Parallel tests = all assumptions met AND the slope linking scores is 1, the intercept linking scores is 0, and error variances are equal; reliabilities are equal
2. Tau equivalent = all CTT assumptions AND the slope/intercept clauses, BUT error variances aren't equal, so reliabilities may not be equal
3. Essentially tau equivalent = CTT assumptions and the slope clause ONLY; to calculate alpha, your test item pairs must meet this level of equivalence
4. Congeneric tests = slopes are linearly related (slope of one = a(slope) + b of another); most general; only omega reliability works here; allows different items to contribute different amounts to the total score based on factor loadings
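A minimal numeric sketch of the CTT definitions above, i.e., reliability as the ratio of true-score variance to observed variance and the SEM as the typical size of an error score. The variance components and variable names are made up for illustration, not taken from the textbook.

```python
import math

# Hypothetical variance components for a test (made-up numbers)
true_score_variance = 8.0
error_variance = 2.0
observed_variance = true_score_variance + error_variance  # CTT: Var(O) = Var(T) + Var(E)

# Reliability = true-score variance / total observed variance
reliability = true_score_variance / observed_variance  # 0.80

# Standard error of measurement: average size of an error score
sem = math.sqrt(observed_variance) * math.sqrt(1 - reliability)
print(reliability, round(sem, 3))  # 0.8, ~1.414 (equals the SD of the error scores)
```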
Chapter 6: Reliability (Empirical Estimates)
Three general reliability estimation methods:
1. Alternate forms (correlate scores between multiple, parallel test versions; you get the average reliability of test forms A and B)
2. Test-retest (correlate scores between multiple testing occasions; make sure true scores don't change between measurement occasions)
3. Internal consistency (correlate scores across parts of the test):
   - Split half (split the test into every possible half and correlate the halves; good unless different items on the test have different levels of difficulty; also needs to be corrected with the Spearman-Brown formula)
   - Raw coefficient alpha (sum of the covariances between every item pair, divided by degrees of freedom)
   - Standardized alpha (same as above but for z-scores)
   - Omega (most versatile; uses factor analysis to estimate the signal-to-noise ratio; doesn't require essential tau equivalence like alpha does)
Carryover effects (related error scores between two test forms) can be due to practice effects, mood, etc.
For difference scores (intraindividual or interindividual), the important factors are the reliabilities of each testing occasion and the correlation between the testing occasions
Internal consistency really just tells us how well our items are sampled from a single domain
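A small sketch of the raw coefficient alpha and split-half (with Spearman-Brown correction) estimates described above, using the standard variance-based alpha formula; the item responses and the particular half-split are fabricated.

```python
import numpy as np

# Fabricated item responses: 6 people x 4 items
X = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
], dtype=float)

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1)
total_var = X.sum(axis=1).var(ddof=1)

# Raw coefficient alpha: (k / (k-1)) * (1 - sum of item variances / total score variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# One split-half estimate, stepped up with the Spearman-Brown formula
half1 = X[:, [0, 2]].sum(axis=1)
half2 = X[:, [1, 3]].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
split_half = 2 * r_half / (1 + r_half)

print(round(alpha, 3), round(split_half, 3))
```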
Chapter 7: Importance of Reliability
- Item difficulty (in a non-IRT framework) = what percentage of people answered correctly? How high is the mean?
- Extremely high means = low discrimination = poor psychometric rating

Chapter 8: Validity: Conceptual Basis
(I like the Messick (1995) definition, which defines validity as an argument for a specific interpretation of test results, with the different validity types being validity evidence to support the argument; luckily, the textbook uses it too)
5 types of evidence for construct validity:
- Internal structure (the dimensionality of the instrument should match the dimensionality of the construct based on theory)
- Response processes (make sure the process you want to measure is being used to answer the inventory; e.g., an extraverted person saying they don't attend parties often because their definition of "often" is different)
- Test content (content validity; construct representativeness + lack of contamination from other constructs/construct-irrelevant variance)
- Associations with other variables (discriminant + convergent validity = associative validity; criterion and predictive validity too)
- Consequences of use (subgroup differences -> potential for adverse impact, etc.)
Remember, reliability is a measure of the test scores themselves (and how strongly they correlate with actual differences on the underlying trait), while validity is about the soundness of test score interpretation

Chapter 9: Convergent and Discriminant Validity Evidence (only write down what's new) pg. 300; come back to this for the project; this chapter is a MONSTER
- Nomological network = the network of meaning surrounding a construct (Cronbach & Meehl, 1955)
- We generally use multitrait-multimethod matrices (or just correlations) to assess this
- Loevinger's paradox = reliability and external outcome validity are in contest; the higher you push reliability through related items, the lower your correlation between the test and external outcomes

Chapter 10: Response Biases (document page 352)

Chapter 12: Confirmatory Factor Analysis
- "Freely estimated parameters" = we think the values are not zero, so we let the software estimate them (parameters in CFA = factor loadings and factor correlations)
- For fit indices, remember to interpret with respect to both relative and absolute fit
- Remember that chi-square scales with sample size, so large samples are more likely to have a significant chi-square (bad); this is why we present multiple fit indices
- We can also test whether we have parallel test forms using CFA (we are essentially running a measurement invariance study across two forms of the test; specifically scalar invariance for tau equivalence)
- Remember: configural invariance = the same factors are present; metric = same factors and loadings; scalar = same factors, loadings, and intercepts (people with the average level of theta will have this value on an item); strict = all of the above + same error variances across groups

Chapter 14: IRT and Rasch (1PL) Models (again, only adding new information)
- If CTT is observed = true + error, IRT is: probability of endorsement = a logistic function of item difficulty and respondent theta
- Item discrimination (a) can be thought of like the ratio of item variance to total variance in CTT; lower a parameters can be caused by construct-irrelevant difficulty in the item (it's also the ability to differentiate between high and low theta; .5 is relatively high)
- Guessing parameter = c; it naturally gets smaller as the number of options increases
- #PL = #-parameter logistic model (how many parameters are present); all IRT models are logistic, as they deal with the probability of correct endorsement
- For the GRM (and other models for polytomous data), each response option gets a difficulty parameter, which makes sense since each option has a different probability of being endorsed at different levels of theta, so each option has a different 'inflection point' where the probability of endorsement passes .5. There is only one discrimination parameter (all curves are the same shape, whatever shape that might be)
- Estimating theta and the parameters (under the hood) involves taking the natural log of the proportion of correct answers (for theta) and the proportion of incorrect responses (for an item's difficulty)
- "Fit" in an IRT context is how well observed responses correspond to the expected responses implied by the model's probabilities (were the high-difficulty items answered correctly when the easy items were also answered correctly, but not vice versa, which would suggest a guessing parameter should be included, etc.)
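A quick sketch of the logistic item response function the chapter describes, written here as a 2PL with an optional guessing parameter (which turns it into a 3PL); the parameter values are made up. The ICC notes continue right below.

```python
import numpy as np

def p_endorse(theta, a, b, c=0.0):
    """Probability of endorsement/correct response under a 2PL (or 3PL if c > 0)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Hypothetical item: discrimination a = 1.2, difficulty b = 0.5, no guessing
thetas = np.linspace(-3, 3, 7)
print(np.round(p_endorse(thetas, a=1.2, b=0.5), 3))
# The probability climbs monotonically with theta and passes .5 exactly at theta = b
```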
Item characteristic curves (ICCs) are the graphs I always picture when thinking about IRT: probabilities of endorsement at different levels of theta
Item information/test information = IRT reliability (this essentially shows at what level of theta the test provides more information, i.e., the ability to discriminate between close cases); combine item information curves to get the test information curve
DIF, put simply, is when two individuals with the same theta from different subgroups have different probabilities of answering a question correctly (or endorsing options)
[attached figure]

Chapter 18: EFA (new stuff only)
Principal components analysis = like EFA except instead of just looking at common variance, it looks at total item variance (explained + error variance); it attempts to explain 100% of the variance in a set of items, instead of just the reliable variance (that's EFA). This is why criminology prefers it; it assumes there is no error and treats every item as a component which has differential contributions to prediction/explanation
Class notes: 3 main uses for FA:
1. Understanding the structure of a set of variables (dimensionality/covariances)
2. Constructing questionnaires
3. Reducing a dataset to a more manageable size (item reduction, combining collinear variables for prediction, like SES)
- In an FA framework, item variance is broken down into shared variance (variance common to the factor) and unique variance (specific variance + error)
- Communality is calculated by squaring the loadings for an item across all factors and summing them; if you do the same with every item across a single factor (the reverse), this is an eigenvalue
- Uniqueness is the variance that is unique to just that item (1 - communality); remember this is specific variance + error variance, but we can't actually separate them. So item variance = communality (h^2) + uniqueness (specific + error)
- If given the choice between overfactoring vs. underfactoring, underfactor
- Sam likes to use a factor loading of .3 or above to retain items. REMEMBER FOR HW AND EXAM: the pattern matrix is the one you want for rotated factor analysis; a cross-loading of .2 on the pattern matrix is considered worthy of attention according to Sam
- Idea of "simple structure": items should load onto ONE factor and one only; items with cross-loadings of .2 should be removed

Chapter 19: CFA (new stuff only)
Class notes: congeneric reliability (omega) = the ratio of variance explained by the factor (truth) to total variance (truth + error); do this for each factor of a multidimensional scale (see the sketch below)
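A small sketch of the communality/eigenvalue/uniqueness arithmetic from Chapter 18 and the congeneric omega from Chapter 19, using a made-up standardized loading matrix; the omega line assumes standardized items and ignores cross-loadings, so treat it as an illustration rather than the book's exact procedure.

```python
import numpy as np

# Hypothetical standardized loadings: 6 items on 2 factors (made-up values)
loadings = np.array([
    [0.70, 0.10],
    [0.65, 0.05],
    [0.60, 0.15],
    [0.10, 0.72],
    [0.05, 0.68],
    [0.12, 0.60],
])

communality = (loadings ** 2).sum(axis=1)   # h^2 per item (row-wise sum of squared loadings)
uniqueness = 1 - communality                # specific + error variance (can't separate them)
eigenvalues = (loadings ** 2).sum(axis=0)   # column-wise sum of squared loadings per factor

# Congeneric (omega) reliability for factor 1's three items, assuming standardized items
lam = loadings[:3, 0]
theta = uniqueness[:3]
omega = lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

print(np.round(communality, 3), np.round(eigenvalues, 3), round(omega, 3))
```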

Goodwin, Laura D.; Leech, Nancy L. (2003). The Meaning of Validity in the New Standards for Educational and Psychological Testing: Implications for Measurement Courses

<p dir="rtl">Review of what validity means according to the Standards for Educational and Psychological Testing (1999) <hr> Validity is no longer broken down into constuct/criterion/content, but is an overal argument for an interpretation of the test (we can have different kinds of evidence, but validity is unitary) Researchers argue about whether the consequenses of interpretations/uses should be considered (persriptive) in addition to the derscriptive nomological net 5 types of validity evidence Test-content (construct representativeness and lack of bias) Response processes (lack of socially desireable responding and other irrelevent pattern-affecting processes and halo) Internal Structure (CFA/DIF things) Other variables (convcergent + discriminant + criterion validity; the nom net) Evidence-based consequences The unitary view is better than the tripartite view for teaching students as well (easier to understand, keeps them from losing the big picture, etc.)

Campbell, Donald T.; Fiske, Donald W. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix

<p style="text-align: left">Multitrait-Multimethos matrix is useful for assessing convergent and discriminant valididty between constructs AND providing information on how much /method/ contributes <p style="text-align: left">Guide to Reliability/Validity diagonals and method-trait columns on pg. 82 <p style="text-align: left">&nbsp;<img alt="" data-attachment-key="W8DNKK3A" width="842" height="503">

Review of Item Response Theory Practices in Organizational Research

A survey of how IRT is currently being used, best practices and when they are/aren't being followed, and how practitioners feel about IRT
- The recognizable beginnings of IRT: Lord & Novick (1968) introduced item response functions and ML estimation of parameters
- Early IRT uses in I/O = fitting unidimensional IRT models to the JDI and using IRT to look at adverse impact
- IRT beats CTT because: a) the scale of measurement (theta) is separate from the items, allowing comparisons between people who took different items; b) item parameters do not vary by subpopulation (unless they do...ask Mike if this includes DIF)
- The GRM seems to be the most popular model by far (likely because of the use of polytomous items)
- The generalized graded unfolding model (Roberts et al., 2000) uses an "ideal point" model instead of the GRM's "dominance model" (if the individual is extreme, they will not endorse neutral items; look into this one)
- Remember the two main assumptions of traditional IRT: local independence (after controlling for theta, item responses are uncorrelated) and unidimensionality (or at least a single factor that accounts for MOST of the variance...at least 20%). Very few articles assess these assumptions; be sure to do this for the dissertation

Schmitt, Neal. Uses and abuses of coefficient alpha.

- Alpha is not to be used as a marker of unidimensionality; using it as such can lead to misinterpretations
- For multidimensional measures, using alpha to correct for attenuation in correlations will overestimate the true correlations
- Use of an absolute cutoff without considering test length is short-sighted, much like Cortina (1993) said
- When making an excuse for low reliabilities because of short test lengths, also include the caveat that any estimated relationships will be underestimated when using those scales, and consider developing a longer version in future directions
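A one-line sketch of the classic correction for attenuation that the warning above is about; the observed correlation and reliability values are made-up numbers.

```python
import math

# Hypothetical observed correlation and alpha estimates (made-up numbers)
r_xy = 0.30
alpha_x, alpha_y = 0.70, 0.80

# Classic correction for attenuation: divide by the geometric mean of the reliabilities.
# Per the note above, doing this with alpha on a multidimensional scale tends to
# overestimate the disattenuated correlation.
r_corrected = r_xy / math.sqrt(alpha_x * alpha_y)
print(round(r_corrected, 3))  # ~0.401
```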

Zickar, Michael J.; Gibby, Robert E.; Robie, Chet (2004). Uncovering Faking Samples in Applicant, Incumbent, and Experimental Data Sets: An Application of Mixed-Model Item Response Theory

Applied mixed-model IRT (IRT latent class modeling) to applicant vs. incumbent personality inventory data to see if faking groups would emerge as their own class (also used honest and coached faking groups for a separate analysis to compare)
Found that in the experimental study an honest class and an extreme faking class emerged (2-class solution); in the applicant vs. incumbent sample, 3 classes were needed: honest, slight faking, extreme faking (people even fake differently enough to form classes!)
Really cool application of mixed-model IRT
The assumption that applicants fake and incumbents do not is not supported by this research, as fakers were present in both incumbent and applicant groups

Zhang, Bo; Cao, Mengyang; Tay, Louis; Luo, Jing; Drasgow, Fritz (2020). Examining the item response process to personality measures in high-stakes situations: Issues of measurement validity and predictive validity

Applied ideal point and dominance models to high-stakes (hiring) personality data. Found that:
- Ideal point models fit better regardless of high or low stakes
- Ideal point models demonstrated fewer DIF items across stakes conditions
- Items that were "intermediate" were identified in both faking and honest conditions (ideal point goated)
- Ideal point and dominance models had similar predictive validity, so though ideal point is better, dominance won't get you in any hot water

Brown, Anna; Maydeu-Olivares, Alberto (2011). Item Response Modeling of Forced-Choice Questionnaires

Applying IRT to overcome the issue of classical scoring methodology producing ipsative data when used on multidimensional forced-choice formats
Introduces a multidimensional IRT model based on Thurstone's comparative data framework which can be used for any forced-choice questionnaire that fits a dominance response model, as long as the following conditions are met: fewer than 15 traits measured by 10 items each ...this one is dense; come back to this one
- Forced choice can overcome issues with Likert scaling (ambiguity of midpoints, extremity/midpoint bias, etc.)
- Multidimensional forced choice (MFC) items have different options corresponding to different factors (but only one can be chosen/ranked #1), e.g., either you relax often or you like Swiss cheese, but you can't pick both
- Current scaling of MFC items leads to ipsative data (think DnD starting stats: everyone has the same points to distribute, they're just in different traits), which makes it impossible to be above or below average on all scales, even though some people realistically should be; a big psychometrics issue in terms of interpretation
- Thurstonian IRT models fix this: first you code answers in terms of binaries (was this option ranked above that option?), pairwise, until all options are coded for all items (see the coding sketch below)
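A tiny sketch of the pairwise binary coding step described above; the block, statement labels, and ranking are invented for illustration.

```python
from itertools import combinations

# Hypothetical forced-choice block with 3 statements (each tapping a different trait)
# and one respondent's ranking of them (1 = most like me)
ranking = {"A": 1, "B": 3, "C": 2}

# Thurstonian-style coding: one binary outcome per pair,
# 1 if the first statement was ranked above the second
binary_outcomes = {
    (i, j): int(ranking[i] < ranking[j])
    for i, j in combinations(sorted(ranking), 2)
}
print(binary_outcomes)  # {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 0}
```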

Reise, Steven Paul (2010). Thurstone Might Have Been Right About Attitudes, but Drasgow, Chernyshenko, and Stark Fail to Make the Case for Personality

Author is unconvinced that we can apply ideal point models effectively to personality because (G R U M P alert):
(a) a questionable distinction between cognitive and noncognitive data as it relates to the applicability and interpretability of dominance response models
(b) a weak conceptual link between attitude measurement theory and personality trait assessment
(c) reliance on findings produced from a single (chi-square badness-of-fit) fit index applied to a limited set of self-report measures
(d) failure to provide empirical evidence that personality trait scales, in general, fail to include items that provide measurement precision in the "middle" or "intermediate" level of the trait range

Maydeu-Olivares, Alberto (2005). Further Empirical Results on Parametric Versus Non-Parametric IRT Modeling of Likert-Type Personality Data

Author believes that the misfit in Chernyshenko et al. (2001) (where the graded response model didn't fit personality questionnaires) is actually due to multidimensionality present in the data. He verifies this by comparing the fit of the GRM on an absolutely unidimensional problem-solving scale, finding the GRM outperformed all other parametric models
The GRM is the premier IRT model for polytomous responses, assuming unidimensionality is present (see the category-probability sketch below)
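A minimal sketch of GRM category probabilities (one discrimination parameter, ordered thresholds); the parameter values are made up.

```python
import numpy as np

def grm_category_probs(theta, a, b_thresholds):
    """Samejima-style graded response model: P(X >= k) is logistic in theta,
    and category probabilities are differences of adjacent cumulative curves."""
    b = np.asarray(b_thresholds)
    p_star = 1 / (1 + np.exp(-a * (theta - b)))       # P(X >= k) for k = 1..m
    p_star = np.concatenate(([1.0], p_star, [0.0]))   # boundaries for k = 0 and k = m+1
    return p_star[:-1] - p_star[1:]                   # P(X = k) for k = 0..m

# Hypothetical 5-point item: a = 1.5, thresholds spread across theta
print(np.round(grm_category_probs(theta=0.0, a=1.5, b_thresholds=[-2, -0.7, 0.7, 2]), 3))
# Each response option has its own threshold, but all curves share one discrimination
```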

Cortina, Jose M. (1993). What is coefficient alpha? An examination of theory and applications

Author claims that widespread misunderstanding of what alpha is is hindering our science; goes into detail about what alpha is and offers recommendations for proper use of alpha
Alpha is the appropriate tool when item-specific variance in a unidimensional test is of interest
- Alpha takes into account variance resulting from subjects and from the interaction between subjects and items
- Alpha is the mean of all split-half correlations of a scale ONLY if item standard deviations are equal; as the spread of item SDs increases, alpha decreases (or, if you use Flanagan's 1937 formula for split-half reliability, they are equal)
- Distinction: internal consistency = intercorrelation between items; homogeneity = unidimensionality. Alpha is related to, but does not imply, homogeneity/unidimensionality
- Alpha precision = the lack of error in intercorrelation based on the sampling of items you have (use this to assess dimensionality)
- Remember: an alpha of .80 represents double the interitem correlation for a 3-item scale compared to a 10-item scale; always interpret alpha with respect to the length of the scale (see the sketch below)
- Having more than 20 items can produce an acceptable alpha even when interitem correlations are small
- Having at least 14 items and 2 orthogonal dimensions produces the same
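A quick sketch of how the same alpha maps onto very different average interitem correlations depending on scale length, using the standardized-alpha (generalized Spearman-Brown) formula; the example values are chosen to roughly reproduce the 3-item vs. 10-item point above.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from scale length k and the average interitem correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Cortina's point: the same alpha implies very different interitem correlations
# depending on scale length (the mean_r values below are illustrative)
print(round(standardized_alpha(3, 0.57), 2))   # ~0.80 with only 3 items
print(round(standardized_alpha(10, 0.29), 2))  # ~0.80 with 10 items, roughly half the mean r
```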

The Influence of Dimensionality on Parameter Estimation Accuracy in the Generalized Graded Unfolding Model

Authors used simulated data to test the parameter estimation accuracy of the GGUM in different bidimensionality situations (different sizes/strengths of second factors)
Found that estimation error increases as the proportion of items loading onto a second trait/factor increases
In most cases, item extremity and conventional fit analyses should be able to identify items measuring a second trait
Takeaway: use the GGUM on unidimensional scales only, because the GGUM will only pick up one of the two dimensions when modeling a bidimensional scale
Remember, dominance model = normal: endorsement probability increases as theta increases, full stop (observed responses have a monotonically increasing relationship with theta); item response curves
In ideal point models, once the item's location is lower or higher than the respondent's theta, endorsement probability decreases (observed endorsement falls as the distance between the item location and the person's theta grows); option response curves each contain two subjective response category probability functions, one for disagreeing from above and one for disagreeing from below, so a four-option item will have 8 curves

Swaminathan, Hariharan; Hambleton, Ronald K.; Rogers, H. Jane (2006). Assessing the Fit of Item Response Theory Models (Ch. 21)

Book chapter on IRT model fit (probably should've started here lmao)
Re-teach yourself what likelihood ratio and log-likelihood are
At the end of the day, assessing model fit requires two basic steps:
1. Check assumptions (e.g., unidimensionality) via factor analysis (principal components: examine the largest eigenvalue/scree plot and decide from there whether a single factor is plausible); due to the nature of IRT items, the highest percent of variance we'll see is 35% or so
   - We can improve on this by generating simulated response data based on the distribution of theta to see what the highest eigenvalue we can expect is
   - We can improve even further by using non-linear factor analysis (as IRT is inherently nonlinear) through harmonic analysis
   - Local independence = item responses are independent when conditioned on the complete set of latent traits (basically, correlations between observed variables are 0 after partialing out the common factors)
2. Assess agreement between observations and model predictions
   - Test-level model fit = likelihood ratio statistic

Reckase, Mark D. (2009). Multidimensional Item Response Theory Models

Chapter 4: Multidimensional Item Response Theory Models
MIRT = multidimensional IRT. Two main types of MIRT models:
- Compensatory: theta on each trait is linearly combined (added); low theta1 + high theta2 can still achieve a correct answer/endorsement (as either theta increases, the probability of endorsement increases, though some items require significantly more of theta1 than theta2)
- Noncompensatory: the probability of a correct response is the product of probabilities for each theta; being low on one theta drags your probability down no matter what (unless the other theta is extremely high; partially compensatory)
Local independence assumption: responses to all test items are independent of responses to other test items (all variation in response probability is due to theta)
MIRT doesn't use b (difficulty) parameters, it uses d (intercept) parameters, which can be negated and divided by an element of the a parameter to get the difficulty on a given theta; this is called MDIFF instead of b (see the sketch below)
Note: come back to this when you're uh...smarter lmao

Chapter 6: Estimating Item/Person Parameters for MIRT
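A small sketch of the compensatory MIRT response function and the MDIFF idea described above; the a/d values are made up, and the MDIFF line uses the overall index (-d divided by the length of the a vector), which I believe is Reckase's, rather than the per-dimension version.

```python
import numpy as np

def p_compensatory(theta, a, d):
    """Compensatory MIRT: probability rises as the weighted sum of the thetas rises."""
    return 1 / (1 + np.exp(-(np.dot(a, theta) + d)))

a = np.array([1.2, 0.4])   # this item relies mostly on theta_1 (made-up values)
d = -0.5

# Low theta_1 can be partly compensated by a high theta_2
print(round(p_compensatory(np.array([-1.0, 2.5]), a, d), 3))

# Overall multidimensional difficulty (MDIFF-style index)
mdiff = -d / np.sqrt((a ** 2).sum())
print(round(mdiff, 3))  # ~0.395
```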

Drasgow, Fritz; Chernyshenko, Oleksandr S.; Stark, Stephen (2010). Improving the Measurement of Psychological Variables: Ideal Point Models Rock!

Comments on how to correctly apply ideal point models to responses in terms of dimensionality, CAT, DIF, and individual differences in responding
Firstly, Likert scales work well as a stand-by for measuring most noncognitive items; they have been shown to be valid for straightforward applications
But remember, Likert scales only make sense if a dominance model is assumed to underlie the response process

de Ayala, R. J.; Hertzog, Melody A. (1992). The assessment of dimensionality for use in item response theory

Compared the ability of multidimensional scaling (which assumes a monotonic, not linear, relationship between item completion and ability) to CFA and EFA in terms of assessing dimensionality.
Found that MDS and CFA were both able to identify the number of latent dimensions in all item sets. EFA struggled when interdimensional correlation was high.

Sinar, Evan F.; Zickar, Michael J. (2002). Evaluating the robustness of graded response model and classical test theory parameter estimates to deviant items

Compared how Samejima's (1969) graded response model and classical test theory handled (in terms of parameter estimation accuracy) items that fail to adequately assess the construct of interest (deviant items)
In general, the GRM estimates were more robust if there were more focal items than deviant items, the focal items were better at discriminating than the deviant items, and/or the focal and deviant items were positively correlated. CTT was more robust than the GRM only under extremely unfavorable conditions
Remember, the a (discrimination) parameter is analogous to an item-total correlation and is thus a measure of how much an item loads onto the underlying factor/how much total variance is accounted for
When faced with multidimensional data, the 3PL and 2PL tend to track the most potent factor (as long as intercorrelations are above .6 or so)

Meijer, Rob R.; Sijtsma, Klaas (2001). Methodology review: Evaluating person fit

Compares person-fit methods for CTT and IRT frameworks, as well as person-fit statistics in IRT across different item types and test lengths
Person-fit methods = appropriateness measurement: how far off from the proposed IRT model is a response pattern? Fit can be affected by cheating, good guessing, unfamiliarity with the test format, etc. (think error sources!)
Page 4 for individual person-fit indices...come back to this later

Dorans, Neil J. (2004). Equating, Concordance, and Expectation

Compares the scales of the ACT and SAT (uses multiple linkage techniques); an actual demonstration of linkage starts on pdf page 10
Determining linkability between two tests = construct similarity + statistical indices + rational considerations
Three types of test score links:
- Equated scores (most restrictive): produce test scores which can be used interchangeably (same construct, same metric); an equating function can always translate one into the other (inches to cm, though precision differs since cm also has mm); lack of reliability can affect this, and invariance across groups must be present
- Concordant scores: require construct similarity and reliability; distributions and percentiles are comparable, but scores can't be switched (think past vs. future ACT = yes, ACT vs. SAT = no); invariance across groups doesn't have to be present
- Expected scores (least restrictive): minimized error in the prediction of one score from the other (e.g., predicting a test score from GPA)
Which level is appropriate? Consider: similar constructs measured (content evaluation); the relationships between the scores in question (factor analysis + correlations) plus high reliability; invariance of the linkage relationship across subpopulations; and whether a predictor (test score 1) can reduce the uncertainty in the outcome (test score 2) by at least 50% (i.e., are the two scores correlated at least r = .866?). If so, concordance/equating are possible (see the arithmetic check below)
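A quick arithmetic check of the r = .866 rule of thumb above, assuming "reduction in uncertainty" means the proportional shrinkage of the standard error of estimate relative to the outcome's SD.

```python
import math

# Reduction in uncertainty from predicting one score with the other:
#   1 - sqrt(1 - r^2)
# Setting this equal to .50 and solving for r gives the .866 threshold.
r = math.sqrt(1 - (1 - 0.5) ** 2)
print(round(r, 3))  # 0.866
```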

Clause, Catherine S.; Mullins, Morell E.; Nee, Marguerite T.; Pulakos, Elaine; Schmitt, Neal (1998). Parallel test form development: A procedure for alternative predictors and an example

Demonstrates a procedure (item cloning) for developing alternate forms of tests which are parallel (similar means, SDs, and factor structures); considering item-by-item parallelism was the way to go
To achieve true Cronbachian parallelism, creating parallel forms of each item may be required, and this is certainly superior to domain sampling (generating an item pool and pulling from it)
- To be considered parallel (Cronbach, 1947), two tests must have the same general factor loadings and sub-group factor loadings (item clusters), as well as the same true score means, SDs, and item intercorrelations
- Domain sampling procedure (Nunnally & Bernstein, 1994) = create a pool of items representing the constructs of interest and randomly select from it to create alternate forms; this generally works, as long as the items themselves remain unidimensional (unlike biodata)
- Item cloning procedure (the authors did this one) = ensure each item on the original scale has a matching item (matched via developers and experts) on the alternate one; can work for multidimensional items

Drasgow, Fritz; Chernyshenko, Oleksandr S.; Stark, Stephen (2010). 75 Years After Likert: Thurstone Was Right!

Despite the dominance model recommended by Likert (which we have all been using forever), Thurstone's (1927) approach is a better representation of the ideal point choice process; lists why and when ideal point beats dominance
Thurstone method (ideal point): give items which represent the full range of theta; individuals will more strongly endorse the items which are closer to their underlying level
- Best used for response processes requiring introspection ("how closely does this item describe me?"); use this for high-fidelity information about choice processes
- Example item: "Abortion rights are too lax" (the theta amount/location would be in parentheses afterwards, like 8.0)
- Even when removing intermediate items from a personality scale, the 2PL and the unfolding model fit very similarly
Likert method (dominance): give items which ask about theta directly, and those with the highest item means are highest; respond strongly agree to positive items and strongly disagree to negative items (simpler, but can't use intermediate/double-barrelled items)
- Best used for evaluating a respondent against the difficulty of an item, such as in CAT
- You know what Likert items look like

Walton, Kate E.; Cherkasova, Lina; Roberts, Richard D. (2020). On the Validity of Forced Choice Scores Derived From the Thurstonian Item Response Theory Model

Empirically validated the use of Thurstonian IRT using personality items from the Big Five and HEXACO. Found:
- A high amount of convergent validity with single-stimulus tests of personality
- Lower divergent validity (high correlations between traits that shouldn't have been correlated) than ipsative scoring (a problem); specifically when conscientiousness is split into two related factors and when using the HEXACO (2/3 studies)
- A high amount of criterion-related validity (though at times the same level as ipsative) for satisfaction with life, GPA, and ACT scores for students
This review of FC items is excellent; come back here to dive in
Remember, FC (specifically multidimensional FC) is considered an antidote to faking

Detecting Faking on a Personality Instrument Using Appropriateness Measurement

Evaluates the use of IRT-based "appropriateness measurement" for detecting faking as compared to social desirability scales (using Army personality data)
Found that appropriateness measurement correctly classified more fakers with fewer false positives; bring this up as an IRT-based alternative for faking detection
- One reason organizations use ability tests and not personality inventories is fear of faking, despite the information they're leaving on the table
- Previous ways of detecting faking: writing verifiable items (were you a member of X clubs in school?; Becker & Colquitt, 1992); writing ambiguous items that make it hard for the taker to tell what's being measured (at the cost of validity and possible DIF; Edwards, 1970); social desirability items peppered into a measure (they sound good but should not all be endorsed)
- Appropriateness measurement (the big dog of this article; Levine & Rubin, 1979) compares the observed pattern of item responses to the expected responses (considering a respondent's standing on theta and the item response functions). Respondents who have very different observed and expected patterns will have a high appropriateness index (high chance of cheating, or some outside influence affected the response pattern); see the sketch below
- A 2-parameter logistic model (2PLM) is usually used for personality tests; it works for dichotomous personality items
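A sketch of one common appropriateness index, the standardized log-likelihood (lz) person-fit statistic; this is not necessarily the exact index used in this article. The response patterns and model-implied probabilities are fabricated, and with lz, aberrant patterns show up as large negative values.

```python
import numpy as np

def lz_person_fit(responses, probs):
    """Standardized log-likelihood (lz) person-fit index: the log-likelihood of the
    observed pattern, centered and scaled by its model-implied mean and variance."""
    u = np.asarray(responses, dtype=float)
    p = np.asarray(probs, dtype=float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - expected) / np.sqrt(variance)

# Hypothetical model-implied endorsement probabilities for one respondent
probs = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(lz_person_fit([1, 1, 1, 0, 0], probs), 2))  # consistent pattern: near zero/positive
print(round(lz_person_fit([0, 0, 0, 1, 1], probs), 2))  # aberrant pattern: strongly negative
```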

Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D. (1995). Fitting polytomous item response theory models to multiple-choice tests

Examined how well four popular polytomous IRT models (Bock's nominal model, Samejima's multiple-choice model, Thissen & Steinberg's multiple-choice model, and Levine's maximum-likelihood formula scoring model, which is nonparametric and allows ORFs to take whatever shapes they want) did with actual multiple-choice data
Found that:
- You can't just model multiple-choice responses by fitting the 3PL to dichotomized responses; polytomous models must be used (since they are so sensitive to violations of local independence)
- Bock's model fit surprisingly well; adding additional parameters didn't increase fit
- Levine's nonparametric model fit well across all tests as well; better than Bock's
Polytomous models improve over dichotomous models by allowing more information about theta to be extracted from a fixed set of items, an increased aberrant-response-pattern detection rate, and feedback about effective vs. ineffective distractors
- Bock's (1972) nominal model: the correct answer has the highest a and is selected most often by those with sufficient theta, while almost all those without sufficient theta will select the same wrong option (with the smallest a); we don't see this pattern in multiple-choice tests often (but it fit well anyway...weird)
- Samejima's (1979) multiple-choice model: allows for a "don't know" category in which respondents will guess
- Thissen & Steinberg's (1984) multiple-choice model: like the above, except it assumes that the "don't know" respondents will select options at different rates rather than perfectly even rates (and it can estimate those rates)
- Levine's (1993) maximum-likelihood formula scoring model
Do not use statistical significance to evaluate IRT model fit, as any IRT model with sufficient N will reject the null, since all models are technically misspecified
Conditional ORFs show the probability of selecting different incorrect answers at different levels of theta

Type I Error Rates for Generalized Graded Unfolding Model Fit Indices

Examined Type I error rates (false positives) for 4 fit indices used with the GGUM. Definitions and rankings are as follows:
Least error (S-tier):
- Infit: outfit weighted by variances; less sensitive to items far above or far below the respondent's trait score
- Outfit: standardized squared residuals (differences between expected and observed responses), unweighted
Conditionally good tier:
- Andrich's X^2: comes from the difference between expected and observed proportions ACROSS each response category
- Log-likelihood X^2: same as above but WITHIN each response category

Cronbach, Lee J.; Meehl, Paul E. Construct validity in psychological tests.

Explains construct validity in its most basic form
4 types of validity (as of 1955):
- Predictive validity (correlation with outcomes of interest)
- Concurrent validity (ability to be swapped for another related construct)
- Content validity (all aspects of the construct are represented within the items)
- Construct validity (the constructs measured account for variance in test performance; lack of alternative explanations); validated via conceptually correct subgroup differences, factor analyses, etc.
"Bootstrap effect" = the replacement of a criterion (e.g., how hot something is to the touch) with one that is more conceptually related to the construct and more accurate (e.g., a thermometer)

Roberts, James S.; Donoghue, John R.; Laughlin, James E. (2000). A General Item Response Theory Model for Unfolding Unidimensional Polytomous Responses

Explains the development and purpose of the generalized graded unfolding model (GGUM); this is an ideal point model (you can "strongly disagree" from below or above; the GGUM models and gives probabilities for both at all levels of agreement)
Note that the x-axis for IRCs is the respondent's theta MINUS the standing of the object (the distance between the two)
1) The GGUM allows the discrimination parameter to vary across items (allows items to discriminate among respondents in different ways (?))
2) It allows response category threshold parameters (?) to vary across items
A sample size of around 750 is required for accurate parameter estimation
- Ideal point process (Coombs, 1964): respondents endorse attitude statements to the extent that they match their opinion. Thus, a single-peaked response function is best for analyzing agree-disagree responses. In IRT terms, a respondent will agree with a statement to the extent that the person and statement are near the same amount of theta
- Parametric models like the GGUM, when properly used, are invariant to the items used to calibrate estimates, and item locations (difficulty) are invariant to sample distributions, which both allow for item banking and CAT
- The GGUM specifically is better than other unfolding models because it doesn't assume people use response categories the same way across items AND doesn't assume all items have identical discrimination levels
- There are also expected value curves which scale strength of endorsement (y) by theta-item distance (x) [attached figure]
- The higher the discrimination parameter, the more peaked the expected value function (how strongly an item is endorsed) becomes [attached figure]

Cortina, Jose M.; Sheng, Zitong; Keener, Sheila K.; Keeler, Kathleen R.; Grubb, Leah K.; Schmitt, Neal; Tonidandel, Scott; Summerville, Karoline M.; Heggestad, Eric D.; Banks, George C. (2020). From alpha to omega and beyond! A look at the past, present, and (possible) future of psychometric soundness in the Journal of Applied Psychology

Explores the past, present, and future of psychometric soundness measures and presents recommendations
Our field in its current state focuses on alpha and on short measures, leading to serious psychometric flaws, such as induced grammatical-redundancy-based common method bias
Reporting interitem correlations and offering a greater explanation of alpha is recommended
- Omega hierarchical (omega-h) reflects only the second-order factor common to all items (general factor variance/total variance); it is not affected by the existence of group (first-order) factors; reliability is now the amount of variance not attributable to random error (so long as you are examining a unidimensional scale)
- Djurdjevic et al. (2017) is a model scale development article (defines the construct and its boundaries, generates fully representative items, SMEs check items, factor structure test, reliability test, nomological net development)
- Note: check Messick (1995) to read about construct representativeness in item generation
- Always focus on construct representation, correspondence (do the items map onto the construct?), and discrimination (do the items NOT map onto related constructs?)

Roberts, James S.; Lin, Yan; Laughlin, James E. (2001). Computerized Adaptive Testing with the Generalized Graded Unfolding Model

Found that, using the GGUM (an unfolding model) within computerized adaptive testing, it is possible to obtain accurate person location estimates (get at theta) with a good level of precision using a small number of items (as few as 7 or 8) and around 750 respondents (specifically looked at student abortion stances)
We call the relationship between the person location and item location in the GGUM the "proximity relation" (theta - delta)
This is a VERY good review of the unfolding model; come back to this one

Stone, Clement A.; Zhang, Bo (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures

Goes over the advantages and disadvantages of alternative goodness-of-fit measures for IRT models
Come back to page 19 for the pros and cons of each GoF statistic once you understand more about chi-square distributions
Traditionally, we assess IRT fit by comparing theta subgroups in terms of expected vs. observed scores (a good model, obviously, is one that can accurately predict response patterns/frequencies at all levels of theta)
Steps (see the sketch at the end of this entry):
1. Estimate item and ability parameters
2. Form a number of theta subgroups to approximate the distribution
3. Construct the observed score distribution
4. Generate the expected score distribution based on the chosen model and parameter estimates
5. Compare observed and expected scores, usually via a model chi-square (a badness-of-fit index; higher = worse)
Issues with using X^2: the chosen number of theta subgroups is up to the researcher and can tangibly influence results, and it is sensitive to sample size
Of the traditional IRT fit statistics, the likelihood ratio X^2 based on Yen's (1981) Q1 statistic was most accurate
ALTERNATE PROCEDURES
- Orlando & Thissen (2000): I used this fit index for my thesis (signed chi-square). Compares expected frequencies to observed frequencies for scores (total scores on a scale), so theta estimates aren't needed for the observed proportions
- Stone et al. (1994): uses posterior expectations (using all available information to update the prior probability; it's Bayesian) to account for uncertainty in theta estimation. These authors claim that, particularly in short tests, uncertainty in theta estimation led to deviations in GoF approximations
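A toy version of the traditional subgroup-based check in the steps above, in the spirit of Yen's Q1 for a dichotomous item; this is an illustration, not the exact statistic from the article, and the group proportions are made up.

```python
import numpy as np

def q1_like_fit(observed_props, expected_props, group_sizes):
    """Pearson-type item fit across theta subgroups: compares observed vs.
    model-expected proportions correct in each group (higher = worse fit)."""
    o = np.asarray(observed_props)
    e = np.asarray(expected_props)
    n = np.asarray(group_sizes)
    return np.sum(n * (o - e) ** 2 / (e * (1 - e)))

# Hypothetical: 5 theta subgroups of 100 respondents each
observed = [0.22, 0.35, 0.55, 0.70, 0.88]
expected = [0.20, 0.38, 0.52, 0.73, 0.85]
print(round(q1_like_fit(observed, expected, [100] * 5), 2))  # compare to a chi-square reference
```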

Jones, Lyle V.; Thissen, David (2006). A History and Overview of Psychometrics (Ch. 1)

History and overview of psychometrics (don't include in flashcards)
Psychometrics origins: the Psychometric Society was founded in 1935 at the University of Chicago by Thurstone and colleagues -> first volume of Psychometrika in 1936 (official start)
Could be considered to start back in the 1860s with Fechner's psychophysics

Bock, R. Darrell (1997). A Brief History of Item Response Theory

How IRT got its start...not much meat on this one; seemed written for stats/engineering folks
First blip: Thurstone's (1925) attempt to scale mental development test items (percentage of students getting them correct) on an age scale (age acted as theta here); mental age was calculated not as how many questions were answered correctly but as the highest-difficulty item a child was able to get correct (like modern CAT)

Colquitt, Jason A.; Sabey, Tyler B.; Rodell, Jessica B.; Hill, Edwyna T. (2019). Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness

I/O scholars (apparently) tend to focus less on content validation than on other types of validity evidence
Lays out two strategies to content validate (Anderson & Gerbing, 1991 and Hinkin & Tracey, 1999) and how to evaluate the level of content validity
We evaluate: definitional correspondence (items correspond to the construct definition) and definitional distinctiveness (items correspond more to the construct definition than to those of orbiting constructs)
Anderson & Gerbing content validation approach:
- (Naive) judges are given one set of items and multiple (related) construct definitions
- Judges sort the items onto the appropriate construct definition pile
- Calculate the proportion of substantive agreement (correct-sorting judges / sorting judges) and the substantive validity coefficient ((correct sortings - incorrect sortings) / all sortings); see the sketch below
Hinkin & Tracey approach:
- Same as above, but judges instead rate Likert-style "how well does this item fit X construct?" questions for each item
- Calculate Hinkin-Tracey correspondence (average rating / number of anchors); do the same with orbiting constructs to calculate Hinkin-Tracey distinctiveness
What constitutes strong content validation evidence? It depends on the amount of correlation between the focal scale and orbiting scales
[attached figure]
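A small sketch of the index calculations as described above; the judge counts and ratings are made up, and the Hinkin-Tracey line simply follows the note's "average rating / number of anchors" description rather than quoting the article's exact formula.

```python
# Anderson & Gerbing indices for one item (made-up numbers): 20 judges sorted the item;
# 15 put it on the focal construct, and the most popular "other" construct got 3 sortings.
n_judges, n_correct, n_other = 20, 15, 3
p_sa = n_correct / n_judges               # proportion of substantive agreement
c_sv = (n_correct - n_other) / n_judges   # substantive validity coefficient

# Hinkin & Tracey style correspondence, as described in the note above
mean_rating, n_anchors = 5.6, 7
htc = mean_rating / n_anchors
print(p_sa, c_sv, round(htc, 2))  # 0.75 0.6 0.8
```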

Sun, Tianjun; Zhang, Bo; Cao, Mengyang; Drasgow, Fritz (2022). Faking Detection Improved: Adopting a Likert Item Response Process Tree Model

IRTree models to detect faking
Böckenholt (2013) three-process tree model: when deciding among an odd number of response options, respondents think through 3 phases (see the coding sketch below):
- Indifference (neutral or not?)
- Direction (agree or disagree?)
- Intensity (strong or regular?)
Can be applied to even-numbered scales by removing the indifference step
Direction and intensity responses have reflected acquiescence/extreme response styles well
Respondents who are faking should show a stronger "intensity" process = more extreme responses (this is supported by their studies, in both induced and natural high-stakes faking)
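A sketch of the usual IRTree pseudo-item coding for a 5-point response, following the three phases above; the function name and coding details are my own illustration, not the authors' code.

```python
def three_process_pseudo_items(response, scale_max=5):
    """Decompose a 5-point Likert response into three pseudo-items: indifference
    (midpoint chosen?), direction (agree side?), intensity (extreme category?).
    Direction/intensity are unobserved (None) when the midpoint is chosen."""
    midpoint = (scale_max + 1) / 2
    if response == midpoint:
        return {"indifference": 1, "direction": None, "intensity": None}
    return {
        "indifference": 0,
        "direction": int(response > midpoint),
        "intensity": int(response in (1, scale_max)),
    }

print(three_process_pseudo_items(5))  # strong agree: direction = 1, intensity = 1
print(three_process_pseudo_items(3))  # midpoint: only the indifference node is observed
```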

Borman, Walter C. (2010). Cognitive Processes Related to Forced-Choice, Ideal Point Responses: Drasgow, Chernyshenko, and Stark Got It Right!

The ideal point model is superior to the dominance model for introspective judgements (as evidenced by ideal-point-based computerized adaptive response scales for performance ratings instead of dominance-based BARS, since raters are more likely to use ideal point thinking when judging which of two options describes the ratee better)

Stark, Stephen; Chernyshenko, Oleksandr S.; Drasgow, Fritz; White, Leonard A. (2012). Adaptive Testing With Multidimensional Pairwise Preference Items: Improving the Efficiency of Personality and Other Noncognitive Assessments

Introduces CAT using multidimensional pairwise preference (MDPP) items (a form of forced choice), through four simulations exploring test design, dimensionality, and error estimation
Found that adaptive MDPP testing produces gains in accuracy over nonadaptive MDPP, much like the unidimensional case
Takeaway: CAT works with multidimensional forced-choice non-cognitive items too, and remains construct valid

Carter, Nathan T.; Dalal, Dev K.; Lake, Christopher J.; Lin, Bing C.; Zickar, Michael J. (2011). Using Mixed-Model Item Response Theory to Analyze Organizational Survey Responses: An Illustration Using the Job Descriptive Index

Introduces mixed-model IRT (IRT + LCA; we're looking at clustering, baby) and how to use it for: checking scoring assumptions, identifying the influence of systematic responding (unrelated to item content), and evaluating individual/group difference variables as predictors of class membership
THIS IS THE COOLEST THING EVER
Combines IRT models with latent class analysis (ohhhhhh I'm excited about this)
The main function is identifying subgroups for whom the item response-theta relationships (item response functions) are considerably different (I SHOULD USE THIS FOR MY DISSY?); after identifying the classes, you can look into variables that predict group membership (this is where I can slip in the theory!!)
Standard IRT assumes all individuals come from the same subpopulation (but this is likely not always true!); we can use MMIRT to get at subgroups that we can't ferret out using a priori DIF analysis
You just run the program while continuing to increase the number of latent classes you expect to find and stop when the fit stops improving (data-driven, which is ehhhh, but I can find a way to tie some theory to it)

Dueber, David M.; Love, Abigail M. A.; Toland, Michael D.; Turner, Trisha A. (2019). Comparison of Single-Response Format and Forced-Choice Format Instruments Using Thurstonian Item Response Theory

Introduces Thurstonian IRT as a way to measure forced-choice items (instead of Likert-style single response); an intro to forced choice, really...this one should've been first
Again, the pros of FC are the avoidance of acquiescence, extreme responding, social desirability, systematic score inflation, and construct-irrelevant variance
Analyzing FC (ipsative) data presents psychometric difficulties, which can be overcome using Thurstonian IRT (Brown, 2010)
Single response tends to have difficulty discriminating between high levels of theta (since these individuals will endorse all items at the high level), but FC can fix this

Zhang, Bo; Sun, Tianjun; Drasgow, Fritz; Chernyshenko, Oleksandr S.; Nye, Christopher D.; Stark, Stephen; White, Leonard A. (2020). Though Forced, Still Valid: Psychometric Equivalence of Forced-Choice and Single-Statement Measures

Investigated whether forced-choice and single-statement measures measure the same underlying constructs
Using both between- and within-subjects designs, found: a high degree of convergent validity between multi-unidimensional pairwise preference (MUPP; FC) and unfolding GGUM (SS) models, as well as similar levels of discriminant validity and criterion-related validity. Furthermore, the increased cognitive load of FC didn't seem to have a differential impact on respondents' emotional reactions (except being perceived as slightly harder and as saving time)

Feldt, Leonard S. (1997). Can Validity Rise When Reliability Declines?

It is said that performance tests can be more valid than multiple-choice tests despite their lower reliability. However, classical test theory states that reliability places limits on validity. How can both be true?
Lower reliability ALWAYS means lower validity ONLY IF the lower reliability is a result of shortening the test (i.e., the removed portions have the same variance as the rest of the test)
Reliability decreasing due to alterations in the CHARACTER of the test (e.g., removal of construct-irrelevant variance that was aiding reliability) CAN increase validity (see the sketch below for the usual CTT ceiling)
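A one-line sketch of the CTT ceiling being referenced: an observed test-criterion correlation cannot exceed the square root of the product of the two reliabilities. The reliability values are made-up numbers.

```python
import math

# CTT bound: r_xy <= sqrt(r_xx * r_yy), so lowering reliability lowers the ceiling on validity
r_xx, r_yy = 0.64, 0.90   # hypothetical reliabilities for the test and the criterion
max_validity = math.sqrt(r_xx * r_yy)
print(round(max_validity, 3))  # ~0.759
```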

Salgado, Jesús F.; Anderson, Neil; Tauriz, Gabriel (2015). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis

Meta-analysis comparing ipsative with quasi-ipsative measures of personality across job types
Quasi-ipsative measures showed superior predictive validity and validity generalization among forced-choice questionnaires
Quasi-ipsative = more like a profile, where scoring high in one area doesn't take away the ability to score high in other areas

Thompson, Nathan; Weiss, David (2019). A Framework for the Development of Computerized Adaptive Tests

Practical guide for creating a computerized adaptive test
[attached figure]
Return to this if it seems that CAT questions are common for psychometrics/IRT specialty areas (check this)
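A minimal CAT loop sketch under a 2PL (maximum-information item selection, EAP scoring on a grid, fixed-length stopping); the item bank, prior, and stopping rule are all invented for illustration and are not this framework's specific recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 50)            # fabricated item discriminations
b = rng.uniform(-2.0, 2.0, 50)           # fabricated item difficulties
grid = np.linspace(-4, 4, 161)
posterior = np.exp(-0.5 * grid ** 2)     # standard normal prior (unnormalized)
administered, theta_hat, true_theta = [], 0.0, 1.0

def p2pl(theta, a_i, b_i):
    return 1 / (1 + np.exp(-a_i * (theta - b_i)))

for _ in range(8):                        # fixed-length stopping rule for the sketch
    p = p2pl(theta_hat, a, b)
    info = a ** 2 * p * (1 - p)           # Fisher information at the current theta estimate
    info[administered] = -np.inf          # don't reuse items
    item = int(np.argmax(info))
    administered.append(item)
    u = int(rng.random() < p2pl(true_theta, a[item], b[item]))  # simulated response
    like = p2pl(grid, a[item], b[item])
    posterior *= like if u else (1 - like)
    theta_hat = float(np.sum(grid * posterior) / np.sum(posterior))  # EAP update

print(administered, round(theta_hat, 2))
```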

Brown, Anna (2016). Item Response Models for Forced-Choice Questionnaires: A Common Framework

Presents a unifying framework for use on all forced-choice IRT data. Models differ by 3 factors:
- Which FC format is used (pick between alternatives vs. ranking, with or without ties)
- The measurement model of the item/theta relationship (dominance vs. ideal point)
- The decision model for choice behavior (logit link (utility judgements are independent) vs. probit link (utility judgements are related))
Forced choice is the opposite of SS (single stimulus): comparative judgement vs. absolute judgement
Utility (Thurstone, 1929) is the emotion called forth by an item; in FC IRT, utility is usually the amount of agreement with an item (divide utility into individual differences on the construct and person x item interactions, one of which is always treated as error...this is the random utility model)
Think of utility as the distance from the item location on ideal point model axes
Linear factor analysis models = utility is a linear function of the item mean and personal attributes; usually used to measure one attribute only, and can approximate well when items represent truly positive or negative standing on the continuum
Ideal point models

Drasgow, Fritz; Lissak, Robin I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses

Proposes modified parallel analysis (MPA) as a tool to determine when the unidimensionality assumption of IRT is violated (which it is in most real data sets) to the extent that parameter estimation is impacted (used for dichotomous items)
Through Monte Carlo simulations, MPA is found to detect unidimensionality violations that interfere with parameter estimation
A lot like regular parallel analysis, except you specify the model with which to generate the synthetic responses, and the generated set always satisfies the unidimensionality assumption of IRT (see the sketch below)
Drasgow & Parsons (1983) found that a TRULY unidimensional latent trait space is not required for accurate parameter estimation (as long as there is a general factor underlying all first-order factors and the correlations among the first-order factors are above .46)
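For intuition, a sketch of ordinary (Horn-style) parallel analysis with random binary data; MPA differs in that the comparison data are generated from a fitted unidimensional IRT model rather than pure noise. All data here are simulated stand-ins, not real responses.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_items = 500, 10
real = rng.integers(0, 2, size=(n_people, n_items))  # stand-in for real 0/1 responses

def eigenvalues(data):
    """Descending eigenvalues of the interitem correlation matrix."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

# Reference eigenvalues averaged over 100 random data sets of the same size
random_eigs = np.mean(
    [eigenvalues(rng.integers(0, 2, size=(n_people, n_items))) for _ in range(100)],
    axis=0,
)
print(eigenvalues(real)[:3].round(2), random_eigs[:3].round(2))
# Retain factors whose real-data eigenvalue exceeds the corresponding reference eigenvalue
```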

Carter, Nathan T.; Lake, Christopher J.; Zickar, Michael J. (2010). Toward Understanding the Psychology of Unfolding

Proposes a research agenda focused on understanding the psychology of unfolding models. Specifically, while the authors agree that introspection is an important part of the process, there are three main issues they want to look at:
1) Individual differences in the use of ideal point processes
- Some individuals may be disposed to using ideal point or dominance models MENTALLY when responding, regardless of how the item was written ("I like to go to parties sometimes" can be difficult for those raised in a dominance-only culture); literal-mindedness is also a factor
- The amount of thought/effort put into items can also affect adherence to predicted response patterns, such that those who put a very large or very small amount of effort/thought in will depart from expected patterns (insufficient AND overly effortful item responding)
2) Item/stimulus content and administration modality (why do some items unfold and others do not?)
- Administering items one at a time seems to show promise in discouraging respondents from looking at extreme neighboring items and thus using dominance-style thinking (it also decreased standard error)
- Vague quantifiers (somewhat, sometimes) seem to hold promise in forcing unfolding (they encouraged it in half of the conscientiousness items, but not in the other half, so there are more factors at play)
3) The nature of the attribute being measured
- The authors hypothesize that the more abstract a construct, the more unfolding we will see

Messick, Samuel (1994). Validity of Psychological Assessment: Validation of Inferences from Persons' Responses and Performances as Scientific Inquiry into Score Meaning

Proposes a unified view of construct validity which subsumes all others, made up of six aspects: Content (content relevance, representativeness, and technical quality), Substantive (theoretical rationales for observed test response patterns), Structural (scoring accuracy/fitness), Generalizability, External (convergent + divergent validity), and Consequential (implications of using test scores as bases for real decisions). Construct validity is an evaluative summary of the evidence for and consequences of score interpretation. Deficiency = failure to capture the entire construct; contamination = measuring constructs that aren't supposed to be measured.
Construct-irrelevant difficulty/easiness can make individuals score much differently from their true scores (e.g., making a math problem also require high reading ability adds construct-irrelevant difficulty).

Borsboom, Denny; Mellenbergh, Gideon J.; van Heerden, Jaap2004The Concept of Validity

Proposes an alternative conception of test validity comprised of TWO things. A test is valid if: (1) the attribute it measures exists, and (2) variations in said attribute causally produce variation in measurement outcomes. That's it: does it measure a real thing, and are different scores caused by variations in that thing? The authors are focused on aligning the definition of validity with the "does it measure what it purports to measure" idea (Kelley, 1927).

Kane, Michael T.2001Current Concerns in Validity Theory

Shows an argument-based approach to two main questions in validity theory: performance-based vs. theory-based interpretations, and the role of consequences.
Stages of validity through the years. 1950, Cureton: criterion-based validity (how well do test scores and criterion task scores covary?); this doesn't validate the criterion beyond "seems right" and requires a suitable criterion to be present; content validation (via SMEs; construct representativeness) makes for less "ado" about suitability. 1954, Cronbach and Meehl: construct validity (understanding which underlying constructs explain variation in test scores; focuses on explanatory interpretations; think nomological net); this eventually grew into a general conception of validity (including reliability and all other types of validation) with construct validation as the unifying principle. Construct validity has a weak "program" (purely exploratory empiricism; required when little extant theory is there) and a strong "program" (theoretically based; basically theory testing). I'd argue that our current science is somewhere in between: while we approximate strong theory as best we can, divergent/convergent validity are often what we rely on. Remember, validity (in its current form) is an evaluation of the appropriateness of the interpretation of scores on a given test; as such, we should think about making validity INTERPRETIVE ARGUMENTS for test interpretations, not just researching validity.

Dalal, Dev K.; Withrow, Scott; Gibby, Robert E.; Zickar, Michael J.2010Six Questions That Practitioners (Might) Have About Ideal Point Response Process Items

Six questions that highlight gaps in our current ideal point model knowledge (specifically, ones practitioners might want answered):
1) How can we score ideal point measures without IRT parameters? We don't know, man; IPMs need large samples to score.
2) What do we do if we want to create an ideal point scale but have a small sample size? Out of luck; you're going to want around 750 folks to get reliable estimates for a 20-item scale.
3) How do we write ideal point items? We need to get on making some guidelines for this, though we know from Carter et al. (2010) that vague qualifiers are part of the puzzle.
4) What are the implications for scale length under ideal point assumptions? We don't know how many items you need to achieve reasonable theta estimates, but we know ideal point scales need to be longer than dominance ones to untangle disagreement from below vs. from above.
5) Can we expect higher criterion-related validity from ideal point measures? Drasgow et al. (2010) suggested more precision, and we may see rank-order changes using ideal point models even if the correlations don't change all that much, since we may see changes at the top of the distribution.
6) Can we apply ideal point models to engagement and satisfaction? Totally; it might even be better, since we can now better conceptualize a "neutral" midpoint.

Reise, Steven P.; Henson, James M.2003A Discussion of Modern Versus Traditional Psychometrics As Applied to Personality Assessment Scales

Speculates about how IRT can be applied to personality measurement (in 2003, before we did this more regularly) by comparing IRT to Classical Test Theory.
IRT is the norm in large-scale cognitive assessment. CTT: observed score = true score + random error; answers depend on the items themselves; two measures are parallel if true score variance is equal across them; test-dependent (psychometric properties change if an item is added or subtracted). IRT: a person has a true theta, which influences the probability of answers and doesn't depend on the items themselves; the item response function is the model for responses given theta; item difficulty (the b parameter) is the theta point at which endorsement becomes 50% likely (roughly -2 to +2); IRT estimates the joint relationship between item properties and test-taker properties. Unidimensional IRT = after controlling for theta, there is no relationship at all between item responses (this must be tested beforehand).
IRT jargon: Item Difficulty Index = mean difficulty of all items in a scale. Item-test correlation = correlation between item and total scale scores (is an item measuring the same construct as the rest of the test or not?). Discrimination parameter (a) = the slope of the item response function (HOW response probability changes as a function of theta; we allow the shape of the curve to change; higher slopes = better ability to discriminate between nearby levels of theta). So: a = discrimination = shape of the curve; b = difficulty = position of the curve. Item Information Function = how well an item differentiates between levels of theta; item informations can be added up to become the SIF/TIF (scale/test information function) and are inversely related to the standard error; information plays the role in IRT that reliability plays in CTT. (A minimal 2PL sketch follows below.) We can use responses to items to estimate levels of theta, which is the basis of CAT (starting with low discrimination/difficulty items and increasing from there); CATs can be half the length of standard tests with no loss of precision. We can even compare the thetas of people who take different tests through IRT scale linking.
IRT should be used to measure personality, mainly narrow traits (academic self-esteem more so than general self-esteem). It is also well suited to personality because theta can be estimated regardless of which items are skipped. The authors are still on the fence about recommending IRT for personality in general, but this was 20 years ago.
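A minimal numeric sketch (mine, not the article's) of the 2PL item response function and item information; the parameter values are illustrative. Summing item information across items gives the test information function, and 1/sqrt(information) gives the conditional standard error at each theta.

```python
import numpy as np

def irf_2pl(theta, a, b):
    # 2PL item response function: P(endorse) as a function of theta,
    # with discrimination a (slope) and difficulty b (location).
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    # For the 2PL, item information is a^2 * P * (1 - P).
    p = irf_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta_grid = np.linspace(-3, 3, 7)
a_params = np.array([0.8, 1.5, 2.0])   # illustrative discriminations
b_params = np.array([-1.0, 0.0, 1.0])  # illustrative difficulties

# Test information = sum of item informations; SE(theta) = 1 / sqrt(information).
test_info = sum(item_information_2pl(theta_grid, a, b)
                for a, b in zip(a_params, b_params))
for t, info in zip(theta_grid, test_info):
    print(f"theta={t:+.1f}  info={info:.2f}  SE={1 / np.sqrt(info):.2f}")
```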

Spector, Paul E.; Brannick, Michael T.2010If Thurstone Was Right, What Happens When We Factor Analyze Likert Scales?

Supports the claim by Drasgow et al. (2010) that applying an incorrect (dominance) model to a distribution of responses that actually reflects unidimensional unfolding can mimic a multidimensional dominance structure; applying incorrect models can mislead, go figure!

Drasgow, Fritz; Parsons, Charles K.1983Application of unidimensional item response theory models to multidimensional data

THIS is the citation saying that IRT models can be applied to item pools that are moderately heterogeneous (i.e., that have a moderately strong general factor). Generated 5 different item sets with varying levels of "prepotency" (general factor strength) and found that unidimensional models were able to recover the general factor for moderate prepotency (70% of variance accounted for by the general factor, 10% by the strongest first-order group factor) and up.
Decreasing the strength of the general factor in an item pool can cause unusually high difficulty parameter estimates.
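A hedged sketch, in the spirit of the simulation design described above, of generating dichotomous responses from one general factor plus group factors with adjustable "prepotency"; the loadings, sample size, and eigenvalue check are my own illustrative choices, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_hierarchical_items(n_persons, n_items, n_group_factors,
                                general_loading, group_loading):
    # Latent structure: one general factor plus orthogonal group factors.
    general = rng.normal(size=(n_persons, 1))
    groups = rng.normal(size=(n_persons, n_group_factors))
    group_of_item = np.arange(n_items) % n_group_factors

    # Continuous latent response = general + group + unique parts,
    # then dichotomized at 0 (a probit-style threshold).
    unique_sd = np.sqrt(1 - general_loading**2 - group_loading**2)
    latent = (general_loading * general
              + group_loading * groups[:, group_of_item]
              + unique_sd * rng.normal(size=(n_persons, n_items)))
    return (latent > 0).astype(int)

# Stronger vs. weaker "prepotency" of the general factor (illustrative values).
strong_general = simulate_hierarchical_items(2000, 30, 3, 0.75, 0.30)
weak_general = simulate_hierarchical_items(2000, 30, 3, 0.40, 0.60)
for name, data in [("strong general factor", strong_general),
                   ("weak general factor", weak_general)]:
    eig = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    print(f"{name}: first eigenvalue {eig[0]:.2f}, second {eig[1]:.2f}")
```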

Dudek, Frank J.The continuing misinterpretation of the standard error of measurement.

Talks about the standard error of measurement (SEM) as an estimate of the variability expected for observed scores when the true score is held constant. Remember that the confidence band belongs around the estimated true score (using the standard error of estimation, SEE), not around the observed score.
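The standard CTT formulas behind that point, as a small worked sketch; the scale values (mean, SD, reliability, observed score) are made up for illustration.

```python
import math

# Illustrative values: an IQ-style scale.
mean_x, sd_x, rxx = 100.0, 15.0, 0.90   # observed mean, SD, reliability
observed = 120.0

# Standard error of measurement: spread of observed scores around a fixed true score.
sem = sd_x * math.sqrt(1 - rxx)

# Estimated true score regresses the observed score toward the mean.
true_hat = mean_x + rxx * (observed - mean_x)

# Standard error of estimation: spread of true scores around the estimated true score.
see = sd_x * math.sqrt(rxx * (1 - rxx))

print(f"SEM = {sem:.2f}, estimated true score = {true_hat:.1f}, SEE = {see:.2f}")
print(f"95% band for the true score: "
      f"{true_hat - 1.96 * see:.1f} to {true_hat + 1.96 * see:.1f}")
```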

De Ayala, R. J.1995The Influence of Dimensionality on Estimation in the Partial Credit Model

When presented with two related thetas, the partial credit model seems to estimate the average of the two thetas more accurately than either theta alone, though all errors decrease as the data approach unidimensionality (via higher correlations between theta 1 and theta 2); one of the two thetas is consistently better estimated than the other.
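For reference, a minimal sketch (mine) of partial credit model category probabilities for a single polytomous item; the step parameters are illustrative assumptions.

```python
import numpy as np

def pcm_category_probs(theta, deltas):
    # Partial credit model: probability of scoring in category 0..M for one item.
    # deltas are the step (threshold) parameters between adjacent categories.
    # Numerator for category k is exp(sum_{j<=k} (theta - delta_j)), with an
    # empty sum (= 0) for category 0.
    cumulative = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expo = np.exp(cumulative - cumulative.max())  # subtract max for stability
    return expo / expo.sum()

deltas = [-1.0, 0.0, 1.5]   # illustrative step parameters for a 4-category item
for theta in (-2.0, 0.0, 2.0):
    probs = pcm_category_probs(theta, deltas)
    print(f"theta={theta:+.1f}:", np.round(probs, 3))
```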

Credé, Marcus2010Two Caveats for the Use of Ideal Point Items: Discrepancies and Bivariate Constructs

Two caveats for ideal point models (specifically focusing on double-barrelled, "I have equal amounts of A and B" style items):
1) Assessing discrepancies. Asking respondents to mentally compute differences between things (e.g., existing amount of freedom vs. desired amount of freedom) will not always lead to accurate calculations (Edwards, 2001), or may be assessing something entirely different from the actual perceived difference.
2) Applications of ideal point models to computer adaptive testing (specifically for bivariate constructs). Research on the Bivariate Evaluation Plane (Cacioppo & Berntson, 1994) suggests that being able to assess both positive and negative attitudes simultaneously (how much negative AND how much positive) allows identification of those who hold conflicting views AND increases predictive power for behaviors; double-barrelled ideal point items can't do this. This can be an issue when an item like "My life has equal ups and downs" could describe either bipolar or depressed people (many of both vs. none of either).

Hattie, John1985Methodology Review: Assessing Unidimensionality of Tests and Items

Unidimensionality = the existence of one latent trait underlying the data. Reliability = the ratio of true score variance to total variance. Reviewed all the unidimensionality indices: response-pattern-based (Loevinger's index), reliability-based (alpha), principal-components-based (eigenvalues), and factor-analysis-based (omega); MANY OF THE ABOVE FAIL BECAUSE THEY ARE BASED ON LINEAR MODELS. Latent-trait-model-based indices (residuals from applying specific models) fare best. According to this author, the most useful way to detect unidimensionality (or a lack thereof) is examining the size of residuals after fitting a 2PL or 3PL: "If the sum of residuals after specifying one dimension is reasonably small AND if the sum of residuals after specifying two dimensions is not much smaller, then it can be confidently assumed that the item set is unidimensional."
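A hedged illustration of how that decision rule could be applied, assuming you already have model-implied response probabilities from previously fitted one- and two-dimensional IRT models (faked here with placeholder arrays); the 5% tolerance is my own illustrative cutoff, not one Hattie specifies.

```python
import numpy as np

def residual_sum(observed, predicted):
    # Sum of squared residuals between observed 0/1 responses and the
    # model-implied endorsement probabilities.
    return float(np.sum((observed - predicted) ** 2))

def looks_unidimensional(obs, pred_1d, pred_2d, tolerance=0.05):
    # Hattie-style rule: the 1D residual sum is reasonably small AND the 2D
    # fit does not reduce it by much (here, by less than `tolerance` proportion).
    ss1, ss2 = residual_sum(obs, pred_1d), residual_sum(obs, pred_2d)
    return ss1, ss2, (ss1 - ss2) / ss1 < tolerance

# Hypothetical inputs: an N x J matrix of 0/1 responses and the predicted
# probabilities from previously fitted 1D and 2D models (placeholders only).
rng = np.random.default_rng(2)
obs = rng.integers(0, 2, size=(500, 20)).astype(float)
pred_1d = np.clip(obs + rng.normal(0, 0.30, obs.shape), 0.01, 0.99)
pred_2d = np.clip(obs + rng.normal(0, 0.29, obs.shape), 0.01, 0.99)

ss1, ss2, unidim = looks_unidimensional(obs, pred_1d, pred_2d)
print(f"SS(1D)={ss1:.1f}, SS(2D)={ss2:.1f}, treat as unidimensional: {unidim}")
```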

Modeling Faking Good on Personality Items: An Item-Level Analysis - SEARCH

Used an army sample (instructed to fake good, coached to fake good, or told to answer honestly) on a personality test. Responses were examined using Samejima's (1969) graded response model, and the authors found that: though theta estimates were largely different across conditions, there was very little DIF... which is bad. This means that the GRM can't detect faking on its own, which is why we need appropriateness measurement (Drasgow et al., 1996). CFA after the fact shows that faking leads to an increase in common variance unrelated to construct variance (likely a social desirability general factor). (Obviously this was single-stimulus, not forced choice, so no artificial ipsativity; and dominance, not ideal point, so no artificial multidimensionality.) Possible support for the theta-shift model (person change), since fakers (especially coached ones) had higher theta estimates than honest folks.
Theta-shift model (Zickar and Drasgow, 1996): some items are responded to honestly, but those which are "fakeable" (hard to verify and/or transparent) are responded to as if the respondent's theta were higher/lower; a minimal sketch of this idea follows below.
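A minimal illustration (mine, not the paper's code) of the theta-shift idea using graded response model category probabilities; the item parameters and the size of the shift are assumptions.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    # Graded response model: P(X >= k) = logistic(a * (theta - b_k));
    # category probabilities are differences of adjacent cumulative curves.
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]

a, thresholds = 1.5, [-1.5, -0.5, 0.5, 1.5]  # illustrative 5-category item
theta_true, shift = 0.0, 1.0                 # assumed faking shift on "fakeable" items

honest = grm_category_probs(theta_true, a, thresholds)
faked = grm_category_probs(theta_true + shift, a, thresholds)  # theta-shift response
print("honest:", np.round(honest, 3))
print("faked :", np.round(faked, 3))
# Under the theta-shift model, only items judged fakeable get the shifted theta;
# verifiable items would use theta_true.
```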

Reise, Steven P.; Widaman, Keith F.1999Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches

Used both IRT- and CSA-based (covariance structure approach) methods to identify individuals with low person-fit statistics. Found that the groups each method flagged as misfitting were largely different. Recommend not conducting statistical tests of person fit against the theoretical distributions of IRT- or CSA-based fit scores (as departures from normality are present), and instead obtaining simulated values to compare the fit statistics against.
CSA is like IRT, but instead of estimating response frequencies and evaluating how well the data fit, it estimates covariance matrices.
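A hedged sketch (mine) of the simulation-based comparison idea: simulate response vectors from an already-fitted 2PL at a person's estimated theta, compute a fit statistic for each (here simply the log-likelihood of the response pattern, my own choice of statistic), and see where the observed value falls in that simulated distribution. Item parameters and the observed responses are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_likelihood(responses, theta, a, b):
    # Log-likelihood of a 0/1 response pattern under the 2PL.
    p = p_2pl(theta, a, b)
    return float(np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p)))

a = rng.uniform(0.8, 2.0, 20)        # illustrative (assumed already-estimated) items
b = rng.normal(0.0, 1.0, 20)
theta_hat = 0.3                      # this person's estimated theta
observed = rng.integers(0, 2, 20)    # this person's observed responses (placeholder)

obs_fit = log_likelihood(observed, theta_hat, a, b)

# Null distribution of the fit statistic, simulated from the fitted model.
sims = []
for _ in range(2000):
    simulated = (rng.random(20) < p_2pl(theta_hat, a, b)).astype(int)
    sims.append(log_likelihood(simulated, theta_hat, a, b))

p_value = np.mean(np.array(sims) <= obs_fit)   # low values = poor person fit
print(f"observed fit = {obs_fit:.2f}, empirical p = {p_value:.3f}")
```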

Lang, Jonas W.B.; Tay, Louis2021The Science and Practice of Item Response Theory in Organizations

Reviews uses of IRT in the applied research world (scale development and theory testing).
1PL (Rasch) model = a dichotomous item (correct/incorrect) with a difficulty (b) parameter only. 2PL model (Birnbaum, 1968) = dichotomous item, but with both difficulty (b) and discrimination (a) parameters. 3PL = adds a guessing parameter. IRT, unlike CTT, can calculate a specific standard error for each level of theta (instead of assuming that error levels generalize from a norming sample, as CTT does). Ideal point models focus on getting individuals to respond accurately to LEVELS of a construct (they will select "happy" when they are happy but will not select "very happy"); they model individuals endorsing whichever item is closest to their true standing (instead of a "very happy" person simply being more likely to endorse "happy"). Tree models model the decision trees in respondents' heads as they differentiate response options (first neutral vs. valenced, then pick a direction, then pick a strength, etc.). Dynamic IRT models (as opposed to traditional, Thurstonian ones) assume that previous answers can affect future ones and can model motives (like social desirability, faking, etc.).
Four critical areas where IRT contributes to practice:
1) Testing/assessment: allows comparisons of respondents who took different (form-linked) item sets and puts all ability parameters on a common scale; enables computer adaptive testing; ensures that constructed tests are unidimensional (no construct-irrelevant variance here!); exploratory IRT (like the LLTMe framework) can explore which differences between items drive difficulty parameters; multidimensional IRT can be used to estimate models with one theta plus a method factor (a bifactor model could work here, for example) or multidimensional scales.
2) Questionnaire responding: can study how rating scales are used (like the JDI "?" option being used in different ways by different groups); process IRT models model... well... underlying response processes, which can span multiple items.
3) Construct validation: IRT fits the Borsboom (2004) view of validity (the construct exists, and variations in it cause variations in test responses) rather than the whole nomological net + concurrent validity approach (we don't have to predict; we can directly model the variation in responses caused by theta!).
4) Group equivalence of scores (measurement invariance, babeyyyy): use IRT to find DIF (e.g., one group guessing significantly more at the same theta level, captured by the 3PL on specific items whose wording is too difficult); a minimal 3PL DIF sketch follows below.
Challenges with current IRT: understanding how different IRT approaches and software packages differ from established approaches; experts' differing terminology can be confusing for the uninitiated; evaluating IRT model fit is still developing, and models can be inaccurate despite apparently good fit, so parameter recovery studies are recommended.
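A minimal sketch (mine, not from the article) of the 3PL and the kind of DIF described above: both groups share the same theta, but one group has a higher effective guessing parameter on an item, so endorsement probabilities differ at equal ability. Parameter values are illustrative.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    # 3PL: the guessing parameter c sets a lower asymptote on the response curve.
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta_grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
a, b = 1.4, 0.5                      # shared discrimination and difficulty
c_reference, c_focal = 0.15, 0.30    # focal group guesses more on this item

p_ref = p_3pl(theta_grid, a, b, c_reference)
p_foc = p_3pl(theta_grid, a, b, c_focal)
for t, pr, pf in zip(theta_grid, p_ref, p_foc):
    print(f"theta={t:+.1f}  reference={pr:.3f}  focal={pf:.3f}  gap={pf - pr:+.3f}")
# Equal probabilities at equal theta would indicate no DIF on this item;
# the gap here reflects the differing guessing parameter.
```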

Carter, Nathan T.; Dalal, Dev K.; Guan, Li; LoPilato, Alexander C.; Withrow, Scott A.2017Item response theory scoring and the detection of curvilinear relationships

We all know that curvilinear effects are difficult to detect, but the authors claim that correctly specifying the underlying response process (ideal point vs. dominance) is important for detecting curvilinear effects. This is supported by 2 simulation studies (using ideal point models for ideal point data and dominance models for dominance data), evidenced by low Type I error rates and high power.
Remember, dominance models make S-shaped IRCs while ideal point/unfolding models make normal-curve-shaped ORCs (see the sketch below).
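A small sketch (mine) of that shape difference: a 2PL dominance curve rises monotonically in an S shape, while a simple squared-distance ideal point curve is single-peaked around the item location. The Gaussian-style form for the ideal point curve is a convenient illustration, not the specific unfolding model the paper uses.

```python
import numpy as np

def dominance_irc(theta, a=1.5, b=0.0):
    # S-shaped (monotonic) 2PL item response curve.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def ideal_point_irc(theta, delta=0.0, spread=1.0):
    # Single-peaked curve: endorsement falls off with squared distance
    # from the item location delta (one simple unfolding-style form).
    return np.exp(-((theta - delta) ** 2) / (2 * spread ** 2))

theta_grid = np.linspace(-3, 3, 7)
print("theta      :", np.round(theta_grid, 1))
print("dominance  :", np.round(dominance_irc(theta_grid), 2))
print("ideal point:", np.round(ideal_point_irc(theta_grid), 2))
# The dominance curve keeps increasing with theta; the ideal point curve
# peaks at delta and declines on both sides.
```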

Brown, Anna; Maydeu-Olivares, Alberto2010Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

While the authors agree with Drasgow et al. (2010) that responses to introspective questions always involve comparisons, they disagree that the comparison point is always one's self-perception vs. the statement's location. RATHER, the authors argue that, depending on the construct, ideal point models may be inaccurate because respondents are instead simply using an "endorse if my standing is high enough" process (dominance). More usage of intermediate items will help bear this out, since when only extreme items are present, ideal point and dominance models produce the same patterns.
Some issues with ideal point models:
1) It is harder to write intermediate items for ideal point models. Writing intermediate items that aren't double-barrelled is hard but required. These items should also not reference an external group ("my room is cleaner than average"), as respondents' ideas about normal will differ. Items should not be overly context-specific (try "I enjoy chatting" instead of "I like to chat at cafes").
2) Estimation may be less accurate for ideal point models. Ideal point models are still new, and it has not yet been shown that estimates of item parameters (and thus item characteristic curves) are as accurate as with the tried-and-true dominance models.
3) Ideal point models are not invariant to reverse-coded items: modeling dissatisfaction will yield a different set of parameters from modeling satisfaction, which is not the case with dominance models.
4) Ideal point models aren't required to model forced-choice responding.
5) Ideal point unidimensional and multidimensional dominance items can look the same, so think carefully about which model to apply.

Waples, Christopher J.; Weyhrauch, William S.; Connell, Angela R.; Culbertson, Satoris S.2010Questionable Defeats and Discounted Victories for Likert Rating Scales

You could technically still model unfolding items with Likert scales. Likert items can capture intermediate levels of traits with their midpoints. Likert scales are more reliable and practical as well.

Sireci, Stephen G.; Parker, Polly2006Validity on Trial: Psychometric and Legal Conceptualizations of Validity

In general, there is strong congruence between the Standards and how validity is viewed in the courts, and testing agencies that conform to these guidelines are likely to withstand legal scrutiny. However, the courts have taken a more practical, less theoretical view of validity and tend to emphasize evidence based on test content and testing consequences.

