Comps - Methodology


PSYCHOMETRICS - Computer Adaptive Testing (CAT) Lecture Notes Previous comps question: "What are some practical issues with CAT?"

1) What do you use as an initial ability estimate (before you have any data)? The statistically optimal starting value is the mean of the theta density (usually the density is assumed to be standard normal). But this slightly biases test-takers' scores toward the mean. Other choices might result in more bias overall but may be more acceptable.
2) How many items are required? This is a tricky question and frequently misunderstood. In order to give everyone a very reliable tailored test, potentially very large numbers of items may be needed. Also, security concerns may dictate having a very large item pool. However, any test can be administered adaptively. The resulting CAT may be shorter while still being roughly as reliable as the original test (the original test length and reliability are limiting factors).
3) What are some "stopping criteria" which end the administration? Commonly, tests have limits on the number of items administered, and sometimes the total administration time. Some CATs also have an accuracy criterion: when the calculated SEM falls below a threshold, the test is terminated (a code sketch of the adaptive loop appears after this card).
4) Is CAT a panacea for all security concerns? Unfortunately, no. CBT solves some problems like lost test booklets or people copying from a neighbor (because the items are typically administered in a random ordering). But new problems are introduced, such as the possibility of the test being stolen and duplicated electronically. Also, most CBTs are administered "on-demand," which can cause sig. security problems. The adaptive administration itself poses security issues. Operational CATs use an "exposure control" algorithm to introduce randomness into the choice of items. Without these algorithms, a few items will appear on almost everyone's tailored tests and may well become compromised quickly. Also, unless the adaptive algorithm is amended, it will not use the majority of any sizable item bank (because most of the items are not "most informative" for any theta).
5) Do candidates accept CATs? Reactions vary, and some candidates appreciate the shortened timing and on-demand testing. However, many examinees dislike the restrictions placed upon CAT administrations, such as not being able to skip or return to a previous item. There was initial concern that high-ability test-takers would perceive the test as hard and unfair, which has led to modified algorithms that pick slightly easier items. However, there is no evidence that CATs differ from paper versions.
6) Do courts accept CATs? No legal precedent exists. For a long time, there were few instances of adaptive testing and apprehension about explaining the fairness of IRT scoring to a judge. Examinees take different tests, so how can they have comparable scores? How can everyone get about 50% right, yet people get different scores? Today, CATs are fairly widespread, with major testing programs like the GRE and the Uniform CPA Exam.
7) Can test time really be vastly reduced while boosting reliability? Partially. It turns out that the most informative items are also the ones which take the most testing time (you spend little time on items which you know the answer to or which you need to guess on). So, a 30-item CAT may have similar reliability to a fixed 100-item test, but it might take more than 33% of the original testing time. Also, boosting reliability is definitely possible, but it depends upon having many good (discriminating) items.
8) What preparation is needed for creating a CAT? The CAT cannot be administered or scored unless all the items have been scaled onto the same metric using IRT. Therefore, the items must be pretested on a large sample prior to the initial CAT administration. A common tactic is to create several conventional forms and administer them as fixed forms for a longer period of time in order to gather a sufficient sample to calibrate the items. At some point, the program "turns on" the adaptive algorithm after scaling the items and loading the IRT parameter data into the system. Thereafter, new items must be seeded into many operational administrations without being scored, eventually being calibrated and merged with the operational items.
9) Barely adaptive tests? Operational CATs are constrained in many ways, not just for security reasons. Well-developed tests have content specs, so typically only items of a certain content area are sought during item selection. For some tests, the surface characteristics of the items are also controlled. So, e.g., a CAT algorithm might be seeking informative items from content area 5, subarea B, which have Hispanic- or Asian-sounding names, with at least one female and no more than one male. In such cases, fairly complex optimization routines must be employed to find eligible items. As Martha Stocking's quip implies, these constraints severely hamper the ability of the CAT to tailor the test and thus reduce the measurement precision of the CAT.
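A minimal sketch of the adaptive loop described above, assuming Python/numpy and a made-up 2PL item bank (all parameter values and thresholds here are hypothetical): start theta at the prior mean, pick the most informative unused item, update the estimate, and stop on a maximum-items or SEM criterion. Operational CATs would add exposure control and content constraints on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2PL item bank (a = discrimination, b = difficulty); values are made up.
n_items = 200
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """2PL Fisher information: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def eap(administered, responses, grid=np.linspace(-4, 4, 161)):
    """EAP theta estimate and posterior SD under a standard normal prior."""
    post = np.exp(-0.5 * grid ** 2)              # prior density (up to a constant)
    for j, u in zip(administered, responses):
        p = p_correct(grid, a[j], b[j])
        post *= p if u == 1 else (1.0 - p)
    post /= post.sum()
    theta_hat = np.sum(grid * post)
    sem = np.sqrt(np.sum((grid - theta_hat) ** 2 * post))
    return theta_hat, sem

true_theta = 1.2                                 # simulated examinee
theta_hat, sem = 0.0, np.inf                     # start at the prior mean (see point 1)
administered, responses = [], []

while len(administered) < 30 and sem > 0.30:     # stopping criteria (see point 3)
    info = item_info(theta_hat, a, b)
    info[np.array(administered, dtype=int)] = -np.inf   # do not reuse items
    j = int(np.argmax(info))                     # maximum-information selection (no exposure control)
    u = int(rng.random() < p_correct(true_theta, a[j], b[j]))
    administered.append(j)
    responses.append(u)
    theta_hat, sem = eap(administered, responses)

print(f"items used: {len(administered)}, theta_hat = {theta_hat:.2f}, SEM = {sem:.2f}")
```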

PSYCHOMETRICS - DIF Previous comps question: "What are the IRT methods of DIF detection?"

3 IRT methods. Pros of IRT methods over CTT methods: 1) ICCs and item parameters are not dependent on the sample (subpopulation invariant); 2) When the correct model is applied to fit the data, IRT methods do not confound DIF with impact (which is differences in the means of the focal and reference groups).
1) Lord's chi-square
A) Lord (1980) proposed a chi-square statistic for testing the null that the item parameters are equal in the two populations (focal and reference), given the sample-based item parameter estimates
B) Also recommended that all item parameters be estimated first using the combined sample of focal group and reference group examinees
C) The "a" and "b" parameters are then re-estimated separately for the focal and reference groups, with the "c" parameters fixed at values obtained from the combined-sample analysis
D) Then one only needs to test for the equality of the "a" and "b" parameters across the focal and reference groups for each item
PROS: All the standard IRT DIF strengths, easy to use and interpret, powerful, meaningful effect - assesses differences in the "a" and "b" parameters
CONS: The method compares the item parameters directly while it is differences in the ICCs that cause trouble (tests the parameters while DIF is defined as differences in the ICCs and not the parameters). The chi-square is known to be very sensitive to sample size. Does not distinguish between uniform and non-uniform DIF. With the chi-square method, if we reject the null then there is DIF. However, we are hoping NOT to reject the null (that there is no DIF). This violates the logic of hypothesis testing, which says that you can reject the null or fail to reject the null (FTRN) but you cannot find the null to be true.
2) Area Measures
A) Assesses DIF by measuring the area between the focal group's and the reference group's ICCs
B) If there is nonuniform DIF, one must be careful in computing the area between the two ICCs because if the sign of the area is positive at one point and negative at another point, the total area in absolute terms could be less than the two areas combined, or the two areas could cancel each other out.
C) Two different measures of area, signed area (gives info about uniform DIF) and unsigned area (gives info about nonuniform DIF - uses absolute values), are defined to address this issue
D) Raju (1988) offered formulas for computing the exact signed and unsigned area measures and recommended a statistical significance test for the SA and UA estimates
E) When the groups are on a common metric and the two ICCs are identical, there is no DIF
F) An advantage of this method is that it gives an index as well as a sig. test for DIF analysis (a numeric sketch appears after this card)
PROS: Compares ICCs directly, provides measures of uniform and non-uniform DIF, expresses a meaningful effect (area between the ICCs)
CONS: Method doesn't account for compensatory and noncompensatory DIF measures. Area between ICCs isn't that understandable. Guessing (c) parameters can never be compared because, unless the c parameters are the same, the area between the two ICCs is infinite
3) Raju's DFIT framework
A) Raju et al. proposed a framework for assessing differential functioning of items and tests (DFIT)
B) Given a person's score on the underlying construct and the parameters of the focal and reference groups, two scores are calculated for each person: one score using the focal group item parameters and another using the reference group parameters.
*If the two scores are the same, then the item parameters are the same for all values of theta (standing on the trait)
*The true score on the test is the sum of true scores on each item in the test
*The noncompensatory DIF (NCDIF) index is defined at the item level and represents the average squared difference between the focal and reference groups' item-level scores (assumes that all other items have no DIF when you are testing the specific item - this is the main difference between NCDIF and CDIF)
*DTF is defined at the test score level and is the average squared difference between the reference and focal groups' true scores at the test level. You can sum up item difference scores to arrive at the test true score; if they are sig. different, then you have DTF
*Compensatory DIF (CDIF) is also defined at the item level, but the sum of these indices is equal to the DTF index, so researchers can decide which items to delete to make the test index nonsignificant. CDIF doesn't make the assumption that DIF only exists within the item that you are testing.
PROS: Encompasses DTF as well as DIF and includes compensatory as well as noncompensatory DIF measures
CONS: Traditionally the critical values were heuristic-based, fairly new compared to other methods, sometimes difficult to interpret
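A rough numeric illustration of the signed and unsigned area measures, assuming Python/numpy, made-up 2PL parameters on a common metric, and equal (zero) c parameters. Raju (1988) gives exact closed-form formulas; this sketch just approximates the areas over a bounded theta grid.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters on a common metric (values are made up)
a_ref, b_ref = 1.2, 0.00    # reference group
a_foc, b_foc = 0.8, 0.35    # focal group

theta = np.linspace(-4, 4, 801)
dtheta = theta[1] - theta[0]
diff = icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc)

signed_area = np.sum(diff) * dtheta            # signed area: uniform DIF (cancellation possible)
unsigned_area = np.sum(np.abs(diff)) * dtheta  # unsigned area: nonuniform DIF (absolute values)

print(f"SA = {signed_area:.3f}, UA = {unsigned_area:.3f}")
```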

METHODS - MANOVA Previous Comps Question: 1) What are the assumptions of MANOVA, how can you evaluate whether the assumptions have been met, and what should be done if the assumptions have been violated?

ASSUMPTIONS OF MANOVA
*1) Observations are independent of one another.* MANOVA is NOT robust when the selection of one observation depends on the selection of one or more earlier ones, as in the case of before-after and other repeated measures designs. However, there does exist a variant of MANOVA for repeated measures designs.
*2) The IV (or IVs) is categorical.*
*3) The DVs are continuous and interval-level.*
*4) Low measurement error of the covariates.* The covariate variables are continuous and interval-level, and are assumed to be measured without error. Imperfect measurement reduces the statistical power of the F test for MANCOVA, and for experimental data there is a conservative bias (increased likelihood of Type II errors: thinking there is no relationship when in fact there is a relationship). As a rule of thumb, covariates should have reliability coefficients of .80 or higher.
*5) Equal group sizes.* To the extent that group sizes are very unequal, statistical power diminishes. SPSS adjusts automatically for unequal group sizes. In SPSS, METHOD=UNIQUE is the usual method.
*6) Appropriate sums of squares.* Normally there are data for every cell in the design. E.g., a 2-way ANOVA with a 3-level factor and a 4-level factor will have 12 cells (groups). But if there are no data for some of the cells, the ordinary computation of sums of squares ("Type III" is the ordinary, default type) will result in bias. When there are empty cells, one must ask for "Type IV" sums of squares, which compare a given cell with averages of other cells. In SPSS: Analyze, General Linear Model, Univariate; click Model, then set "Sums of Squares" to "Type IV" or another appropriate type depending on one's design:
 1) Type I. Used in hierarchical balanced designs where main effects are specified before first-order interaction effects, and first-order interaction effects are specified before second-order interaction effects, etc. Also used for purely nested models where a first effect is nested within a second effect, the second within a third, etc. And used in polynomial regression models where simple terms are specified before higher-order terms (ex., squared terms).
 2) Type II. Used with purely nested designs which have main factors and no interaction effects, or with any regression model, or for balanced models common in experimental research.
 3) Type III. The default type and by far the most common, for any models mentioned above and any balanced or unbalanced model as long as there are no empty cells in the design.
 4) Type IV. Required if any cells are empty in a balanced or unbalanced design. This would include all nested designs, such as Latin square designs.
*7) Adequate sample size.* At a minimum, every cell must have more cases than there are DVs. With multiple factors and multiple dependents, group sizes fall below minimum levels more easily than in ANOVA/ANCOVA.
*8) Residuals are randomly distributed.*
*9) Homoscedasticity (homogeneity of variances and covariances).* Within each group formed by the categorical IVs, the variance of each interval dependent should be similar, as tested by Levene's test, below. Also, for each of the k groups formed by the IVs, the covariance between any two DVs must be the same. When sample sizes are unequal, tests of group differences (Wilks, Hotelling, Pillai-Bartlett, GCR) are NOT robust when this assumption is violated. The Pillai-Bartlett trace was found to be more robust than the alternatives when this assumption was violated but sample sizes of the groups were equal (Olson, 1976).
 a) Box's M: Box's M tests MANOVA's assumption of homoscedasticity using the F distribution. If p(M) < .05, then the covariances are significantly different. Thus we want M not to be significant, failing to reject the null hypothesis that the covariances are homogeneous. That is, the probability value of this F should be greater than .05 to demonstrate that the assumption of homoscedasticity is upheld. Note, however, that Box's M is extremely sensitive to violations of the assumption of normality, making the Box's M test less useful than might otherwise appear. For this reason, some researchers test at the p = .001 level, especially when sample sizes are unequal.
 b) Levene's Test: SPSS also outputs Levene's test as part of MANOVA. If Levene's test is sig., then the data fail the assumption of equal group variances (see the sketch after this card).
*10) Homogeneity of regression.* The covariate coefficients (the slopes of the regression lines) are the same for each group formed by the categorical variables and measured on the dependents. The more this assumption is violated, the more conservative MANCOVA becomes (the more likely it is to make Type II errors - accepting a false null hypothesis). When running a MANCOVA model in SPSS, include in the model options the interactions between the covariate(s) and each IV - any significant interaction effects indicate that the assumption of homogeneity of regression coefficients has been violated. See the discussion in the section on testing assumptions.
*11) Sphericity.* In a repeated measures design, the univariate ANOVA tables will not be interpreted properly unless the variance/covariance matrix of the DVs is circular in form (see Huynh and Mandeville, 1979). A spherical model implies that the assumptions of multiple univariate ANOVA are met, i.e., that the repeated contrasts are uncorrelated. When there is a violation of this assumption, the researcher must use MANOVA rather than multiple univariate ANOVA tests.
 a) Bartlett's and Mauchly's Tests of Sphericity. When the test is sig., the researcher concludes that the sphericity assumption is violated.
*12) Multivariate normal distribution.* For purposes of sig. testing, variables follow multivariate normal distributions. In practice it is common to assume multivariate normality if each variable considered separately follows a normal distribution. MANOVA is robust in the face of most violations of this assumption if sample size is not small (e.g., < 20).
*13) No outliers.* MANCOVA is highly sensitive to outliers in the covariates.
*14) Covariates are linearly related or in a known relationship to the dependents.* The form of the relationship between the covariate and the dependent must be known, and most computer programs assume this relationship is linear, adjusting the dependent mean based on linear regression. Scatterplots of the covariate and the dependent for each of the k groups formed by the independents are one way to assess violations of this assumption. Covariates may be transformed (ex., log transform) to establish a linear relationship.
*15) Does not assume linear relationships; can handle interaction effects.*
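A small sketch of the Levene check mentioned in 9b, assuming Python with scipy and made-up data for one DV across three groups. Box's M is not available in scipy, so only the univariate equal-variance test is shown here; in a real MANOVA this would be run separately for each DV.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: one DV measured in three groups (values are made up)
g1 = rng.normal(50, 10, 40)
g2 = rng.normal(52, 10, 40)
g3 = rng.normal(55, 15, 40)

stat, p = stats.levene(g1, g2, g3, center="median")
print(f"Levene W = {stat:.2f}, p = {p:.3f}")
# p < .05 -> the equal-variance assumption is questionable for this DV
```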

PSYCHOMETRICS - G Theory Outline Previous comps question: "Provide overview of G Theory"

It may help to think of G theory as ANOVA of test scores. The question G theorists have to ask is "What factors affect measurement?" and then they need to collect data on these factors and analyze them in a G theory framework. The G theory framework relies heavily on ANOVA but lends its own terminology to many aspects.
ANOVA reminder: Recall that ANOVA involves factors with several levels. The analysis aspect of ANOVA is the partitioning of variance into main effects of factors, interactions of factors, and error. These variance components are "sums of squares." Some details of the calculation make specific assumptions about the factors. Fixed factors are those in which our interests are descriptive (rather than inferential). Random factors are those in which our interests are inferential. Type III sums of squares are the kind preferred by many statisticians for random factors. Each main effect and interaction has a number of degrees of freedom. Mean squares are simply the sums of squares divided by the degrees of freedom.
FACET: A factor that is hypothesized / feared to affect measurement; a factor in ANOVA
CONDITION: A level of a facet; also called a level or condition in ANOVA
UNIVERSE OF GENERALIZABILITY: The situations, people, items, etc. to which you wish to generalize (or to which you can generalize). Essentially what we talk about as populations when we discuss sampling. Note that there may well be many universes to which you wish to generalize.
UNIVERSE OF ADMISSIBLE OBSERVATIONS: All possible conditions for a given facet.
GENERALIZABILITY (G) STUDY: A study of facets to determine generalizability, usually the responsibility of the test developer
DECISION (D) STUDY: A study of generalizability for a specific purpose (to determine the feasibility of using the test for that purpose, or to engineer a test to a purpose); usually the responsibility of the test user
RANDOM FACET: A facet which is a random variable, i.e., the conditions of the facet have been randomly selected from the universe of all possible conditions (e.g., person, item)
FIXED FACET: A facet where the conditions represent all possible conditions (e.g., gender)
TRADITIONAL OR COMMON FACETS: 1) Person: the people in your sample. An effect for this facet means that your people do not all have the same observed score on the instrument. 2) Item: The items or tasks in your measurement device. An effect for this facet means that your items do not all have the same difficulty. 3) Occasion: The times that measurement occurred. An effect for this facet means that the test means differ across occasions. 4) Rater: The human grader or judge, if there is one. An effect for this facet indicates that the raters did not produce the same mean ratings.
KEY CONCEPTS:
-Universe score: In CTT terms, this is basically the true score on the universe of admissible test items (note that there may be many universes)
-Variance component: the partitioned variance estimate for a given facet, which can be calculated from the mean squares of an ANOVA of the study data
-Relative decision: A situation in which the relative standing of the people or objects is the basis for using a test (e.g., taking the top job candidates, selecting the most at-risk kids)
-Absolute decision: A situation in which a cut-score is used to classify people or objects
-Generalizability coefficient: Analogous to (sometimes exactly the same as) a reliability coefficient (the universe score variance over the observed score variance), but because it is calculated with reference to a universe, it is interpreted as the degree to which scores on a measure may be generalized to that universe.
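A sketch of how the variance components and coefficients above might be estimated for a fully crossed person x item design, assuming Python/numpy and simulated (made-up) scores. The expected-mean-square algebra here is for the random-effects p x i design; other designs require different equations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fully crossed person x item (p x i) design; scores are made up.
n_p, n_i = 50, 10
person = rng.normal(0, 1.0, (n_p, 1))                    # person effects
item = rng.normal(0, 0.5, (1, n_i))                      # item effects
X = 5 + person + item + rng.normal(0, 0.8, (n_p, n_i))   # observed scores

grand = X.mean()
ss_p = n_i * np.sum((X.mean(axis=1) - grand) ** 2)
ss_i = n_p * np.sum((X.mean(axis=0) - grand) ** 2)
ss_res = np.sum((X - grand) ** 2) - ss_p - ss_i
ms_p = ss_p / (n_p - 1)                                  # persons mean square
ms_i = ss_i / (n_i - 1)                                  # items mean square
ms_res = ss_res / ((n_p - 1) * (n_i - 1))                # p x i interaction + error

# Variance components (expected mean squares for a random-effects p x i design)
var_p = (ms_p - ms_res) / n_i
var_i = (ms_i - ms_res) / n_p
var_res = ms_res

# Generalizability coefficients for a test of n_i items
g_relative = var_p / (var_p + var_res / n_i)              # relative decisions
phi_absolute = var_p / (var_p + (var_i + var_res) / n_i)  # absolute decisions
print(f"var_p={var_p:.2f} var_i={var_i:.2f} var_res={var_res:.2f} "
      f"G={g_relative:.2f} Phi={phi_absolute:.2f}")
```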

METHODS - Describe the computation of loadings in PFA

The factor loadings (also called component loadings in PCA) are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables. (Note the number of variables equals the sum of their variances as the variance of a standardized variable is 1). This is the same as dividing the factor's eigenvalue by the number of variables.
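A small numeric check of the statements above for the PCA case, assuming Python/numpy and a made-up correlation matrix: loadings are eigenvectors scaled by the square roots of the eigenvalues, and the column sum of squared loadings divided by the number of variables equals eigenvalue / number of variables. (PFA would apply the same steps to the reduced correlation matrix.)

```python
import numpy as np

# Hypothetical correlation matrix for four standardized variables (values are made up)
R = np.array([
    [1.00, 0.60, 0.55, 0.10],
    [0.60, 1.00, 0.50, 0.15],
    [0.55, 0.50, 1.00, 0.05],
    [0.10, 0.15, 0.05, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)           # variables (rows) x components (columns)

n_vars = R.shape[0]
pct_from_loadings = (loadings ** 2).sum(axis=0) / n_vars   # sum of squared loadings / p
pct_from_eigvals = eigvals / n_vars                        # eigenvalue / p
print(np.allclose(pct_from_loadings, pct_from_eigvals))    # True: the two are identical
```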

METHODS - Advantage of using CIs and interpretation of CI

*Advantage of using CIs* 1) Provides info on the magnitude of the effect, not just sig. 2) Width provides info on precision 3) Provides a better basis for comparing across studies because it gives info on both magnitude and precision *Interpretation of CI* 1) A CI is developed around a sample result and is expected to include the population value (1-alpha)% of the time in repeated samples taken from the same population.
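A quick simulation of the repeated-sampling interpretation, assuming Python with numpy/scipy and made-up population values: roughly (1-alpha)% of intervals constructed this way should contain the population mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, alpha = 100.0, 15.0, 25, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

covered = 0
n_reps = 10_000
for _ in range(n_reps):
    x = rng.normal(mu, sigma, n)
    half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half_width <= mu <= x.mean() + half_width)

print(f"coverage = {covered / n_reps:.3f}")   # close to 0.95 in repeated sampling
```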

METHODS - Compare / contrast PFA, PCA, and CFA. When should each be used?

*Factor analysis* is a two-part application / technique that 1) reduces the number of variables and 2) detects structure in the relationships between variables, i.e., classifies variables. Therefore, factor analysis is applied as a data reduction or structure detection method. (The term factor analysis was first introduced by Thurstone, 1931.)
1) Principal Components Model: By far the most common form of factor analysis, PCA seeks a linear combination of variables such that the maximum variance is extracted from the variables. It then removes this variance and seeks a second linear combination which explains the maximum proportion of the remaining variance, and so on. This is called the principal axis method and results in orthogonal (uncorrelated) factors. PCA analyzes total (common and unique) variance. For example, component 1 extracts the maximum variance, and then component 2 extracts the maximum of the remaining variance. Results in uncorrelated factors.
2) Principal Factor Model: Also called principal axis factoring (PAF) and common factor analysis, PAF is a form of factor analysis which seeks the least number of factors which can account for the common variance (correlation) of a set of variables, whereas the more common PCA in its full form seeks the set of factors which can account for all the common and unique (specific plus error) variance in a set of variables. PAF/PFA uses a PCA strategy but applies it to a correlation matrix in which the diagonal elements are not 1's, as in PCA, but iteratively derived estimates of the communalities (R-squared of a variable using all factors as predictors). In other words, it only considers common variance and seeks out the least number of factors which can explain the common variance.
3) Exploratory Factor Analysis: No a priori theory of which variables are related to a specific factor. *Associated with theory development. The question in EFA is: What are the underlying processes that could have produced correlations among these variables?* a) Basically, EFA seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory, and one uses factor loadings to intuit the factor structure of the data.
4) Confirmatory Factor Analysis: A priori hypotheses about the relationships of variables with factors. *Associated with theory testing. The question in CFA is: Are the correlations among variables consistent with a hypothesized factor structure?* a) Basically, CFA seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory, and factor analysis is used to see if they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specific subset of indicator variables. A minimum requirement of CFA is that one hypothesizes beforehand the number of factors in the model, but usually the researcher will also posit expectations about which variables will load on which factors (Kim and Mueller, 1978). The researcher seeks to determine, for instance, if measures created to represent a latent variable really belong together. b) One uses PAF rather than PCA as the type of factoring.
This method allows the researcher to examine factor loadings of indicator variables to determine if they load on latent variables (factors) as predicted by the researcher's model.

METHODS - Follow-up Procedures for MANOVA Previous comps question: 1) After obtaining a significant F-test in a MANOVA, what follow-up procedures can be used to more fully explore the pattern of results?

*STEP 1 - SIGNIFICANCE TESTS*
1) *F-test.* The omnibus or overall F test is the first step of the two-step MANOVA process of analysis. The F test appears in the "Tests of Between-Subjects Effects" table of GLM MANOVA output in SPSS and answers the question, "Is the model significant for each dependent?" There will be an F significance level for each dependent. That is, the F test tests the null hypothesis that there is no difference in the means of each DV for the different groups formed by categories of the IVs.
2) *Multivariate tests*, in contrast, answer the question, "Is each effect significant?" That is, where the F tests focus on the dependents, the multivariate tests focus on the independents and their interactions. These tests appear in the "Multivariate Tests" table of SPSS output. The multivariate formula for F is based not only on the sums of squares between and within groups, as in ANOVA, but also on the sum of the crossproducts - that is, it takes covariance into account as well as group means. There are four leading multivariate tests of group differences.
a) *Hotelling's T-square* is the most common, traditional test where there are two groups formed by the IVs. Note one may see the related statistic, Hotelling's Trace. To convert from the Trace coefficient to the T-square coefficient, multiply the trace coefficient by (N-g), where N is the sample size across groups and g is the number of groups. The T-square result will still have the same F value, degrees of freedom, and significance level as the Trace coefficient.
b) *Wilks' lambda (U).* This is the most common, traditional test where there are MORE than two groups formed by the IVs. Wilks' lambda is a multivariate F test, akin to the F test in univariate ANOVA. It is a measure of the difference between groups in the centroid (vector) of means on the DVs. The smaller the lambda, the greater the differences. The Bartlett's V transformation of lambda is then used to compute the significance of lambda. Wilks' lambda is used, in conjunction with Bartlett's V, as a multivariate significance test of mean differences in MANOVA, for the case of multiple interval dependents and multiple (>2) groups formed by the independent(s). The t-test, Hotelling's T, and the F test are special cases of Wilks' lambda.
c) *Pillai-Bartlett trace, V.* Multiple discriminant analysis (MDA) is the part of MANOVA where canonical roots are calculated. Each significant root is a dimension on which the vector of group means is differentiated. The Pillai-Bartlett trace is the sum of explained variances on the discriminant variates, which are the variables computed based on the canonical coefficients for a given root. Olson (1976) found V to be the most robust of the four tests, and it is sometimes preferred for this reason.
d) *Roy's greatest characteristic root (GCR)* is similar to the Pillai-Bartlett trace but is based only on the first (and hence, most important) root. Specifically, let lambda be the largest eigenvalue; then GCR = lambda/(1+lambda). Note that Roy's largest root is sometimes also equated with the largest eigenvalue, as in SPSS's GLM procedure (however, SPSS reports GCR for MANOVA). GCR is less robust than the other tests in the face of violations of the assumption of multivariate normality.
*STEP 2 - POST-HOC TESTS* *The second step in MANOVA is that if the overall F-test shows the centroid (vector) of means of the DVs is not the same for all the groups formed by the categories of the IVs, post-hoc univariate F-tests of group differences are used to determine just which group means differ significantly from others.* This helps specify the exact nature of the overall effect determined by the F-test. Pairwise multiple comparison tests test each pair of groups to identify similarities and differences. Multiple comparison procedures and post hoc tests are discussed more extensively in the corresponding section under ANOVA.
1) Bonferroni adjustment. When there are many dependents, some univariate tests might be significant due to chance alone. That is, the nominal .05 level is not the actual alpha level. Researchers may adjust the nominal alpha level. Actual alpha = 1 - (1-alpha1)(1-alpha2)...(1-alphan), where alpha1 to alphan are the nominal levels of alpha for a series of post hoc tests. For instance, for a series of 4 tests at the nominal alpha level of .01, the actual alpha would be estimated to be 1-.99^4 = .039. One wants an actual adjusted alpha level of no more than .05 (the computation is sketched after this card). a) If the Bonferroni is requested, SPSS will print out a table of "Multiple Comparisons" giving the mean difference in the dependent variable between any two groups (e.g., differences in test scores for any two educational groups). The significance of this difference is also printed, and an asterisk is printed next to differences significant at the .05 level or better. The Bonferroni method is preferred when the number of groups is small.
2) Tukey test. If the Tukey test is requested, SPSS will produce a similar table which is interpreted in the same way. The Tukey method is preferred when the number of groups is large.
3) Other tests. Methods for when the assumption of homogeneity of variances is not met: SPSS provides these alternative methods, not shown here: Games-Howell, Tamhane's T2, Dunnett's T3, and Dunnett's C.
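A tiny sketch of the familywise-alpha arithmetic referenced in the Bonferroni item, in Python; the numbers other than the .01/.05 examples are just illustrations.

```python
# Familywise (actual) alpha for k independent tests each run at a nominal alpha,
# and the Bonferroni-adjusted per-test alpha that keeps the familywise rate near .05.
def actual_alpha(nominal_alpha: float, k: int) -> float:
    return 1 - (1 - nominal_alpha) ** k

print(round(actual_alpha(0.01, 4), 3))   # 0.039, the example in the notes
print(round(actual_alpha(0.05, 4), 3))   # 0.185 if each of 4 tests is run at .05
print(0.05 / 4)                          # 0.0125 Bonferroni per-test alpha for 4 tests
```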

PSYCHOMETRICS - Computer Adaptive Testing (CAT) Lecture Notes Previous comps question: "What are some weaknesses of traditional ("Fixed form") assessments?"

1) Any given person must typically answer many items where the probability of responding correctly is either very high or very low; such items are uninformative - literally a waste of the test-taker's time 2) Tests almost invariably have bell-shaped information functions that provide good measurement for some test-takers while providing very poor measurement for others 3) Computer-based testing (CBT) may solve some problems, but it cannot solve these problems that are endemic to all fixed-form assessments 4) CAT has the potential to solve both of these problems: each test is individualized for the person, so they do not have to answer many uninformative items, and all tests are (potentially) equally informative because the tests are tailored to the individual

METHODS - Analyzing Curvilinear Data Previous comps question: "

** 1) a) b) c) 2) a) b) c) d) i) ii)

METHODS - Factor Analysis

*Factor analysis* is a two-part application / technique that 1) reduces the number of variables and 2) detects structure in the relationships between variables, i.e., classifies variables. Therefore, factor analysis is applied as a data reduction or structure detection method. (The term factor analysis was first introduced by Thurstone, 1931.)

METHODS - What is the mathematical model and computation procedure of PFA?

Steps in running a factor analysis (PFA): 1) Start with an initial estimate of h^2 (the communalities) 2) Compute R* (the reduced correlation matrix) with the initial communalities on the diagonal and extract the factors (eigenvalues/eigenvectors) and their loadings 3) Next, use the loadings to recompute the h^2 (h-squared) estimates 4) Does fit improve? If yes, repeat steps 2-4: compute a new R* with the new h^2 estimates, etc. If fit does not improve, stop. 5) Keep repeating until you reach convergence; when fit no longer improves, that is the final solution.
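A sketch of these steps, assuming Python/numpy, squared multiple correlations as the initial h^2 estimates, and a made-up 3x3 correlation matrix; real implementations differ in details (starting values, convergence rules, handling of negative eigenvalues).

```python
import numpy as np

def principal_axis_factoring(R, n_factors, max_iter=100, tol=1e-6):
    """Iterative PFA/PAF on a correlation matrix R (a sketch of the steps above)."""
    # Step 1: initial communality estimates (squared multiple correlations)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        # Step 2: reduced correlation matrix R* with communalities on the diagonal
        R_star = R.copy()
        np.fill_diagonal(R_star, h2)
        eigvals, eigvecs = np.linalg.eigh(R_star)
        order = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))
        # Step 3: recompute communalities from the loadings
        h2_new = (loadings ** 2).sum(axis=1)
        # Steps 4-5: stop when the estimates converge (fit no longer improves)
        if np.max(np.abs(h2_new - h2)) < tol:
            h2 = h2_new
            break
        h2 = h2_new
    return loadings, h2

R = np.array([[1.0, 0.6, 0.5], [0.6, 1.0, 0.4], [0.5, 0.4, 1.0]])  # made-up example
L, h2 = principal_axis_factoring(R, n_factors=1)
print(np.round(L, 3), np.round(h2, 3))
```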

METHODS - MANOVA vs. Discriminant Analysis Previous comps question: "Define and distinguish between MANOVA and DA (Discriminant Analysis)."

Tabachnik and Fidell (2007): The goal of DA is to predict group membership from a set of predictors. E.g., can a differential diagnosis among a group of nondisabled children, a group of kids with learning disability, and a group with emotional disorder be made reliably from a set of psychological test scores? So, the IVs are the set of test scores, and the DV is the diagnosis - i.e., 3 groups. DA is MANOVA turned around. In MANOVA, we ask whether group membership is associated with statistically significant mean differences on a combination of DVs.
1) In MANOVA, the IVs are the groups and the DVs are the predictors
2) In DA, the IVs are the predictors and the DVs are the groups
*Mathematically, MANOVA and DA are the same, although the emphases often differ* 1) The major question in MANOVA is whether group membership is associated with statistically significant mean differences in combined DV scores. In DA, the question is whether predictors can be combined to predict group membership reliably. In many cases, DA is carried to the point of actually putting cases into groups, in a process called CLASSIFICATION.
Differences between MANOVA and DA: 1) Classification is a major extension of DA over MANOVA. 2) Interpretation of differences among the predictors. a) In MANOVA, there is frequently an effort to decide which DVs are associated with group differences, but rarely an effort to interpret the pattern of differences among the DVs as a whole. b) In DA, there is often an effort to interpret the pattern of differences among the predictors as a whole in an attempt to understand the dimensions along which the groups differ.
Kinds of research questions in DA: 1) In DA, the primary goals are to find the dimension or dimensions along which groups differ and to find classification functions to predict group membership. The degree to which these goals are met depends on the choice of predictors. Typically, the choice is made either on the basis of theory about which variables should provide information on group membership, or on the basis of pragmatic concerns such as expense, convenience, or unobtrusiveness.
Significance of prediction: 1) Can group membership be predicted reliably from a set of predictors? a) This question is identical to the question about "main effects of IVs" for a one-way MANOVA. 2) Number of significant discriminant functions: a) Along how many dimensions do groups differ reliably? 3) Dimensions of discrimination: a) How can the dimensions along which groups are separated be interpreted? Where are groups located along the discriminant functions, and how do predictors correlate with the discriminant functions? 4) Classification functions: a) What linear equation(s) can be used to classify new cases into groups? 5) Adequacy of classification: a) Given classification functions, what proportion of cases is correctly classified? When errors occur, which cases are misclassified? 6) Effect size: a) What is the degree of relationship between group membership and the set of predictors? 7) Importance of predictor variables: a) Which predictors are most important in predicting group membership? Questions about the importance of predictors are analogous to those about the importance of DVs in MANOVA.
Kinds of research questions in MANOVA: 1) The goal of research using MANOVA is to discover whether behavior, as reflected by the DVs, is changed by manipulation of the IVs. a) Main effects of IVs: Holding all else constant, are mean differences in the composite DV among groups at different levels of an IV larger than expected by chance? b) Interactions among IVs: Holding all else constant, does change in the DV over levels of one IV depend on the level of another IV? c) Importance of DVs: If there are significant differences for one or more of the main effects or interactions, the researcher usually asks which of the DVs are changed and which are unaffected by the IVs.

METHODS - DISCRIMINANT ANALYSIS Previous comps question: "Discuss the research questions addressed by DA, the underlying statistical model, significance testing and follow-up procedures used with each method."

DA: The fundamental equation, the underlying statistical model, significance testing, and follow-up procedures: *A DA will often produce multiple solutions (multiple eigenvalues). How would you determine the number of solutions to interpret? How does the existence of multiple solutions influence the interpretation of the result?*
1) The fundamental equations for testing the sig. of a set of discriminant functions are the same as for MANOVA. Variance in the set of predictors is partitioned into two sources: variance attributable to differences between groups and variance attributable to differences within groups. Cross-products matrices are formed: S_total = S_bg + S_wg. The total cross-products matrix (S_total) is partitioned into a cross-products matrix associated with differences between groups (S_bg) and a cross-products matrix of differences within groups (S_wg). Determinants for these matrices are computed. Then you do an F-test. If the obtained F exceeds the critical F, we conclude that the groups can be distinguished on the basis of the combination of predictors. This is a test of the overall relationship between groups and predictors. It is the same as the overall test of a main effect in MANOVA. In MANOVA, this result is followed by an assessment of the importance of the various DVs to the main effect. In DA, however, when an overall relationship is found between groups and predictors, the next step is to examine the discriminant functions that compose the overall relationship. The maximum number of discriminant functions is either 1) the number of predictors or 2) the degrees of freedom for groups, whichever is smaller. Discriminant functions are like regression equations; a discriminant function score for a case is predicted from the sum of a series of predictors, each weighted by a coefficient. There is one set of discriminant function coefficients for the first discriminant function, a second set of coefficients for the second discriminant function, and so forth. Subjects get separate discriminant function scores for each discriminant function when their own scores on the predictors are inserted into the equations. To solve for the standardized discriminant function score for the ith function, the following equation is used: Di = di1z1 + di2z2 + ... + dipzp. A standardized score on the ith discriminant function (Di) is found by multiplying the standardized score on each predictor (z) by its standardized discriminant function coefficient (di) and then adding the products over all predictors. Discriminant function coefficients are found in the same manner as are coefficients for canonical variates. DA is basically a problem in canonical correlation with group membership on one side of the equation and predictors on the other, where successive canonical variates are computed. In DA, the di are chosen to maximize differences between groups relative to differences within groups. A canonical correlation is found for each discriminant function. Canonical correlations are found by solving for the eigenvalues and eigenvectors of a correlation matrix. An eigenvalue is a form of a squared canonical correlation which, as is usual for squared correlation coefficients, represents overlapping variance among variables, in this case between predictors and groups. Successive discriminant functions are evaluated for significance. If there are only two groups, discriminant function scores can be used to classify cases into groups: a case is classified into one group if its Di score is above zero and into the other group if the Di score is below zero. With numerous groups, use the classification procedure described below:
1) To assign cases to groups, a classification equation is developed for each group. Data for each case are inserted into each classification equation to develop a classification score for each group for the case. The case is assigned to the group for which it has the highest classification score. In its simplest form, the basic classification equation for the jth group is: Cj = cj0 + cj1x1 + cj2x2 + ... + cjpxp. A score on the classification function for group j (Cj) is found by multiplying the raw score on each predictor (x) by its associated classification function coefficient (cj), summing over all predictors, and adding a constant, cj0. Classification procedures and classification accuracy (CHOOSING NOT TO STUDY THIS...NEVER EVEN HEARD OF IT AND NOT ON COMPS FROM PAST 5 YEARS)
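A brief sketch of discriminant scores and classification, assuming Python with scikit-learn's LinearDiscriminantAnalysis and made-up data (4 test-score predictors, 3 diagnostic groups). Note the number of functions returned is min(number of predictors, groups - 1), consistent with the rule above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)

# Hypothetical data: 4 test scores (predictors) for 3 diagnostic groups (values made up)
means = ([0, 0, 0, 0], [1, 0.5, 0, 0], [2, 1, 0.5, 0])
X = np.vstack([rng.normal(m, 1.0, (30, 4)) for m in means])
y = np.repeat([0, 1, 2], 30)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)       # discriminant function scores
pred = lda.predict(X)                     # classification into groups

print(scores.shape)                              # (90, 2): min(4 predictors, 3 - 1 groups) functions
print("hit rate:", np.mean(pred == y).round(2))  # proportion correctly classified
```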

METHODS - Logistic Regression Previous comps question: "Describe the logistic regression model" "How does log regression overcome the limitation of OLS regression when the outcome variable is dichotomous?"

*"Describe the logistic regression model"* Binomial (or binary) logistic regression is a form of regression which is used when the DV is a dichotomy and the IVs are of any type. Multinomial log regression exists to handle the case of DVs with more classes than two. When multiple classes of the DV can be ranked, then ordinal log regression is preferred to multinomial log regression. Continuous variables are not used as DVs in log regression. Unlike logit regression, there can be only one DV. Logistic regression can be used to predict a DV on the basis of continuous and/or categorial IVs and to determine the percent of variance in the DV explained by the IVs; to rank the relative importance of IVs; to assess interaction effects; and to understand the impact of covariate control variables. Logistic regression applies maximum likelihood estimation after transforming the DV into a logit variable (the natural log of the odds of the DV occurring or not). In this way, log regression estimates the probability of a certain event occurring. Note that logistic regression calculates changes in the log odds of the DV, not changes in the DV itself as OLS regression does. *"How does log regression overcome the limitation of OLS regression when the outcome variable is dichotomous?"* Logistic regression has many analogies to OLS regression: logit coefficients correspond to b coefficients in the logistic regression equation, the standardized logit coefficients correspon to beta weights, and a pseudo R-squared statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the IVs and the DV, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, requre that observations are independent and that the IVs be linearly related to the logit of the DV. The success of the logistic regression can be assessed by looking at the classification table, showing correct and incorect classifications of the dichotomous, ordinal, or polytomous DV. Also, goodness of fit tests such as model chi-square are available as indicators of model appropraiteness as is the Wald statistic to test the sig. of individual IVs. *"Why is logistic regression popular?" In part, because it enables the researcher to overcome many of the restrictive assumptions of OLS regression. 1) Logistic regression does not assume a linear relationship between the DVs and the IVs. It may handle nonlinear effects even when exponential and polynomial terms are not explicitly added as additional IVs because the logit link function on the left-hand side of the logistic regression equation is non-linear. However, it is also possible and permitted to add explicit interaction and power terms as variables on the right-hand side of the logistic equation as in OLS regression 2) The DV need not be normally distributed (but does assume its distribution is within the range of the exponential family of distributions, such as normal, Poisson, binomial gamma). 3) The DV need not be homoscedastic for each level of the IVs; that is, there is no homogeneity of variance assumption: variances need not be the same within categories. 4) Normally distributed error terms are not assumed. 5) Logistic regression does not require that the IVs be interval. 6) Logistic regression does not require that the IVs be unbounded. 
*Traditional regression cannot be used for a prediction where the outcome variables is dichotomous" a) Outcome is bounded. The outcome variable has an upper and lower limit (1 and 0) and therefore the traditional regression will not model the true relationship. This means, that the linear relationship represented by the traditional regression equation is not appropriate. Linear model will give predictions outside the range and underestimate the relationship. Linear meaning a line, non-linear meaning an s-curve. i) *Linear models also underestimate the strength of the relationship; data lie nicely on s curve and when you draw straight line through the data, this gives impression that relationship is not strong b) Errors are not normally distributed. Errors will be bunched at the extremes. In the middle there might be normal distribution, but at the limits will not be. c) Heteroscedasticity. There are different variances at different levels of the predictor. Variance of proportion depends on proportion itself; var P = P(1-P)/N. The variance is greatest when proportion = .5 and gets smaller toward the extremes. This violation causes inaccuracy in the sig testing and Is of the traditional regression analysis. *Logistic regression solves these problems by:* A) Transforming the outcome variable. 1) Transforming the outcome variable using a nonlinear transformation. The transformed outcome is called a logit. Logit = ln(p/1-p) 2) Regression analysis is then based on the logit; each person has a logit 3) After the analysis is complete, the logti can then be transformed back into a proportion. P=(1/1-e^-logit) Create linear model by predicting logit; the regression model woud be logit'=b0+b1x... Using the logit lets us use a linear model. Logit make a dichotomous variable which requires a non-linear model into an interval variable which requires a linear model. The way logit transforms a dichotomy into a continuous variable is by the utilization of natural logarithms. As seen earlier, logit = ln(p/1-p); ln is a natural logarithm. ln takes the interval between 0 and 1 and stretches it to infinity. This solves for problems of dichotomies used within the traditional regression model. a) logit is actually defined in terms of odds, which are closely related to probabilities. b) odds and probabilities are correlated, however magnitude of the r-ships shown by odds and probabilities are not the same. c) In logistic regression, our outcome variables are actually the ln of the odds; logit = ln ((odds(y)) = P(y)/1-P(y) d) Distribution of odds is non=symmetrical, using ln(odds) makes distribution symmetrical - that's what is useful about using logit *IN CONCLUSION, DICHOTOMOUS OUTCOME VARIABLES DEFINED IN TERMS OF PROBABILITIES OF BEING IN A CERTAIN CATEGORY IS TRANSFORMED INTO A LOGIT, A LOGIT CAN THEN BE PREDICTED BY A LINEAR MODEL. LOGIT CAN THEN BE TRANSFORMED BACK INTO PROBABILITIES. LOGIT HAS A NON-LINEAR RELATIONSHIP WITH OUTCOME VARIABLE* B) Using the maximum likelihood estimation 1) Logit doesn't use OLS principle, but uses estimates based on max likelihood estimation 2) Both ways can be used within the traditional analysis to estimate beta weights but MLE is a more general approach, therefore it is more flexible and can be used for data that does not fit regular patterns. LOGISTIC REGRESSION MODEL Logit'(y)=b0+b1X... b0 is the predicted logit when X-0 b1 is the relationship between X and predicted logit. The slope of the relationship, magnitude of the relationship between X and predicted logit. 
Logit and probability of y is positively correlated. IF b1 is positive, then as X increases, probability of Y increases IF b1 is negative, then as X increases, probability of Y decreases IF b1 is significant, then X significantly predicts probability of Y Logit is the transformed probabilities of the dichotomous outcome, which is used within the model as the predicted value in order to use a linear relationship model during the prediction. Sign and significance carries from logit to probabilities but strengths is uncertain. When transforming logit into probabilities get S shapes forming. Linear relationship predicts logit, and as a result we get a non-linear model of the data. WITH THIS YOU CAN DO ANYTHING YOU CAN DO WITH A TRADITIONAL REGRESSION: HIERARCHICAL, INTERACTIONS, ETC. *How to evaluate the results of the model* TESTING LOGISTIC REGRESSION...build hierarchical logistic regression a) Enter sets of predictors into regression to see how much more prediction w/ each variable b) Sequentially build models c) Step 0 - logit' = b0 d) Step 1 - logit = b0+b1x1 At each step calculate a measure of lack of fit. How badly are we doing. *2log likelhiood (-2LL) this is a measure of where you predicted you would be and where you are. To evaluate whether model is sig., we use likelihood ration chi-square test *X^2=(b/SE(b)) df = 1; b = regression coefficient Another way to evaluate is to test specific predictors. Wald test to see whether an individual predictor is adding signiticance to prediction.ii *X^2=(b/SE(b)) df=1; b = regression coefficient
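A sketch of the hierarchical testing logic above, assuming Python with numpy and statsmodels and simulated (made-up) data: fit an intercept-only model and a one-predictor model, take the drop in -2LL as the likelihood ratio chi-square, and square b/SE(b) for the Wald test.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Hypothetical data: one continuous predictor and a dichotomous outcome (values made up)
n = 300
x = rng.normal(0, 1, n)
true_logit = -0.5 + 1.2 * x
p = 1 / (1 + np.exp(-true_logit))            # logit transformed back into a probability
y = rng.binomial(1, p)

X0 = np.ones((n, 1))                         # Step 0: intercept-only model, logit' = b0
X1 = sm.add_constant(x)                      # Step 1: logit' = b0 + b1*X

m0 = sm.Logit(y, X0).fit(disp=False)
m1 = sm.Logit(y, X1).fit(disp=False)

lr_chi2 = 2 * (m1.llf - m0.llf)              # likelihood ratio test: drop in -2LL, df = 1
wald_chi2 = (m1.params[1] / m1.bse[1]) ** 2  # Wald test for the predictor, df = 1
print(f"LR chi2 = {lr_chi2:.2f}, Wald chi2 = {wald_chi2:.2f}")
print("predicted probabilities:", np.round(m1.predict(X1)[:5], 2))
```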

METHODS - Effect Size and Statistical Significance Previous comps question: 1) Explain the difference between effect size and statistical significance as methods for describing research results. What unique information does each provide? What measures of effect size are appropriate in situations where one would use a t-test, ANOVA, multiple regression, and MANOVA?

*Effect size (d)* is a measure of the degree to which mu1 and mu2 differ in terms of the SD of the parent population. Effect size is a standardized, scale-free measure of the relative size of the effect of an intervention. It is particularly useful for quantifying effects measured on unfamiliar or arbitrary scales and for comparing the relative sizes of effects from different studies.
*What is the relationship between "effect size" and "significance"?* 1) Effect size quantifies the size of the difference between two groups and may therefore be said to be a true measure of the significance of the difference. 2) Statistical significance is usually calculated as a 'p-value': the probability that a difference of at least the same size would have arisen by chance, even if there really were no difference between the two populations. 3) There are a number of problems with using 'significance tests' in this way (see, for example, Cohen, 1994; Harlow et al., 1997; Thompson, 1999). The main one is that the p-value depends essentially on two things: the size of the effect and the size of the sample. One would get a 'significant' result either if the effect were very big (despite having only a small sample) or if the sample were very big (even if the actual effect size were tiny). 4) It is important to know the statistical significance of a result, since without it there is a danger of drawing firm conclusions from studies where the sample is too small to justify such confidence. However, stat. sig. does not tell you the most important thing: THE SIZE OF THE EFFECT. One way to overcome this confusion is to report the effect size, together with an estimate of its likely 'margin for error' or 'confidence interval.'
Advantage of effect size: ***One of the main advantages of using effect size is that when a particular experiment has been replicated, the different effect size estimates from each study can easily be combined to give an overall best estimate of the size of the effect. This process of synthesizing experimental results into a single effect size estimate is known as meta-analysis.
t-test effect size measures: 1) Cohen's d. The d coefficient, perhaps the most popular effect size measure for the t-test AND ANOVA, is computed as a function of differences in subgroup means by effect category.
ANOVA effect size measures: ***Effect size coefficients based on standardized mean differences.*** 1) Cohen's d. a) Computation. The group difference in means (e.g., the means of the tx and control groups) is divided by the pooled SD (the SD of the unstandardized data for all the cases, for all groups; put another way, the sample-size weighted average of the SDs for all groups) to provide a coefficient which may be used to compare group effects. In a two-variable analysis, d is the difference in group means (on y, the dependent) divided by the pooled SD of y. Computation of d becomes more complex for other ANOVA designs - Cortina and Nouri (2000) give formulas for n-way, factorial, ANCOVA, and repeated measures designs. In an ANOVA table, the effect size normally is placed at the bottom of each effect column. b) Equivalency. Cohen (1988) notes that correlation r = d/[(d^2+4)^.5]. Cortina and Nouri (2000) provide formulas for conversion of p values, F values, and t values to d. c) Interpretation. The larger the d (which may exceed 1.0), the larger the tx effect or effect of a factor. When Cohen's d is 0.0, 50% of the control cases are at or below the mean of the tx group (or above if the tx effect is negative), and there is 100% overlap in the values of the tx and control groups.
*Effect size coefficients based on % of variance explained* 1) Eta-squared, also called the "correlation ratio" or the "coefficient of nonlinear correlation," is one of the most popular effect size measures for ANOVA. It is the % of total variance in the dependent variable accounted for by the variance between categories (groups) formed by the IVs. a) Computation. Eta-squared is thus the ratio of the between-groups sum of squares to the total sum of squares. The between-groups sum of squares measures the effect of the grouping variable (that is, the extent to which the means differ between groups). In SPSS, select Analyze, Compare Means, Means; click Options; select "ANOVA table and eta." b) Interpretation. It can be said that eta-squared is the % by which prediction is improved by knowing the grouping variable(s) when the dependent is measured in terms of the square of the prediction error. *Eta-squared is analogous to R-squared in regression analysis.* When there are curvilinear relations of the factor with the dependent, eta-squared will be higher than the corresponding coefficient of multiple correlation (R-squared).
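A short sketch computing the two effect sizes described above, assuming Python/numpy and made-up group data: Cohen's d from the pooled SD, and eta-squared as SS-between over SS-total for a one-way layout.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical two-group data (treatment vs. control, values made up)
tx = rng.normal(105, 15, 40)
ctrl = rng.normal(100, 15, 40)

# Cohen's d: mean difference divided by the pooled SD
n1, n2 = len(tx), len(ctrl)
pooled_sd = np.sqrt(((n1 - 1) * tx.var(ddof=1) + (n2 - 1) * ctrl.var(ddof=1)) / (n1 + n2 - 2))
d = (tx.mean() - ctrl.mean()) / pooled_sd

# Eta-squared from a one-way layout: SS_between / SS_total
groups = [tx, ctrl, rng.normal(110, 15, 40)]        # add a third group for illustration
grand = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_total = sum(((g - grand) ** 2).sum() for g in groups)
eta_sq = ss_between / ss_total

print(f"d = {d:.2f}, eta-squared = {eta_sq:.2f}")
```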

METHODS - Hierarchical Moderator Regression Previous comps question: "A researcher believes that conscientiousness and general cognitive ability interact in the prediction of job performance ratings. How would you test for the interaction between these two continuous predictors using multiple regression? How does the interpretation of a significant interaction differ from the interpretation of the main effects of the two predictors? When a significant interaction is found, why is it inappropriate to interpret the main effects? What follow-up analyses should be conducted to further investigate a significant interaction?"

*MMR - Moderated Multiple Regression* 1) Create the product vector 2) Hierarchical regression 3) Step 1: Main effects 4) Step 2: All 2-way interactions 5) Step 3: All 3-way interactions 6) Build the model going from simple to complex; interpret going from complex to simple. If you have a sig. interaction, you should not interpret the lower-order effects. Why not? a) They are statistically unstable and not invariant to linear transformations; the results depend completely on the scaling of the variables (if you rescale a variable, it will dramatically change the results, which is conceptually not useful). Interaction effects are sometimes called moderator effects because the interacting third variable, which changes the relation between the two original variables, is a moderator variable which moderates the original relationship. E.g., the relation between income and conservatism may be moderated depending on the level of education. 1) Interaction terms may be added to the model to incorporate the joint effect of two variables (ex., income and education) on a DV (ex., conservatism) over and above their separate effects. One adds interaction terms to the model as crossproducts of the standardized IVs and/or dummy IVs, typically placing them after the simple main-effect IVs. Crossproduct interaction terms may be highly correlated (multicollinear) with the corresponding simple IVs in the regression equation, creating problems with assessing the relative importance of main effects and interaction effects. NOTE: Because of possible multicollinearity, it may well be desirable to use centered variables (where one has subtracted the mean from each datum) - a transformation which often reduces multicollinearity.
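A sketch of the centered-product-vector approach above, assuming Python with numpy and statsmodels and simulated (made-up) data; the F test on the R-squared change when the interaction term is added is the usual test of moderation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical data: conscientiousness (c), cognitive ability (g), performance (y); values made up
n = 400
c = rng.normal(0, 1, n)
g = rng.normal(0, 1, n)
y = 0.3 * c + 0.4 * g + 0.2 * c * g + rng.normal(0, 1, n)

# Center the predictors, then create the product vector
c_c, g_c = c - c.mean(), g - g.mean()
prod = c_c * g_c

X_main = sm.add_constant(np.column_stack([c_c, g_c]))          # Step 1: main effects
X_int = sm.add_constant(np.column_stack([c_c, g_c, prod]))     # Step 2: add the 2-way interaction

m1 = sm.OLS(y, X_main).fit()
m2 = sm.OLS(y, X_int).fit()

f_stat, p_value, df_diff = m2.compare_f_test(m1)               # test of the R-squared change
print(f"R2 step 1 = {m1.rsquared:.3f}, R2 step 2 = {m2.rsquared:.3f}, "
      f"F = {f_stat:.2f} (df diff = {int(df_diff)}), p = {p_value:.4f}")
```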

PSYCHOMETRICS - Definitions of common psychometric terms Previous comps questions: 1. Define "Item Characteristics Curve" aka ICC 2. Define "Item Information Function" aka IIF 3. Define "Test Information Function" aka TIF

1. Graphically displays the relationship between ability and the probability of getting an item correct. Also called the Item Response Function (so ICC = IRF). 2. Displays the contribution an item makes to ability estimation at each point along the ability continuum. It is defined for each and every item, and its value varies with the examinee's ability level. - Information fxn for an item (1PL): Ii(theta) = Pi*Qi, where: - Ii = item information, which varies with ability - Pi = probability of getting the item right - Qi = 1 - Pi (probability of getting the item wrong) - Pi + Qi = 1 3. Equal to the sum of the item information functions (an additive property of item information); it is greater the more items you have. - I(theta) = sum of the Ii(theta) - The more items, the higher the TIF - SEM(theta) = 1/sqrt(I(theta))...note, the SEM varies by person: the more items, the higher the TIF, and the lower the SEM, because the TIF works like reliability (the more items, the higher the TIF, the higher the reliability, the lower the SEM). ***If you maximize each IIF, you maximize the TIF and minimize the SEM (precision of estimate)......in the 2PL model, the higher the a, the higher the IIF (Ii = a^2*Pi*Qi). A short numeric sketch follows.
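A minimal numeric sketch of these formulas, assuming NumPy; the a and b parameters are made-up values.

```python
# Minimal sketch of item/test information and SEM under a 2PL model.
# The a and b parameters below are made-up values for illustration.
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info_2pl(theta, a, b):
    """Item information: a^2 * P * Q (reduces to P*Q when a = 1, the 1PL case)."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

a_params = np.array([1.2, 0.8, 1.5, 1.0])
b_params = np.array([-1.0, 0.0, 0.5, 1.5])

theta_grid = np.linspace(-3, 3, 121)
iif = np.array([item_info_2pl(theta_grid, a, b) for a, b in zip(a_params, b_params)])
tif = iif.sum(axis=0)          # test information = sum of item informations
sem = 1.0 / np.sqrt(tif)       # SEM(theta) = 1 / sqrt(TIF(theta)), so it varies with theta

print(sem.min(), sem.max())    # more information -> smaller SEM
```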

PSYCHOMETRICS - c parameter What are some practical ways to reduce the guessing (c) parameter? (name 7)

It is easy to inadvertently include clues in your test items that point to the correct answer, help rule out incorrect alternatives or narrow the choices. Any such clue would decrease your ability to distinguish students who know the material from those who do not, thus, providing rival explanations. Below are some common clues students use to increase their chance of guessing and some advice on how to avoid such clues. 1) Keep the grammar consistent between stem and alternatives 2) Avoid including an alternative that is significantly longer than the rest 3) Make all distractors plausible 4) Avoid giving too many clues in your alternatives 5) Do not test students on material that is already well-learned prior to your instruction 6) Limit the use of "all of the above" or "none of the above" 7) Include 3 or 4 alternatives for MC items - Obvs, if you have only 2 alternatives then the chance for guessing increases significantly as there will be a 50% chance of getting the item correct just by guessing. If you include 5+ alternatives then the item becomes increasingly confusing or requires too much processing or cognitive load. Additionally, as the # of distractors increases, the likelihood of including a bad distractor significantly increases. Thus, research finds that providing 3-4 alternatives leads to the greatest ability to distinguish between those test-takers who understand the material and those who do not (Haladyna et al 2002; Taylor 2005)

METHODS - Thurstone's Principle of Simple Structure Previous Comps questions: 1) Describe Thurstone's (1947) principle of simple structure, as applied in factor analysis. Why is it important? How is the principle operationalized through factor rotation methods? 2) What are the rotation methods?

T & F (2007) *Thurstone's (1947) principle of simple structure* 1) If simple structure is present (and factors are not too highly correlated), several variables correlate highly with each factor and only one factor correlates highly with each variable. a) In other words, the columns of A, which define factors via variables, have several high and many low values; b) ...while the rows of A (which define variables via factors) have only one high value... c) ...rows with more than one high correlation correspond to variables that are said to be complex because they reflect the influence of more than one factor. It is usually best to avoid complex variables because they make interpretation of the factors more ambiguous. Usually, rotation is necessary to achieve simple structure, if it can be achieved at all. 1) Oblique rotation does lead to simpler structure in most cases, but it is important to note that oblique rotations result in correlated factors, which are more difficult to interpret. Simple structure is only one of several sometimes conflicting goals in factor analysis. ROTATION METHODS: Rotation serves to make the output more understandable and is usually necessary to facilitate the interpretation of factors. The sum of the eigenvalues is not affected by rotation, but rotation will alter the eigenvalues (and % of variance explained) of particular factors and will change the factor loadings. Since alternative rotations may explain the same variance (have the same total eigenvalue) but have different factor loadings, and since factor loadings are used to intuit the meaning of factors, different meanings may be ascribed to the factors depending on the rotation - a problem often cited as a drawback of factor analysis due to the *SUBJECTIVE JUDGMENT REQUIRED*. If factor analysis is to be used, the researcher may wish to experiment with alternative rotation methods to see which leads to the most interpretable factor structure. TYPES *Orthogonal rotations* do not allow factors to be correlated, whereas *oblique rotations*, discussed below, allow the factors to be correlated, and so a factor correlation matrix is generated when an oblique rotation is requested. Normally, however, an orthogonal method such as varimax is selected and no factor correlation matrix is produced, as the correlation of any factor with another is zero. *1) NO ROTATION* is the default in SPSS, but it is a good idea to select a rotation method, usually varimax. The original unrotated principal components solution *maximizes the sum of squared factor loadings, efficiently creating a set of factors which explain as much of the variance in the original variables as possible.* The amount explained is reflected in the sum of the eigenvalues of all factors. However, unrotated solutions are hard to interpret because variables tend to load on multiple factors. *2) VARIMAX ROTATION* is an *orthogonal rotation* of the factor axes that maximizes the variance of the squared loadings of a factor (column) across all the variables (rows) in the factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings on any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option.
*THIS IS THE OPERATIONALIZATION OF THURSTONE'S FIRST PRINCIPLE, and MAXIMIZES INTERPRETABILITY OF FACTORS* *3) QUARTIMAX ROTATION* is an *orthogonal rotation* which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables load to a high or medium degree. Such a factor structure is usually not helpful to the research purpose. *THIS IS THE OPERATIONALIZATION OF THURSTONE'S SECOND PRINCIPLE, and MAXIMIZES INTERPRETABILITY OF VARIABLES* *4) EQUIMAX ROTATION* is an *orthogonal rotation* and a compromise between the varimax and quartimax criteria, but it is less stable. *5) DIRECT OBLIMIN ROTATION* is an *oblique (non-orthogonal) rotation* and the standard method when one wishes to allow the factors to be correlated. This will result in higher eigenvalues but diminished interpretability of the factors. *6) PROMAX ROTATION* is an *oblique (non-orthogonal) rotation* which is computationally faster than the direct oblimin method and therefore is sometimes used for very large datasets. A small varimax sketch follows.
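In practice the rotation is requested in SPSS/SAS/R, but a small sketch of Kaiser's varimax criterion (a standard SVD-based implementation written from the usual published recipe) shows how an orthogonal rotation is applied to a loading matrix; the example loadings are made up.

```python
# Minimal sketch of varimax rotation applied to an unrotated loading matrix.
import numpy as np

def varimax(L, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix L (variables x factors)."""
    p, k = L.shape
    R = np.eye(k)          # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Kaiser's varimax criterion, solved via SVD (gamma = 1)
        B = L.T @ (Lr**3 - (Lr @ np.diag(np.sum(Lr**2, axis=0))) / p)
        U, S, Vt = np.linalg.svd(B)
        R = U @ Vt
        d_new = S.sum()
        if d_new < d * (1 + tol):   # stop when the criterion no longer improves
            break
        d = d_new
    return L @ R

# Made-up unrotated loadings for 6 variables on 2 factors
L = np.array([[0.70, 0.40],
              [0.65, 0.45],
              [0.60, 0.35],
              [0.45, -0.55],
              [0.40, -0.60],
              [0.35, -0.65]])
print(np.round(varimax(L), 2))  # rotated loadings should show each variable loading mainly on one factor
```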

METHODS - Relationship between Correlation Coefficient and Regression Coefficient Previous comps question: "The relationship between a test score and job performance can be represented using either a correlation coefficient or a regression coefficient from a simple regression analysis. Describe the relationship between these two statistics when 1) evaluating the magnitude of the relationship and 2) comparing results across different samples (e.g., minority vs majority group)." "We can use the technique of correlation to test the statistical significance of the association. In other cases we use regression analysis to describe the relationship precisely by means of an equation that has predictive value."

The *regression coefficient, b,* is the average amount the DV increases when the IV increases one unit and the other IVs are held constant. Put another way, the b coefficient is the slope of the regression line: the larger the b, the steeper the slope, and the more the DV changes for each unit change in the IV. The b coefficient is the unstandardized simple regression coefficient in the case of one IV. When there are two or more IVs, the b coefficient is a partial regression coefficient, though it is common simply to call it a "regression coefficient" as well. In SPSS, Analyze, Regression, Linear, click the Statistics button; make sure Estimates is checked to get the b coefficients (the default). *b coefficients compared to partial correlation coefficients.* The b coefficient is a semi-partial coefficient, in contrast to the partial coefficients found in partial correlation. The partial coefficient for a given IV removes the variance explained by control variables from both the IV and the DV, then assesses the remaining correlation. In contrast, a semi-partial coefficient removes the control-variable variance only from the IV, not from the DV. Thus the b coefficients, as semi-partial coefficients, reflect the unique (independent) contribution of each IV to explaining the total variance in the DV. This is discussed further in the section on partial correlation. *Correlation* is a bivariate measure of association (strength) of the relationship between two variables. It varies from 0 (random relationship) to 1 (perfect linear relationship) or -1 (perfect negative linear relationship). It is usually reported in terms of its square (little r-squared), interpreted as percent of variance explained. For instance, if little r-squared is .25, then the IV is said to explain 25% of the variance in the DV. In SPSS, select Analyze, Correlate, Bivariate, check Pearson. There are several common pitfalls in using correlation. Correlation is symmetrical, not providing evidence of which way causation flows. If other variables also cause the DV, then any covariance they share with the given IV in a correlation may be falsely attributed to that IV. Also, to the extent that there is a nonlinear relationship between the two variables being correlated, correlation will understate the relationship. Correlation will also be attenuated to the extent there is measurement error, including use of sub-interval data or artificial truncation of the range of the data. Correlation can also be a misleading average if the relationship varies depending on the value of the IV ("lack of homoscedasticity"). And, of course, atheoretical or post-hoc running of many correlations runs the risk that about 5% of the coefficients will be significant by chance alone. 1) *Correlation* is the covariance of standardized variables - that is, of variables after you make them comparable by subtracting the mean and dividing by the SD. Standardization is built into correlation and need not be requested explicitly in SPSS. Correlation is the ratio of the observed covariance of two standardized variables to the highest possible covariance when their values are arranged in the best possible match by order. When the observed covariance is as high as the possible covariance, the correlation will have a value of 1, indicating perfectly matched order of the two variables.
A value of -1 is perfect negative covariation, matching the highest values of one variable with the lowest values of the other. A correlation of 0 indicates a random relationship by order bt the two variables. 2) Pearson's r: This is the usual measure of correlation, sometimes called product-moment correlation. Pearson's r is a measure of association which varies from -1 to +1, with 0 indicating no relationship (random pairing of values) and 1 indicating a perfect relationship, taking the form "the more the X, the more the Y, and vice versa." A value of -1 is a perfect negative relationship, taking the form "the more the X, the less the Y, and vice versa." The link between r and the simple-regression slope is sketched below.
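A minimal sketch of the algebra linking the two statistics, assuming NumPy; the simulated test-score and performance data are illustrative. With one predictor, b = r(SDy/SDx) and the standardized beta equals r, so the two agree on magnitude once scale is removed; because b depends on the SDs while r does not, the two can diverge across samples (e.g., minority vs majority groups) that differ in predictor or criterion variability, as with range restriction.

```python
# Minimal sketch: with one predictor, b = r * (SD_y / SD_x) and the standardized beta equals r.
# The simulated test-score / performance data are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(7)
test = rng.normal(50, 10, size=500)                     # hypothetical test scores
perf = 0.5 * (test - 50) / 10 + rng.normal(size=500)    # hypothetical performance ratings

r = np.corrcoef(test, perf)[0, 1]
b = np.cov(test, perf, ddof=1)[0, 1] / np.var(test, ddof=1)   # unstandardized slope
beta = b * np.std(test, ddof=1) / np.std(perf, ddof=1)        # standardized slope

print(round(r, 3), round(beta, 3))   # identical in simple regression
print(round(b, 3), round(r * np.std(perf, ddof=1) / np.std(test, ddof=1), 3))  # b recovered from r
```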

PSYCHOMETRICS - DIF Previous comps question: "In 1976, the Golden Rule Insurance Co. sued in the state of Illinois over alleged racial bias in the licensing test for insurance agents because virtually no African-American agents passed the exam, which was quite hard (the passing rate was only 31%). In an out-of-court settlement in 1984, the test developer agreed that any item for which correct answer rates for white and African-American test takers differed by more than 15% would be considered biased. This question has 3 parts (not necessarily of = importance) 1) Define DIF as it is commonly defined in the psychometric literature 2) Define at least one method for assessing DIF 3) Was the Golden Rule settlement consistent or inconsistent with the generally accepted definition of DIF? Why or why not?"

Too much to address - answer if time. DIF is split out in the following items. But...for #3 Nahren said "DIF vs. DTF; it depends whether there is DIF when considering equal thetas, and the Golden Rule seemed more like impact." See the learning objectives for DIF below: 1) Understand the terminology of DIF 2) Be able to define DIF and distinguish it from fairness, bias, and impact 3) Understand the concept of DIF and give some examples; explain at least one model of why DIF occurs 4) Explain and distinguish the following DIF techniques: Mantel-Haenszel chi-square, logistic regression, Lord's chi-square, area measures, DFIT. Explain the pros and cons of each 5) Explain statistical issues associated with DIF: a) What is FWER? How does it relate to DIF? At the alpha=.05 level, what is the chance of not making a Type I error after assessing a 50-item test using 3 different EEOC groupings? Explain in plain English (a quick calculation is sketched below). b) Explain the standard power analyses for common DIF statistics (hint: this is a trick question) c) Why is accepting the null a bad statistical practice? In light of this issue, what does a DIF study show? d) What is the problem of double jeopardy when multiple focal groups exist? How is this dealt with in practice? e) In terms of DIF, what is the distinction between an effect size and a significance test? 6) How are DIF measures generally used in practice? 7) How is DIF a circular measure of bias? 8) How can you have many DIF items and yet not have DTF? 9) Given some situation, use your knowledge of DIF/DTF to provide advice to a team of non-psychometric SMEs
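A quick calculation for objective 5a, assuming (for this illustration only) that the item-level tests are independent:

```python
# Familywise error rate illustration: 50 items x 3 group comparisons, each tested at alpha = .05.
# Assumes independent tests, which is a simplification made only for this sketch.
alpha = 0.05
n_tests = 50 * 3

p_no_type_I_error = (1 - alpha) ** n_tests   # chance of zero false positives if no DIF exists
fwer = 1 - p_no_type_I_error                 # chance of at least one false positive

print(f"P(no Type I errors across {n_tests} tests) = {p_no_type_I_error:.4f}")
print(f"Familywise error rate = {fwer:.4f}")
```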

METHODS - Missing Data Previous comps question (Fall 2017) "Distinguish among the three different types of missing data (Rubin, 1976), including the specific conditions that define each type of missing data mechanism. Discuss the limitations of listwise and pairwise deletion, and describe at least one alternative method for handling missing data that avoids these limitations"

** 3 DIFF TYPES OF MISSING DATA (Rubin, 1976) 1) MCAR (missing completely at random) - Specific conditions: the probability of missingness is unrelated to both the observed data and the values that are missing; the complete cases are effectively a random subsample of the full dataset. 2) MAR (missing at random) - Specific conditions: the probability of missingness may depend on other observed variables in the dataset, but not on the missing values themselves once the observed data are taken into account. 3) MNAR (missing not at random) - Specific conditions: the probability of missingness depends on the missing values themselves (e.g., people with low scores are less likely to report them), even after conditioning on the observed data. LIMITATIONS OF LISTWISE & PAIRWISE DELETION 1) Listwise deletion limitations - discards every case with any missing value, which can drastically reduce sample size and power - estimates are unbiased only under MCAR and can be biased under MAR or MNAR. 2) Pairwise deletion limitations - each correlation/covariance is based on a different subsample (different n's), which complicates significance testing and can yield a covariance matrix that is not positive definite - like listwise deletion, it assumes MCAR for unbiasedness. - Alternative methods for handling missing data that avoid these limitations: full information maximum likelihood (FIML) estimation and multiple imputation, both of which use all available data and give unbiased estimates under MAR.
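A minimal sketch of one model-based alternative, assuming scikit-learn is available; this performs a single stochastic imputation, and repeating it m times with different seeds and pooling the estimates would approximate multiple imputation. The data and missingness pattern are simulated.

```python
# Minimal sketch of model-based imputation (an alternative to listwise/pairwise deletion).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[1, .5, .3], [.5, 1, .4], [.3, .4, 1]], size=200)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan   # punch ~15% holes in the data

# One stochastic imputation; repeating with different random_state values and pooling
# the analysis results would approximate multiple imputation.
imputer = IterativeImputer(sample_posterior=True, random_state=1)
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_missing).sum(), np.isnan(X_imputed).sum())  # missing cells before vs after
```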

PSYCHOMETRICS - Deep Dive into IIF (Item Information Fxns)

**In IRT, each item has an information fxn, as does the entire test. An IIF indicates the amount of information an item contributes along the ability continuum. *In the 1PL & 2PL models, the amount of info an item provides is maximized at the item difficulty parameter; items whose difficulty is matched to examinee ability are maximally informative. ***In the 3PL model, the maximum info occurs at a trait level slightly above the item difficulty parameter (at theta = b + (1/a)*ln[(1 + sqrt(1 + 8c))/2]; Lord, 1980). The effect of guessing lowers the amount of psychometric info an item provides. ****All else being equal, a 1PL or 2PL item will be more informative than a 3PL item. ****IIFs are additive across items, which is how one determines how well an entire set of items functions as a latent-trait measure. A numeric comparison is sketched below.
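A minimal numeric sketch of these points, assuming NumPy; the a, b, and c values are made up.

```python
# Minimal sketch comparing 2PL and 3PL item information for the same a and b (made-up values),
# showing that guessing (c > 0) lowers information and shifts the maximum slightly above b.
import numpy as np

def item_info(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P - c)/(1 - c))^2; with c = 0 this is the 2PL."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    q = 1 - p
    return a**2 * (q / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-3, 3, 601)
a, b = 1.2, 0.0
info_2pl = item_info(theta, a, b, c=0.0)
info_3pl = item_info(theta, a, b, c=0.2)

print(theta[np.argmax(info_2pl)])        # maximum at theta = b for the 2PL
print(theta[np.argmax(info_3pl)])        # maximum slightly above b when c > 0
print(info_2pl.max(), info_3pl.max())    # guessing reduces the maximum information
```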

METHODS - How can you screen for multicollinearity?

*1) Tolerance:* Most programs protect against MC by computing SMCs for the variables. The SMC is the squared multiple correlation of a variable when it serves as the DV with the rest of the variables as IVs in a multiple regression. If the SMC is high, the variable is highly related to the others in the set and you have MC. Many programs convert the SMC for each variable to tolerance (1 - SMC) and deal with tolerance instead of the SMC (T&F, 2007). a) Tolerance is the reciprocal of the VIF. It tells us how much of the variance in one IV is independent of the other IVs. A commonly used rule of thumb is that tolerance values of .10 or less indicate a potentially serious MC problem in the regression equation. b) If the tolerance is too low, the variable does not enter the analysis. Default tolerance levels range between .01 and .0001, so the SMCs are .99 to .9999 before variables are excluded. *2) Variance Inflation Factor (VIF):* Provides an index of the amount by which the variance of each regression coefficient is increased relative to a situation in which all the predictor variables are uncorrelated. A VIF is calculated for each term in a regression equation, excluding the intercept. A commonly used rule of thumb is that any VIF of 10 or more provides evidence of serious MC involving the corresponding IV. A VIF of 10 means that there is roughly a 3.16-fold (slightly more than threefold, since sqrt(10) = 3.16) increase in the SE of the beta weight relative to the situation of no correlation among the IVs (Cohen et al., 2003). A short sketch of computing both diagnostics follows.
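A minimal sketch of computing both diagnostics, assuming statsmodels and pandas are available; the predictor names and deliberately collinear data are made up.

```python
# Minimal sketch of screening for multicollinearity with tolerance and VIF.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # deliberately collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

X_const = sm.add_constant(X)               # VIFs are computed per predictor column
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")  # tolerance = 1 - SMC = 1/VIF
```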

METHODS - Power p1 1) What is statistical power? Tabachnick & Fidell (2007) and Howell (2002)

*1) What is statistical power?* *a)* The power of a test is the probability of rejecting the null hypothesis when it is actually false (i.e., when the alternative hypothesis is true). *b)* Where significance deals with Type I errors, power deals with Type II errors. A Type II error is accepting a false null hypothesis (thinking you do not have a relationship when in fact you do). Power is 1 - beta, where beta is the chance of making a Type II error. Social scientists often use the .80 level as a cutoff: there should be at least an 80% chance of not making a Type II error. This is more lenient than the .05 level used in significance testing. The leniency is justified on the grounds that greater care should be taken in asserting that a relationship exists (as shown by significance < .05) than in failing to conclude that a relationship exists. Obviously, the .80 level of power is arbitrary and the researcher must set the level appropriate for his or her research needs. *c)* In general, there is a trade-off between significance and power. Selecting a stringent significance level such as .001 will increase the chance of Type II errors and thus will reduce the power of the test. However, if two types of significance tests show the same level of significance for given data, the test with the greater power is used. It should be noted that in practice many social scientists do not consider or report the power of the significance tests they use, though they should.

METHODS - Power p2 2) What factors influence the power of a statistical test? Tabachnick & Fidell (2007) and Howell (2002)

*2) What factors influence the power of a statistical test?* *a)* Issues of power are best considered in the planning stage of a study, when the researcher determines the required sample size. The researcher estimates the anticipated effect (e.g., expected mean difference), the variability expected in assessment of the effect, the desired alpha level (ordinarily .05), and the desired power (often .80). These four estimates are required to determine the necessary sample size. Failure to consider power in the planning stage often results in failure to find a significant effect (Tabachnick & Fidell, 2007). *b)* Power is a function of several variables (Murphy & Myors, 1998). Studies have higher levels of statistical power under the following conditions: *1)* *Sensitivity*: Studies are highly sensitive (tdtw sampling error introduces imprecision into the results of a study). Researchers might increase sensitivity by using better measures, or a study design that allows them to control for unwanted sources of variability in their data. *The simplest method of increasing the sensitivity of a study is to INCREASE THE SAMPLE SIZE (N)*. As N increases, statistical estimates become more precise and the power of statistical tests increases. *2)* *Size of the Effect in the Population*: Effect sizes are large. Different treatments have different effects. It is easiest to detect the effect of a treatment if that effect is large (e.g., when treatments account for a substantial proportion of variance in outcomes). When treatments have very small effect sizes, those effects can be difficult to detect reliably. Power increases as effect size increases. *3)* *Standards or criteria used to test statistical hypotheses*: Standards are set that make it easier to reject the null hypothesis. It is easier to reject the null when the significance criterion, or alpha level, is .05 than when it is .01 or .001. Power increases as the standard for determining significance becomes more lenient.

METHODS - Power p3 3) Given a simple experiment where participants are randomly assigned to treatment and control conditions, how would you evaluate the power of the statistical test? Tabachnick & Fidell (2007) and Howell (2002)

*3) Given a simple experiment where participants are randomly assigned to treatment and control conditions, how would you evaluate the power of the statistical test?* *a)* Assuming equal sample sizes... suppose we wish to test the difference between two treatments and either expect that the difference in population means will be approximately 5 points or else are interested in finding a difference of at least 5 points. From past data you think that sigma is approximately 10. Then: d = (mu1 - mu2) / sigma. This gives the expected difference between the two population means in SD units. You then investigate the power of the experiment with, say, 25 observations in each of the 2 groups, where n = the number of cases in any one sample: compute the noncentrality parameter delta = d * sqrt(n/2) and look up (or compute) the corresponding power. This tells you the chance of actually rejecting the null if it is false.
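A minimal sketch of the same calculation using statsmodels' power routines (assumed to be available); the 5-point difference, sigma of 10, and n of 25 per group come from the example above.

```python
# Minimal sketch of power for a two-group experiment with d = 5 / 10 = 0.5 and n = 25 per group.
from statsmodels.stats.power import TTestIndPower

d = 5.0 / 10.0                  # expected mean difference divided by sigma (Cohen's d)
analysis = TTestIndPower()

power = analysis.power(effect_size=d, nobs1=25, alpha=0.05, ratio=1.0, alternative="two-sided")
print(f"Power with n = 25 per group: {power:.2f}")

# Sample size per group needed to reach the conventional .80 power level
n_needed = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05, ratio=1.0)
print(f"n per group for .80 power: {n_needed:.1f}")
```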

METHODS - Describe high partial or imperfect MC and how it manifests

*Absence of high partial MC* Where there is high but imperfect MC, a solution is still possible but as the independents increase in correlation with each other, the standard errors of the regression coefficients will become inflated. High MC does not bias the estimates of the coefficients, only their reliability. This means that it becomes difficult to assess the relative importance of the independent variables using beta weights. It also means that a small number of discordant cases potentially can affect results strongly. The importance of this assumption depends on the type of MC. In the discussion below, the term "independents" refers to variables on the right-hand side of the regression equation other than control variables. *1) MC among the independents:* This type of MC is the main research concern as it inflates standard errors and makes assessment of the relative importance of the independents unreliable. However, if sheer prediction is the research purpose (as opposed to causal analysis), it may be noted that high MC of the independents does not affect the efficiency of the regression estimates. *2) MC of independent construct components:* When two or more independents are components of a scale, index, or other construct, high intercorrelation among them is intentional and desirable. Ordinarily this is not considered "MC" at all, but collinearity diagnostics may report it as such. Usually the researcher will combine such sets of variables into scales or indices prior to running regression, but at times the researcher may prefer to enter them individually and interpret them as a block. *3) MC of crossproduct independents:* Likewise, crossproduct interaction terms may be highly correlated with their component terms. This is intentional and usually is not considered MC but collinearity diagnostics may report it as such. *4) MC of power term independents:* Power terms may be correlated with first-order terms. The researcher should center such variables to eliminate MC associated w/ the first-order variable's mean. This is particularly necessary when the mean is large. *5) MC among the controls:* High MC among the control variables will not affect research outcomes provided the researcher is not concerned with assessing the relative importance of one control variable compared to another. *6) MC of controls w/ independents:* This is not necessarily a research problem but rather MAY mean that the control variables will have a strong effect on the independents, showing the independents are less important than their uncontrolled relation with the dependent would suggest. *7) MC of independents or controls with the DV:* This is not a research problem unless the high correlation indicates definitional overlap, such that the apparent effect is actually a tautological artifact. Otherwise high correlation of the independents with the dependent simply means the model explains a great deal of the variance in the DV. High correlation of the controls will be associated with strong control effects on the independents. This type of high correlation ordinarily is not considered "MC" at all.

METHODS - Advantages and disadvantages of each approach (PCA vs PFA)

*Advantages of PCA and PFA: A) PCA* 1) Simpler model than ML and PAF; better for summarizing / reducing variables 2) Defines components as a summary of the data 3) Doesn't need to rely on h^2 estimates or an iterative procedure, and is therefore less likely to produce bizarre results 4) PCs converge after 1 iteration 5) The PCs are ordered, with the first component extracting the most variance and the last component the least variance 6) The solution is mathematically unique and, if all components are retained, exactly reproduces the observed correlation matrix 7) Since the components are orthogonal, their use in other analyses may greatly facilitate interpretation of results *B) PFA* 1) Conceptually more meaningful; better for defining the theoretical variables we are interested in 2) It is widely used (and understood) and it conforms to the factor analytic model in which common variance is analyzed with unique and error variance removed. *Disadvantages of PCA and PFA: A) PCA* 1) Not based on a theory of the relation between observed and latent variables 2) Psychometrics typically views the latent variable as the cause of the observed variables 3) Observed variables are imperfect indicators of latent variables (i.e., measured w/ error) *B) PFA* 1) Can take more than 1 iteration to converge because we look at the change in h^2 estimates across iterations 2) Sometimes not as good as other extraction techniques in reproducing the correlation matrix 3) Communalities must be estimated and the solution is, to some extent, determined by those estimates

METHODS - Assumption of Homogeneity of Error Variances Previous comps question: "Many statistical methods assume homogeneity of error variances, although the assumption differs slightly across methods. Describe the assumption in the context of ANOVA, Multiple Regression, and MANOVA. Why is this assumption necessary? How can the assumption be tested? What should be done if the assumption is violated?"

*In Multiple Regression (MR)* Homoscedasticity: The researcher should test to assure that the residuals are dispersed randomly throughout the range of the estimated dependent. Put another way, the variance of the residual error should be constant for all values of the IV(s). If not, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated, "conventionally computed CIs and conventional t-tests for OLS estimators can no longer be justified" (Berry, 1993). However, moderate violations of homoscedasticity have only minor impact on regression estimates (Fox, 2005). Nonconstant error variance can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the DV increases. Nonconstant error variance can indicate the need to respecify the model to include omitted IVs. Lack of homoscedasticity may mean: 1) There is an interaction effect between a measured IV and an unmeasured IV not in the model; OR 2) That some IVs are skewed while others are not. *HOW TO DEAL* 1) One usual method of dealing with heteroscedasticity is to use WEIGHTED LEAST SQUARES REGRESSION instead of OLS regression. This causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the DV may also reduce or eliminate lack of homoscedasticity. 2) NO OUTLIERS. Outliers are a form of violation of homoscedasticity. Detected in the analysis of residuals and leverage statistics, these are cases with high residuals (errors) which are clear exceptions to the regression explanation. Outliers can affect regression coefficients substantially. The set of outliers may suggest/require a separate explanation. Some computer programs allow an option of listing outliers directly, or there may be a "casewise plot" option which shows cases more than 2 SD from the estimate. To deal with outliers, you may remove them from the analysis and seek to explain them on a separate basis, or transformations may be used which tend to "pull in" outliers. These include the square root, logarithmic, and inverse (x = 1/x) transforms. 3) The leverage statistic, h, also called the hat-value, is available to identify cases which influence the regression model more than others. The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model). A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately. In SPSS, the minimum, maximum, and mean leverage are displayed in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. Also, select Analyze, Regression, Linear; click Save; check Leverage to add these values to your data set as an additional column. a) Data labels: influential cases with high leverage can be spotted graphically. 4) Mahalanobis distance is leverage times (n - 1), where n is sample size. As a rule, the maximum Mahalanobis distance should not exceed the critical chi-squared value with degrees of freedom equal to the number of predictors and alpha = .001, or else outliers may be a problem in the data. The maximum Mahalanobis distance 
is displayed by SPSS in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. 5) Cook's distance (D) is another measure of the influence of a case. Cook's distance measures the effect of deleting a given observation. Observations with larger D values than the rest of the data are those which have unusual leverage. Fox (1991) suggests, as a cut-off for detecting influential cases, values of D greater than 4/(n - k - 1), where n is the number of cases and k is the number of IVs. Others suggest D > 1 as the criterion for a strong indication of an outlier problem, with D > 4/n the criterion for a possible problem. 6) DfBeta, called standardized dfbeta(s) in SPSS, is another statistic for assessing the influence of a case. If dfbeta > 0, the case increases the slope; if < 0, the case decreases the slope. The case may be considered an influential outlier if the absolute value of dfbeta is greater than 2/SQRT(n). 7) dfFit measures how much the estimate changes as a result of a particular observation being dropped from the analysis. *In ANOVA* HOMOGENEITY OF VARIANCES. The DV should have the same variance in each category of the IV. When there is more than one IV, there must be homogeneity of variances in the cells formed by the independent categorical variables. The reason for this assumption is that the denominator of the F-ratio is the within-group mean square, which is the average of the group variances taking group sizes into account. When groups differ widely in variance, this average is a poor summary measure. Violation of the homogeneity of variances assumption will increase Type I errors in the F tests (wrongly rejecting the null hypothesis). The more unequal the sample sizes in the cells, the more problematic violation of the homogeneity assumption becomes. However, ANOVA is robust for small and even moderate departures from homogeneity of variance (Box, 1954). Still, a rule of thumb is that the ratio of the largest to smallest group variance should be 3:1 or less; Moore (1995) suggests the more lenient standard of 4:1. When choosing, remember that the more unequal the sample sizes, the smaller the differences in variances which are acceptable. Marked violations of the homogeneity of variances assumption can lead to either over- or under-estimation of the significance level and disrupt the F-test. 1) Levene's test of homogeneity of variance is computed by SPSS to test the ANOVA assumption that each group (category) of the IV(s) has the same variance (a quick example is sketched after this section). If the Levene statistic is significant at the .05 level or better, the researcher rejects the null hypothesis that the groups have equal variances. The Levene test is robust in the face of departures from normality. Note, however, that failure to meet the assumption of homogeneity of variances is not fatal to ANOVA, which is relatively robust, particularly when groups are of equal sample size. When groups are of very unequal sample size, Welch's variance-weighted ANOVA is recommended. 2) Bartlett's test of homogeneity of variance is an older alternative to Levene's test. Bartlett's test is a chi-square statistic with (k - 1) df, where k is the number of categories in the IV. Levene's test has largely replaced Bartlett's test bc Bartlett's is dependent on meeting the assumption of normality. 3) Box plots are a graphical way of testing for lack of homogeneity of variances. One requests side-by-side boxplots of each group, such that samples form the x axis.
The more the width of the boxes varies markedly by sample, the more the assumption of homogeneity of variances is violated. *In MANOVA* HOMOSCEDASTICITY (homogeneity of variances and covariances): within each group formed by the categorical IVs, the variance of each interval DV should be similar, as tested by Levene's test. Also, for each of the k groups formed by the IVs, the covariance bt any two DVs must be the same. When sample sizes are unequal, tests of group differences (Wilks, Hotelling, Pillai-Bartlett, GCR) are not robust when this assumption is violated. The Pillai-Bartlett trace was found to be more robust than the alternatives when this assumption was violated but the sample sizes of the groups were equal (Olson, 1976). 1) Box's M tests MANOVA's assumption of homoscedasticity using the F distribution. If p(M) < .05, then the covariance matrices are significantly different. Thus we want M not to be significant, i.e., we want to retain the null hypothesis that the covariance matrices are homogeneous. That is, the probability value of this F should be greater than .05 to demonstrate that the assumption of homoscedasticity is upheld. Note, however, that Box's M is extremely sensitive to violations of the assumption of normality, making the Box's M test less useful than might otherwise appear. For this reason, some researchers test at the p = .001 level, especially when sample sizes are unequal. 2) Levene's Test: SPSS also outputs Levene's test as part of MANOVA. If Levene's test is significant, then the data fail the assumption of equal group variances.
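A minimal sketch of Levene's test outside of SPSS, assuming SciPy is available; the three groups (one with a deliberately larger variance) are simulated for illustration.

```python
# Minimal sketch of testing homogeneity of variances with Levene's test before an ANOVA.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group1 = rng.normal(0, 1.0, size=40)
group2 = rng.normal(0, 1.0, size=40)
group3 = rng.normal(0, 2.5, size=40)   # deliberately larger variance

stat, p = stats.levene(group1, group2, group3, center="median")
print(f"Levene W = {stat:.2f}, p = {p:.4f}")
# p < .05 -> reject the null of equal variances; consider Welch's ANOVA or a DV transformation
```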

METHODS - Assumption of Sphericity Previous comps question: "Define the assumption of sphericity which is required for repeated-measures ANOVA. When will this be a problem? What effect does violation of this assumption have on the statistical sig. test? What should be done if this assumption is violated?"

*SPHERICITY ASSUMPTION* 1) What is sphericity? a) Sphericity refers to the equality of the variances of the differences between levels of the repeated-measures factor. In other words, we calculate the differences between each pair of levels of the repeated-measures factor and then calculate the variance of these difference scores. Sphericity requires that the variances of each set of difference scores be equal. 2) How do you assess sphericity? a) When you conduct an ANOVA with a repeated-measures factor, SPSS will automatically conduct a test for sphericity, Mauchly's test. Interpreting this test is straightforward: when the significance level of Mauchly's test is less than .05, we cannot assume sphericity; when it is .05 or greater, we can assume sphericity. 3) How do you deal with violations of sphericity? a) When you conduct an ANOVA w/ a repeated-measures factor, SPSS will automatically generate three corrections for violations of sphericity: the Greenhouse-Geisser, the Huynh-Feldt, and the lower-bound corrections. To correct for the violation, these corrections alter the degrees of freedom, thereby altering the significance value of the F-ratio. There are different opinions about the best correction to apply; the one you choose depends on the extent to which you wish to control for Type I errors. A good rule of thumb is to use the Greenhouse-Geisser estimate unless it leads to a different conclusion from the other two. Another option when sphericity is violated is to use the multivariate tests (Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, Roy's Greatest Root). However, multivariate tests can be less powerful than their univariate counterparts. In general, the following guideline is useful: when you have a large violation of sphericity (epsilon < .7) and your sample size is greater than 10 plus the number of levels of the repeated-measures factor, multivariate tests are more powerful; in other cases, the univariate test should be preferred. The epsilon value is found in the Mauchly's test output.

PSYCHOMETRICS - IRT "What are the assumptions of IRT?"

*The mathematical models employed in IRT specify that an examinee's probability of answering a given item correctly depends on the examinee's ability or abilities and the characteristics of the item. IRT models include a set of assumptions about the data to which the model is applied. Although the viability of the assumptions cannot be determined directly, some indirect evidence can be collected and assessed, and the overall fit of the model to the test data can be assessed as well (Hambleton et al., 1991). An assumption common to the most widely used IRT models is that only one ability is measured by the items that make up the test (the assumption of unidimensionality). A concept RELATED to unidimensionality is local independence, which states that, conditional on theta, the probability of responding correctly to an item is independent of the responses to any other item. Assumptions: 1) Unidimensionality 2) Local independence (in other words, once you take theta into account, there is no residual correlation between items). This can be tested through residual plots, checking that there is no pattern in the item residuals across items (errors are not correlated between items). - May be violated by item sets or item enemies. *Similarity to CTT: this is essentially the same assumption made in CTT when it is assumed that the error component is uncorrelated with all other components of the test.

PSYCHOMETRICS - CTT vs IRT Previous comps question: 1) "How do classical test theory and item response theory differ in the conceptualization of measurement error (i.e., unreliability?) 2) Explain how IRT can be used to develop a test that is both shorter and more reliable than using a CTT approach.

1) 1) In IRT, the SEM changes as a fxn of examinee trait level 2) In CTT, the SEM is constant, regardless of examinee raw score 2) Pros and cons of IRT IRT PROS 1) Stronger, potentially more useful results as compared to CTT 2) Based on strong statistical models; a much more scientific approach to measurement 3) Characterizes error much better than CTT 4) Enables new applications like DIF and adaptive testing; greatly enhances existing applications like equating 5) Parameter invariance IRT CONS 1) IRT makes more and stronger assumptions than CTT 2) Requires statistical estimation, not just computation 3) Overall, adds numerous kinds of practical complications Overall, IRT is an extension of CTT and a partner, not a successor. IRT exhibits a number of advantages over CTT as well as some weaknesses, mainly due to the complexity of IRT. In terms of the advantages, IRT produces stronger and more useful results than CTT, particularly in terms of reliability. IRT is also based on much stronger, scientific models than CTT and thus produces more valid results. Moreover, since IRT is a model-based approach, statistics can be used to test whether the data fit the model appropriately. Additionally, IRT characterizes error much better than CTT because IRT provides estimates of error for individual ability estimates instead of just a single estimate of error for all people who take a given test. In terms of practical applications, IRT enables a number of new applications, like DIF and CAT. Disadvantages of IRT include its stronger assumptions (unidimensionality and local independence, as explained in the previous question), which are harder to satisfy than the assumptions of CTT. IRT also requires complex estimation of the item parameters and a person's ability in order to create the IRT statistics and graphs; in CTT it is much simpler because you only need simple statistics to determine how good an item is. Finally, there are practical downsides to IRT in that a larger sample is required (N = 200 at least) and the software is more complex and more difficult to interpret than CTT statistics. Some aspects of CTT and IRT are similar, in particular some of the statistics that are analyzed. e.g., the a parameter (slope) is analogous to the CTT biserial item-total correlation and the b parameter is analogous to the proportion correct in CTT.

PSYCHOMETRICS - Local independence assumption 1) "Why might the assumption of local independence seem counterintuitive?" 2) "Why can local independence be obtained even when the dataset is not unidimensional?" (NOTE, when unidimensionality is true, local independence is obtained, but not necessarily vice versa) 3) How to test for unidem. and local independ?

1) An examinee's responses to several test items cannot be expected to be uncorrelated; that is, the responses are unlikely to be independent. When variables are correlated, they have some traits in common. When these traits are "partialled out" or "held constant," the variables become uncorrelated. This is the basic principle underlying factor analysis. Similarly, in IRT, the relationships among an examinee's responses to several test items are due to the traits (abilities) influencing performance on the items. After "partialling out" the abilities (i.e., conditioning on ability), the examinee's responses to the items are likely to be independent. 2) When the assumption of unidimensionality is true, local independence is obtained: the two concepts are equivalent (Lord, 1980; Lord, 1968). Local independence can be obtained, however, even when the dataset is not unidimensional: local independence will be obtained when ALL the ability dimensions influencing performance have been taken into account (Hambleton et al., 1991). - Unidimensionality is essentially the same thing as local independence (it is assumed that a single trait underlies all item responses, and if there are item-specific factors, they are all mutually uncorrelated; dimensionality is best viewed from a factor-analytic perspective). 3a) To test local independence, there are LID statistics that can be computed; the most widely used is Yen's Q3 statistic, which correlates the item residuals after statistically controlling for theta (a small sketch follows). 3b) To test for unidimensionality, Lord recommended factoring the items (ideally the tetrachoric intercorrelation matrix; tetrachoric correlations correct for the artificial dichotomy of both dichotomous item scores) and examining the scree plot to ensure: 1) that the first eigenvalue is far greater than the second, and 2) that the second and subsequent eigenvalues are homogeneously small.
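A minimal sketch of the idea behind Yen's Q3, assuming NumPy; here the "controlling for theta" step uses simulated true thetas and 2PL parameters, whereas in practice the residuals would be formed from estimated thetas and calibrated item parameters.

```python
# Minimal sketch of the Q3 idea: correlate person-by-item residuals after removing theta.
import numpy as np

rng = np.random.default_rng(11)
n_persons, n_items = 1000, 5
theta = rng.normal(size=n_persons)
a = rng.uniform(0.8, 1.6, size=n_items)
b = rng.uniform(-1.5, 1.5, size=n_items)

p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))      # model-implied P, person x item
u = (rng.random((n_persons, n_items)) < p).astype(float)  # simulated right/wrong responses

residuals = u - p                          # residuals with theta partialled out
q3 = np.corrcoef(residuals, rowvar=False)  # item-by-item residual correlation (Q3) matrix
np.fill_diagonal(q3, np.nan)

print(np.nanmax(np.abs(q3)))               # large |Q3| values flag possible local dependence
```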

PSYCHOMETRICS - Computer Adaptive Testing (CAT) Lecture Notes Previous comps question: "How does CAT work?"

1) An initial ability estimate is assumed as the current ability estimate 2) The item pool is searched for an informative item, given the current ability estimate 3) The informative item is administered and scored 4) The current ability estimate is updated based on the right or wrong response 5) Steps 2-4 are repeated until a stopping criterion is reached. Each person is thus given a "tailored" test that is maximally informative for their theta-hat (subject to the limitations of the item pool). In general, the initial item is an item of medium difficulty; when the test-taker answers correctly, the CAT administers a harder item, and when the test-taker answers incorrectly, an easier item. If all goes well, each test-taker will have taken a test whose difficulty was optimized for their ability and everyone will have gotten about 50% of the items correct. IRT is the psychometric theory needed to make CAT work: IRT provides a way to scale people and items, item information provides a mechanism for selecting items, and theta-hat is the "score" on the test that is comparable even though different people take different sets of items. A minimal simulation of this loop is sketched below.
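A minimal simulation of this loop under stated assumptions (a 2PL pool, a grid-based EAP update, a 30-item cap, and an SEM-based stopping rule); the item parameters and true theta are made up, and no exposure-control algorithm is included.

```python
# Minimal sketch of the CAT loop: select the most informative item, administer, update theta.
import numpy as np

rng = np.random.default_rng(42)
pool_a = rng.uniform(0.8, 2.0, size=200)     # hypothetical 2PL item pool
pool_b = rng.uniform(-2.5, 2.5, size=200)
true_theta = 1.0                             # simulated examinee

grid = np.linspace(-4, 4, 161)
prior = np.exp(-0.5 * grid**2)               # standard normal prior (unnormalized)

def p2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

administered, responses = [], []
theta_hat, sem = 0.0, np.inf                 # step 1: initial ability estimate

for _ in range(30):                          # cap test length at 30 items
    info = pool_a**2 * p2pl(theta_hat, pool_a, pool_b) * (1 - p2pl(theta_hat, pool_a, pool_b))
    info[administered] = -np.inf             # do not reuse items
    item = int(np.argmax(info))              # step 2: most informative item at current theta
    resp = rng.random() < p2pl(true_theta, pool_a[item], pool_b[item])  # step 3: administer/score
    administered.append(item)
    responses.append(int(resp))

    # step 4: update ability (EAP over the grid) and its posterior SD (the SEM)
    like = prior.copy()
    for i, u in zip(administered, responses):
        p = p2pl(grid, pool_a[i], pool_b[i])
        like *= p**u * (1 - p) ** (1 - u)
    post = like / like.sum()
    theta_hat = float((grid * post).sum())
    sem = float(np.sqrt(((grid - theta_hat) ** 2 * post).sum()))

    if sem < 0.30:                           # step 5: accuracy-based stopping criterion
        break

print(len(administered), round(theta_hat, 2), round(sem, 2))
```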

PSYCHOMETRICS - DIF Previous comps question: "Where should differential functioning be assessed?"

1) At the item level: Does the item fxn diff for individuals from two diff subpops with the same true score on the underlying construct? 2) At the test level: DTF assesses tdtw the expected test score is the same for two examinees from diff subpops with identical standing on the underlying construct. *Assessing DTF is important because most decisions are made at the test rather than the item level *Most decisions are not made by examining whether a person correctly answered an individual item, but by examining performance on the entire test *Important to note that even if a test contains items that have DIF, the test may not necessarily fxn diff (may not have DTF) a) This may be because the level of DIF is not practically sig b) Some DIF items may favor the focal group and some items may favor the reference group, essentially canceling the overall effects of DIF at the test level

PSYCHOMETRICS - DIF Previous comps question: "What is DIF?"

1) DIF is when, given the same standing on an underlying construct, individuals from diff groups have diff probs of answering the item correctly 2) The prob of getting an item correct should NOT depend on group membership. Basically, this means that the ICCs for 2 groups should be virtually identical. If the IRFs are not identical, this indicates that the prob of getting an item right depends on ability AND group membership - this is DIF 3) In other words, an item functions diff when the prob of answering an item correctly depends not only on the underlying ability or standing on a trait, but also on group membership 4) DIF is usually assessed between two groups, a reference group and a focal group *The focal group is the group of interest and the reference group is the group to which the focal group is being compared *These labels are arbitrary; which is the focal group and which is the reference group depends on which group you want to examine (and bc of Lord's paradox, one group will not always have a higher prob of getting the item correct when the ICCs cross - it depends on where on the ability continuum you are looking) 5) Diffs bt probs can be identified using ICCs. DIF implies that the 2 groups do not share the same ICC. When the ICCs (after equating onto the same metric) are equivalent (within the range expected by sampling error), the item does not have DIF 6) There are 2 types of DIF: a) Uniform DIF occurs when the item is more difficult at all ability levels for one group than for the other (so the ICCs do not cross) b) Non-uniform DIF does not favor the same group across all levels of theta, so the ICCs do cross (in other words, there is an interaction between ability level and group, so that the item is more difficult for one group at lower levels of ability but more difficult for the other group at higher levels of ability - Lord's paradox) *According to Raju (1988), non-uniform DIF will occur in the 2PL and 3PL models when the "a" parameters for the focal and reference groups are unequal 7) Metric issues *To identify DIF, the parameters for the focal and reference groups must be placed on a common metric, and so the theta values must be linked *e.g., item parameters estimated separately for 2 diff groups will differ because the metric defined by each independent calibration of the items will differ (each group will have diff estimates of the a, b, and c parameters). Identifying items with DIF requires that the item parameters for the reference and focal groups be compared, and so the item parameters must be on the same metric. *However, one must use only unbiased items to link the theta values, even though these are the same items that are being examined for DIF ***To avoid this problem, Candell 1983 recommended an iterative linking procedure. A sketch of one DIF method (Mantel-Haenszel) follows.
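A minimal sketch of one DIF method, the Mantel-Haenszel procedure, stratifying on a matching score; the responses, group labels, and built-in uniform DIF are simulated, and the matching variable here is a crude stand-in for the total test score that would be used in practice.

```python
# Minimal sketch of a Mantel-Haenszel DIF analysis for one studied item.
import numpy as np

rng = np.random.default_rng(8)
n = 2000
group = rng.integers(0, 2, size=n)                 # 0 = reference, 1 = focal
theta = rng.normal(size=n)                         # same ability distribution in both groups
total = np.clip(np.round(theta * 3 + 10), 0, 20)   # crude matching score (stratification variable)

# Studied item: extra difficulty for the focal group builds in uniform DIF
p_item = 1 / (1 + np.exp(-(theta - 0.2 - 0.5 * group)))
item = (rng.random(n) < p_item).astype(int)

num, den, sum_a, sum_e, sum_v = 0.0, 0.0, 0.0, 0.0, 0.0
for s in np.unique(total):
    mask = total == s
    a = np.sum((group[mask] == 0) & (item[mask] == 1))   # reference correct
    b = np.sum((group[mask] == 0) & (item[mask] == 0))   # reference incorrect
    c = np.sum((group[mask] == 1) & (item[mask] == 1))   # focal correct
    d = np.sum((group[mask] == 1) & (item[mask] == 0))   # focal incorrect
    t = a + b + c + d
    if t < 2:
        continue
    num += a * d / t
    den += b * c / t
    sum_a += a
    sum_e += (a + b) * (a + c) / t                                   # expected count
    sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (t**2 * (t - 1))  # variance

alpha_mh = num / den                                # MH common odds ratio
chi2_mh = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v   # MH chi-square (1 df, continuity corrected)
delta_mh = -2.35 * np.log(alpha_mh)                 # ETS delta metric; |delta| >= 1.5 is often treated as large DIF
print(round(alpha_mh, 2), round(chi2_mh, 2), round(delta_mh, 2))
```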

PSYCHOMETRICS 1) What is b? 2) What does an increase in b lead to?

1) The difficulty of an item, aka the point on the ability scale where the prob of a correct response is .50 2) An increase in the ability needed for the examinee to have a 50% chance of getting the item right; hence, the harder the item. Difficult items are located to the right, or the higher end, of the ability scale; easy items are located to the left, or the lower end, of the ability scale (Hambleton et al., 1991) -The point of inflection, where the rate of change shifts from accelerating increases to decelerating increases, occurs where the prob of passing the item is .50 -The b parameter for an item is the point on the ability scale where the prob of a correct response is .50. This parameter is a location parameter, indicating the position of the ICC in relation to the ability scale. The greater the value of the b parameter, the greater the ability needed to have a 50% chance of getting the item right -When the ability values of a group are transformed so that their mean is 0 and their SD is 1, the values of b vary (typically) from about -2 to +2. Values of b near -2 correspond to very easy items, and values of b near +2 to very difficult items -More extreme difficulty parameters have larger standard errors

PSYCHOMETRICS - DIF Previous comps question: "What are some concepts related to DIF?"

1) Fairness - can be defined in a # of ways and mean diff things to diff people and therefore can't really be tested: The absence or presence of DIF does not imply or deny fairness 2) Bias - systematic diffs that are unrelated to the trait being measured. Bias exists in regards to construct validity when a test is shown to measure diff traits for one group than another or to measure the same trait but with differing degrees of accuracy 3) Impact - systematic diffs in group means: impact and DIF are two diff concepts and one does not imply the other *An item has impact if it's more or less difficult for one group than the other. this may simply reflect real group diffs *In some situations, impact is expected. e.g., an item that reflects a high school reading level would be expected to show impact between a group of 11th graders and a group of 8th graders

PSYCHOMETRICS - How to develop a shorter test using IRT?

1) You have to develop a large pool of items that maximize item information at different ability levels 2) In order to have a test with fewer items but also a low SEM, you want to use the items with the highest IIFs. An item's IIF is maximized at the theta level where the prob of a correct response is .5, which (for the 1PL and 2PL models) is theta = b 3) The SEM at the test level equals 1/sqrt(TIF), where the TIF equals the sum of all the IIFs 4) Choose the items that have the maximal IIF for each ability level

METHODS - What can be done to handle multicollinearity?

1) Increasing the sample size is a common first step, since when sample size is increased, standard error decreases (all other things equal). This partially offsets the problem that high MC leads to high standard errors of the b and beta coefficients. 2) Use centering: transform the offending independents by subtracting the mean from each case. The resulting centered data may well display considerably lower MC. You should have a theoretical justification for this, consistent with the fact that a zero b coefficient will now correspond to the independent being at its mean, not at zero, and interpretations of b and beta must be changed accordingly. Centering is particularly important when using quadratic (power) terms in the model. 3) Combine variables into a composite variable. This requires some theory which justifies the combination conceptually. This method also assumes that the beta weights of the variables being combined are approximately equal; in practice, few researchers test for this. 4) Remove the most intercorrelated variable(s) from the analysis. This method is misguided if the variables were there due to the theory of the model, which they should have been. 5) Drop the intercorrelated variables from the analysis but substitute their crossproduct as an interaction term, or in some other way combine the intercorrelated variables. This is equivalent to respecifying the model by conceptualizing the correlated variables as indicators of a single latent variable. NOTE: If a correlated variable is a dummy variable, other dummies in that set should also be included in the combined variable in order to keep the set of dummies conceptually together. 6) Leave one intercorrelated variable as is, but remove the variance in its covariates by regressing them on that variable and using the residuals. 7) Assign the common variance to each of the covariates by some (probably arbitrary) procedure. 8) Treat the common variance as a separate variable and decontaminate each covariate by regressing it on the others and using the residuals. That is, analyze the common variance as a separate variable. 9) Use orthogonal PCA, then use the components as independents.

METHODS - MANOVA Previous Comps Question: 1) When is it appropriate to use MANOVA?

1) MANOVA is used to see the main and interaction effects of categorical variables on multiple dependent interval variables. MANOVA uses one or more categorical independents as predictors, like ANOVA, but unlike ANOVA, there is more than one DV. Where ANOVA tests the differences in means of the interval dependent for various categories of the independent(s), MANOVA tests the differences in the centroid (vector) of means of the multiple interval dependents for various categories of the independent(s). One may also perform planned comparisons or post-hoc comparisons to see which values of a factor contribute most to the explanation of the dependents. There are multiple potential purposes for MANOVA: 1) To compare groups formed by categorical IVs on group differences in a set of interval DVs 2) To use lack of difference for a set of DVs as a criterion for reducing a set of IVs to a smaller, more easily modeled number of variables 3) To identify the IVs which differentiate a set of DVs the most. Multivariate analysis of covariance (MANCOVA) is similar to MANOVA, but interval independents may be added as "covariates." These covariates serve as control variables for the independent factors, serving to reduce the error term in the model. Like other control procedures, MANCOVA can be seen as a form of "what if" analysis, asking what would happen if all cases scored equally on the covariates, so that the effect of the factors over and beyond the covariates can be isolated. *ASSUMPTIONS OF MANOVA* *1) Observations are independent of one another.* MANOVA is NOT robust when the selection of one observation depends on the selection of one or more earlier ones, as in the case of before-after and other repeated-measures designs. However, there does exist a variant of MANOVA for repeated-measures designs. *2) The IV (or IVs) is categorical.* *3) The DVs are continuous and interval-level.* *4) Low measurement error of the covariates.* The covariates are continuous and interval-level, and are assumed to be measured without error. Imperfect measurement reduces the statistical power of the F test for MANCOVA, and for experimental data there is a conservative bias (increased likelihood of Type II errors: thinking there is no relationship when in fact there is a relationship).

PSYCHOMETRICS - Special Correlations Occurring in Test Development SUPER IMPORTANT KNOW THIS COLD Previous comps question: "There are four 'special correlations' which occur frequently in psychometric work; describe them in detail"

1) Pearson correlation (USE WHEN BOTH VARIABLES ARE CONTINUOUS) - The ubiquitous, "regular" correlation that you know and love. No "corrected" form; ranges from -1 to 1. 2) Point-biserial correlation (USE WHEN ONE VARIABLE IS DICHOTOMOUS AND THE OTHER IS CONTINUOUS) - Occurs when you correlate a right/wrong item score with a relatively continuous variable like a scale score or criterion. Range of values depends on the underlying relationship and the proportion getting the item right/wrong; range is maximized when px = py = .5 - otherwise, cannot obtain -1 or 1. Numerically equivalent to the Pearson correlation. 3) Phi correlation (USE WHEN BOTH VARIABLES ARE DICHOTOMOUS) - Occurs when two dichotomous variables, like two item right/wrong scores, are correlated. Range of values depends on the underlying relationship and the proportions getting the items right/wrong; range is maximized when px = py = .5 - otherwise, cannot obtain -1 or 1. Numerically equivalent to the Pearson correlation. 4) Biserial correlation (USE WHEN ONE VARIABLE IS DICHOTOMOUS AND THE OTHER IS CONTINUOUS) - Similar to the point-biserial; corrects for the artificial dichotomy of the dichotomous item score. If the correction works, values range from -1 to 1; it is possible to obtain values beyond -1, 1. Numerically always greater in magnitude than the point-biserial/Pearson correlation (can be written as the point-biserial with a correction factor). 5) Tetrachoric correlation (USE WHEN BOTH VARIABLES ARE DICHOTOMOUS) - Similar to the phi correlation; corrects for the artificial dichotomy of both dichotomous item scores. If the correction works, values range from -1 to 1; it is possible to obtain values beyond that. Useful in factor analysis of items (but sensitive to guessing). Difficult to calculate.
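A minimal numpy sketch of these distinctions on invented item data: the point-biserial and phi coefficients are computed simply as Pearson correlations on the raw scores, and the biserial is obtained from the point-biserial with the textbook correction factor r_bis = r_pb * sqrt(pq)/y (y = normal ordinate at the cut dividing p and q); treat the specific numbers as illustrative only.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

item = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])              # right/wrong item score
total = np.array([28, 15, 22, 30, 18, 25, 14, 27, 29, 17])   # continuous scale score
item2 = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 1])             # a second dichotomous item

r_pb = pearson(item, total)    # point-biserial = Pearson on a 0/1 and a continuous variable
r_phi = pearson(item, item2)   # phi = Pearson on two 0/1 variables

# Biserial correction: r_bis = r_pb * sqrt(p*q) / y, with y the standard normal
# ordinate at the z-value that cuts off the upper proportion p (those passing).
p = item.mean()
q = 1 - p
zs = np.linspace(-4, 4, 8001)
upper_area = np.array([1 - 0.5 * (1 + erf(z / sqrt(2))) for z in zs])
z_cut = zs[np.argmin(np.abs(upper_area - p))]       # crude inverse-normal lookup
y = exp(-z_cut ** 2 / 2) / sqrt(2 * pi)
r_bis = r_pb * sqrt(p * q) / y

print("point-biserial:", round(r_pb, 3))
print("phi:           ", round(r_phi, 3))
print("biserial:      ", round(r_bis, 3))   # always larger in magnitude than r_pb
```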

METHODS - Difference bt PCA and PFA

1) Principal Components Model: By far the most common form of factor analysis, PCA seeks a linear combination of variables such that the maximum variance is extracted from the variables. It then removes this variance and seeks a second linear combination which explains the maximum proportion of the remaining variance, and so on. This is called the principal axis method and results in orthogonal (uncorrelated) factors. PCA analyzes total (common and unique) variance. For example, component 1 extracts the maximum variance, then component 2 extracts the maximum of the remaining variance. Results in uncorrelated factors. 2) Principal Factor Model: Also called principal axis factoring (PAF) and common factor analysis, PAF is a form of factor analysis which seeks the least number of factors which can account for the common variance (correlation) of a set of variables, whereas the more common PCA in its full form seeks the set of factors which can account for all the common and unique (specific plus error) variance in a set of variables. PAF/PFA uses a PCA strategy but applies it to a correlation matrix in which the diagonal elements are not 1's, as in PCA, but iteratively-derived estimates of the communalities (the R-squared of a variable using all factors as predictors). In other words, it only considers common variance and seeks out the least number of factors which can explain that common variance. *Differences between PCA and FA* 1) The difference bt FA and PCA lies in the reason that variables are associated with a factor or component. Factors are thought to "cause" variables - the underlying construct (the factor) is what produces scores on the variables (Tabachnick & Fidell, 2007). 2) Components are simply aggregates of correlated variables. The variables "cause" or produce the component. There is no underlying theory about which components should be associated with which variables; they are simply empirically associated (T & F, 2007). 3) In either PCA or PFA, the variance that is analyzed is the sum of the values in the positive diagonal. a) In PCA, 1s are in the diagonal and there is as much variance to be analyzed as there are observed variables; each variable contributes a unit of variance by contributing a 1 to the positive diagonal of the correlation matrix. ALL the variance is distributed to components, including error and unique variance for each observed variable. So, if all the components are retained, PCA duplicates exactly the observed correlation matrix and the standard scores of the observed variables (T & F, 2007). b) In FA, only the variance that each observed variable shares with other observed variables is available for analysis. Exclusion of error and unique variance from FA is based on the belief that such variance only confuses the picture of underlying processes (T & F, 2007). 1) Shared variance is estimated by communalities. Communalities are values between 0 and 1 that are inserted in the positive diagonal of the correlation matrix. The solution in FA concentrates on variables with high communality values. The sum of the communalities is the variance that is distributed among factors and is less than the total variance in the observed variables. Because unique and error variances are omitted, a linear combination of the factors approximates, but does not duplicate, the observed correlation matrix and scores on observed variables (T & F, 2007). 4) PCA analyzes variance; FA analyzes covariance (communality). 5) Goal of PCA is to extract maximum variance from a dataset with a few orthogonal components. Goal of FA is to reproduce the correlation matrix w/ a few orthogonal factors.
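A minimal numpy sketch of the diagonal difference described above, using a made-up correlation matrix: PCA eigendecomposes R with 1s on the diagonal (total variance), while PAF first replaces the diagonal with communality estimates (here a single, non-iterated pass using squared multiple correlations; operational PAF iterates these estimates).

```python
import numpy as np

R = np.array([[1.00, 0.60, 0.50, 0.15],
              [0.60, 1.00, 0.55, 0.10],
              [0.50, 0.55, 1.00, 0.20],
              [0.15, 0.10, 0.20, 1.00]])

# PCA: analyze total variance (1s on the diagonal)
pca_eigvals = np.linalg.eigvalsh(R)[::-1]

# PAF (one pass): put SMC communality estimates on the diagonal
smc = 1 - 1 / np.diag(np.linalg.inv(R))     # R^2 of each variable on all the others
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
paf_eigvals = np.linalg.eigvalsh(R_reduced)[::-1]

print("PCA eigenvalues:", pca_eigvals.round(3), "sum =", round(pca_eigvals.sum(), 3))
print("PAF eigenvalues:", paf_eigvals.round(3), "sum =", round(paf_eigvals.sum(), 3))
# The PCA eigenvalues sum to the number of variables (total variance);
# the PAF eigenvalues sum only to the estimated common variance.
```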

PSYCHOMETRICS - Reliability Previous comps question: "Briefly describe 3 tactics for increasing the estimated reliability of a test. Discuss the strengths and weakness of each tactic."

1) Test format: MC vs essay a) Strengths: Increased reliability b) Weaknesses: May not get the quality of info that you are seeking as you would with an essay test 2) Test w/ less error variance a) Strengths: Accounts for any systematic errors that can be controlled 3) Test w/ more true score variance (i.e., add more items) a) Strength: Generalizability to other populations 4) Increasing internal consistency (test w/ strongly intercorrelated items) a) Strengths: Increased reliability b) Weaknesses: Potentially decreased validity (overly homogeneous items can narrow the construct measured) 5) Increasing test length a) Strengths: In CTT, tests that are longer are more reliable. b) Weaknesses: In IRT, the claim that longer tests are more reliable is discounted. This is illustrated by comparing the SEM between traditional fixed-content tests and adaptive tests. The SEM from a 20-item CAT is lower for most trait levels than that from a 30-item fixed-content test. This is a typical pattern for CAT. The implication, of course, is that the shorter test yields less measurement error. Thus a more reliable "test" has been developed at a shorter length with items of the same quality (i.e., item discriminations are constant). Thus a composite reliability across trait levels would show the short CAT as more reliable than the longer normal-range test.
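A minimal sketch of the CTT side of tactic 5, using the Spearman-Brown prophecy formula with purely illustrative numbers:

```python
def spearman_brown(rel: float, k: float) -> float:
    """Projected reliability of a test lengthened by a factor of k (parallel items)."""
    return k * rel / (1 + (k - 1) * rel)

print(round(spearman_brown(0.70, 2), 3))    # doubling a .70 test -> about .82
print(round(spearman_brown(0.70, 0.5), 3))  # halving it -> about .54
```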

PSYCHOMETRICS 1) What is a? 2) What does an increase in a lead to?

1) a is the slope parameter (the amount of information an item provides). 2) An increase in a leads to an increase in the information an item provides around the difficulty parameter. - The larger the a parameter, the taller the information curve, which implies that the item does a better job of discriminating between individuals at that ability level. However, if the a parameter is very big it will only do a good job of discriminating among people in a limited ability range - meaning that the item information function (IIF) will be very tall but very narrow. - Negatively discriminating items are discarded from ability tests because something is wrong with an item if the probability of answering it correctly decreases as examinee ability increases (Hambleton et al., 1991). - It is unusual to obtain a values larger than 3. The usual range for item discrimination parameters is (0, 2). High values of a result in item characteristic fxns that are very "steep," and low values of a lead to item characteristic fxns that increase gradually as a function of ability (Hambleton et al., 1991). - ICCs may have diff slopes bc they vary in their a parameter; therefore ICCs in the 2PL model can cross each other. - Lord's paradox: because ICCs can cross in this model, Lord's paradox can occur. Lord's paradox is where item difficulty cannot be unilaterally determined along the ability continuum. Along certain points, one item will be considered more difficult, and after the ability level where the ICCs cross, the other item will be considered more difficult. - If a = 0, the ICC is a flat line - which means that everyone has the same prob. of getting the item correct. No matter what ability level you have, the prob. of getting the item correct is the same. - If a = infinity, this is referred to as a Guttman item.
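A minimal numpy sketch of these points under the 2PL model (scaling constant D omitted for simplicity; parameter values invented):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(theta) = 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """2PL item information: I(theta) = a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

for a in (0.5, 1.0, 2.0):                       # same difficulty (b = 0), increasing a
    at_b = info_2pl(0.0, a, 0.0)                # information right at theta = b
    away = info_2pl(1.5, a, 0.0)                # information 1.5 units away from b
    print(f"a = {a}: info at b = {at_b:.3f}, info at b + 1.5 = {away:.3f}")
# Larger a -> much taller peak at b, but information falls off faster away from b.
# a = 0 would give a flat ICC: the same probability correct at every ability level.
```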

PSYCHOMETRICS - Item parameters have an effect on IIFs 1) A larger "a" parameter results in ________________________ IIF (i.e., more info over a small range of scores). 2) The location of the fxn (and thus info) depends on ______________ (i.e., want to be sure that your item difficulties are appropriate for the purpose of your test). 3) As "c" becomes larger, the IIF becomes _________ (because guessing makes test scores less precise) and ___________ (relatively lower info for lower abilities).

1) a taller, more peaked IIF 2) the "b" (difficulty) parameter 3) lower, more asymmetric
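A minimal numpy sketch of answer 3, using the usual 3PL form and its information expression (parameter values invented); with c > 0 the information is lower everywhere and especially below b:

```python
import numpy as np

def info_3pl(theta, a, b, c):
    """3PL: P = c + (1 - c)/(1 + exp(-a(theta - b))); I = a^2 * ((P - c)/(1 - c))^2 * ((1 - P)/P)."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a ** 2 * ((p - c) / (1 - c)) ** 2 * ((1 - p) / p)

theta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("c = 0.00:", info_3pl(theta, a=1.5, b=0.0, c=0.00).round(3))
print("c = 0.25:", info_3pl(theta, a=1.5, b=0.0, c=0.25).round(3))
# With c > 0, information drops everywhere, and drops most below b:
# guessing makes responses from low-ability examinees less informative.
```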

METHODS - Cross-Validation Previous Comps Questions 1) Using data from a criterion-related validity study, you have created a multiple regression equation that predicts job performance as a function of seven test scores. Describe how you would use cross-validation to evaluate the stability of the regression equation. How is cross-validation related to the adjusted-R-squared statistic?

Cohen et al. (2003): The real question in prediction is not how well the regression equation determined for a sample works on that sample, but rather how well it works in the population or on other samples from the population (Cohen et al., 2003). Note that this is NOT the adjusted R-squared but rather an estimate of the "cross-validated" R-squared for each sample's betas applied to the other sample, which is even more shrunken and which may be estimated by: Cross-validated R² = 1 - (1 - R²)[(n + k)/(n - k)], where R² = the sample estimate, n = sample size, and k = # of predictors. NOTE: As n gets very large, this estimate approaches the sample value; as k gets large relative to n, the cross-validated R² becomes considerably smaller than the sample R². Cross-validated R-squared answers the relevant question: "If I were to apply the sample regression weights to the population, or to another sample from the population, for what proportion of the Y variance would my thus-predicted Y values account?" So rather than actually doing a cross-validation, we can estimate what would happen if we did do a cross-validation with the above formula. *Cross-validation:* To perform cross-validation, a researcher will either gather two large samples or one very large sample which will be split into two samples via random selection procedures. The prediction equation is created in the first sample. The equation is then used to create predicted scores for the members of the second sample. The predicted scores are then correlated with the observed scores on the DV (r_yy'). This is called the cross-validity coefficient. The difference between the original R-squared and r²_yy' is the shrinkage. The smaller the shrinkage, the more confidence we can have in the generalizability of the equation (Osborne, 2000). a) Shrunken or adjusted R-squared: It is often desirable to have an estimate of the population squared multiple correlation (and of course we want one that is more accurate than the positively biased sample R-squared). Such an estimate of the population squared multiple correlation is given by: Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1). This estimate is appropriately smaller than the sample R-squared and is often referred to as the "shrunken" R-squared. OTHER NOTES: The magnitude of shrinkage will be larger for small values of R-squared than for larger values, other things being equal. Shrinkage is also larger as the ratio of the # of IVs to the # of subjects increases.
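A minimal numpy sketch, on simulated data with 7 predictors, of the two estimates above plus an actual split-sample cross-validation; the formulas are taken from the text, while the sample size and effect sizes are invented.

```python
import numpy as np

def adjusted_r2(r2, n, k):
    """'Shrunken' R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def cross_validated_r2(r2, n, k):
    """Estimated cross-validated R^2 = 1 - (1 - R^2)(n + k)/(n - k)."""
    return 1 - (1 - r2) * (n + k) / (n - k)

rng = np.random.default_rng(1)
n, k = 300, 7
X = rng.normal(size=(n, k))
y = 0.4 * (X @ rng.normal(size=k)) + rng.normal(size=n)

# Fit the equation in the first half, apply the weights to the second half
half = n // 2
X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]
A1 = np.column_stack([np.ones(half), X1])              # add an intercept column
b = np.linalg.lstsq(A1, y1, rcond=None)[0]             # OLS weights from sample 1
r2_sample1 = np.corrcoef(A1 @ b, y1)[0, 1] ** 2
y2_pred = np.column_stack([np.ones(n - half), X2]) @ b
r2_cross = np.corrcoef(y2_pred, y2)[0, 1] ** 2         # squared cross-validity coefficient

print("sample R^2:                 ", round(r2_sample1, 3))
print("adjusted R^2 estimate:      ", round(adjusted_r2(r2_sample1, half, k), 3))
print("cross-validated R^2 estimate:", round(cross_validated_r2(r2_sample1, half, k), 3))
print("actual cross-validated R^2: ", round(r2_cross, 3))
```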

METHODS - MANOVA Previous comps question: "Discuss the research questions addressed by MANOVA, the underlying statistical model, significance testing and follow-up procedures used with each method."

MANOVA: The fundamental equation, the underlying statistical model, significance testing and follow-up procedures: *MANOVA follows the model of ANOVA where variance in scores is partitioned into variance attributable to differences among scores within groups and to differences among groups. Squared differences between scores and various means are summed; these sums of squares, when divided by appropriate degrees of freedom, provide estimates of variance attributable to different sources (main effects of IVs, interactions among IVs, and error). Ratios of variances provide tests of hypotheses about the effects of IVs on the DV. In MANOVA, however, each subject has a score on each of several DVs. When several DVs for each subject are measured, there is a matrix of scores (subjects by DVs) rather than a simple set of DVs within each group. Matrices of difference scores are formed by subtracting from each score an appropriate mean; then the matrix of differences is squared. When the squared differences are summed, a sum-of-squares-and-cross-products matrix, an S matrix, is formed, analogous to a sum of squares in ANOVA. Determinants of the various S matrices are found, and ratios between them provide tests of hypotheses about the effects of the IVs on the linear combinations of DVs.* *THERE ARE TWO MAIN STEPS TO MANOVA* 1) Define a new linear composite (LC) of the outcome variables that best distinguishes among groups. a) New variable = y* = a1y1 + a2y2 (this is a weighted composite of the old variables). b) If there is a larger difference between groups on y1 (more variability), then y1 would get more weight bc we are trying to maximize the differences bt groups. c) If there is a similar distance bt the two groups on both variables, then they are given similar weights. d) The key is to maximize between-group variance relative to within-group variance. 2) When you have the new LC, the next step is to determine if the groups differ on the LC of the outcome variables; this is the ultimate goal. a) You are more likely to see sig diffs bt groups when outcome variables have oval patterns; this means there is less within-group variance. b) When you have more than two groups, they do not line up to make a single clear linear composite (LC). i) The 1st LC accounts for the most difference; however, this single variable does not capture all the difference between groups, so a 2nd dimension must be defined. ii) The biggest difference bt groups is seen on the 1st dimension, and progressively smaller differences are seen on each subsequent LC. iii) Different composites may distinguish among the groups better on different variables. *MANOVAs vs separate ANOVAs* 1) May have more power using MANOVA (but depends on the situation). 2) Implication for Type I error: when you do multiple tests (ANOVAs), Type I error builds up. COULD correct the ANOVAs with a Bonferroni adjustment, but that lowers the alpha level and lessens power. 3) ANOVAs ignore correlations among outcome variables; this has implications for interpretation. 4) Doing MANOVA looks at diffs bt groups on a LC, and the composite may reflect an underlying theoretical variable; looking at the weights for diff variables on the LC gives us theoretical constructs that distinguish among groups.
*WHEN NOT TO USE MANOVA* Huberty and Morris argued that MANOVA is overused and that LCs of unrelated variables are not meaningful; conceptually, the variables in the LC should be related for the composite to make sense: 1) USE SEPARATE ANOVAs when: a) Outcome variables are conceptually distinct b) Exploratory analysis c) Comparing new research to past research d) Want to demonstrate that groups are equivalent (more meaningful to check diffs on specific variables than on the whole set). *ASSUMPTIONS OF MANOVA* *1) Outcomes have a multivariate normal distribution* a) Test each outcome separately through histograms; look to see if the data look normally distributed. b) F-tests are generally robust to violations of this assumption, especially as sample size increases. c) Non-normality can reduce power; sometimes you can transform the data. *2) Homogeneity of the covariance matrices.* a) Assumes the entire covariance matrix is the same for each group. b) Equal correlations and variances among variables across groups. c) Look at Box's M test in SPSS; if sig., then a violation has occurred. d) If sig., look at Levene's test; this will tell you which variables are showing heterogeneity. i) If sig., there is a problem. ii) Heterogeneity of variance can be related to non-normal data. e) If a violation has occurred, the test is robust to it if the group sample sizes are relatively equal. f) If sample sizes are different, the test results can be biased. i) If the group with the smaller "n" has the larger variance, the test will have an inflated Type I error rate. ii) If the group with the larger "n" has the larger variance, the test will be overly conservative and power will be low. g) What to do if this assumption is violated? i) Transform the data. ii) Alternative tests. iii) Adjust the alpha level based on concern about high Type I error (lower alpha) or low power (raise alpha). *3) Observations are independent.* *COMPUTATIONAL PROCEDURES OF MANOVA* 1) When you only have 2 groups, use Hotelling's T² test; this is a simple form of MANOVA. a) Hotelling's T-squared is almost identical to the univariate case, just taking the t-test and generalizing it to the multivariate case. b) H0: mu1 = mu2 (centroid of group 1 = centroid of group 2); groups equal on all variables. c) The complication is that there are no standard tables for T-squared, so it can't be tested for sig. directly; the resolution is to transform T-squared into an F-stat to test for sig. 2) With more than 2 groups... a) The null hypothesis is that there is *no diff* *on the LC* of scores bt each of the groups. b) Partition the variance into 2 sources - bt group (hypothesis) and within group (error). c) Need to look at the cov matrix (bc there are multiple variables, look at all variances and intercorrelations = covariance matrix). d) SSCP = sums of squares and cross-products matrix; the covariance matrix can be computed from this. e) COV matrix = (1/(n-1)) SSCP f) Partition the SSCP into between and within components (they are additive). g) SSCP(total) = SSCP(between) + SSCP(within) h) For the SSCP (between), aka H, the diagonal elements are the SS of the variables - looks like the original SSCP, but now looks at group means instead of individual data points. i) For the SSCP (within), aka E, measures the within-group variances and cross-products of the variables. j) Need to compute H E^-1 in order to get the F-test... this gives you a matrix; still want a single figure for the F-test, so diagonalize H E^-1. i) After diagonalizing, the diagonal elements are the eigenvalues, with the largest value in the top left spot.
ii) Each eigenvalue represents the relative b/t group to w/in group variance on a linear composite of the outcome variables. iii) By doing this, you create a number of new LCs, which are orthogonal to each other. iv) LC: y1* = a11y1 + a12y2 + ...; lambda1 = ratio of b/t to w/in group variance on y1*. v) The 1st eigenvalue represents the LC w/ the largest ratio of b/t to w/in group variance; this is the best at distinguishing among groups. vi) The 2nd eigenvalue is the LC that is 2nd best at distinguishing b/t groups, and so on. vii) End up with a set of orthogonal LCs, with each one explaining as much variance as possible. viii) By diagonalizing HE^-1, we are able to find the optimal LCs and the ratios (lambdas) of b/t to w/in group variance. ix) Once we know the eigenvalues, solve for the eigenvectors, which are the sets of weights defining the LCs (the # of eigenvalues will be the smaller of the # of variables and the number of groups - 1). k) Test statistic: compute an overall test stat representing all the lambdas; there is no simple solution, and there are several test stats that combine the lambdas (SPSS gives you 4 test stats to summarize the lambdas, which are the relative b/t group to w/in group variance on LCs of the outcome variables): i) Wilks' lambda - gives an exact value with 2 eigenvalues ii) Pillai's trace (V) iii) Roy's largest root - only takes into account the 1st, largest eigenvalue iv) Hotelling's trace - sum of all eigenvalues. l) Now you have one number that summarizes the w/in and b/t group variances; however, none of these statistics can be tested for sig. directly - need to translate them into an F-statistic (then evaluate the F-statistic for sig.). (A numeric sketch of these computational steps appears after this answer.) *Follow-up tests: if MANOVA is sig., what do you do next?* Identify which outcome variables are contributing to this diff; 2 types of follow-up questions... which outcome variables contribute to the diff, and which groups are diff from each other? 1) Which outcome variables contribute to the difference? a) Protected F-test (most common) i) Separate ANOVAs on each of the outcome variables; look at each sig. effect (main or interaction). ii) Doing ANOVA after MANOVA controls for Type I error build-up (bc you do not do the ANOVAs unless the overall F-test is sig.). iii) If the null hypothesis is partially true, you get inflated Type I errors. iv) If the null hypothesis is completely true, Type I error is controlled. v) This approach ignores correlations among variables. b) Step-down F-test i) Only works when you have a logical ordering of outcome variables (outcomes are causally related). ii) Series of ANCOVAs (ANCOVA looks at group diffs controlling for covariates/nuisances). iii) Take the 1st outcome variable and do a univariate F-test for sig diffs bt groups (ANOVA). iv) Take the 2nd outcome and do an ANCOVA controlling for the 1st outcome; take the 3rd outcome and control for the 1st and 2nd outcomes to see if this variable has an effect on group diffs. v) At each step you get unique group diffs controlling for the other variables (tells us the extent to which each variable has a unique diff). vi) This analysis adjusts for correlations bt variables and only looks at the unique contributions of each variable to group diffs. c) Discriminant Function i) Discriminant function (DF) = the set of weights that define the LC (MANOVA looks at diffs bt groups on the LC; do groups differ on the LC?). ii) Look at the weights to determine the relative contribution of each variable to defining the group difference. iii) Discriminant coefficients - the actual raw weights. iv) Standardized coefficients - between -1 and 1; represent unique contributions of each outcome variable to the group difference.
v) Discriminant loadings can be interpreted like factor loadings: the correlation bt each outcome variable and the DF; useful for assigning meaning to the theoretical variable that is distinguishing bt groups (loadings help to define what the LCs mean). Identify which populations are different from each other; which groups are diff from each other? 1) Multivariate post-hoc comparisons i) Run Hotelling's T² test for each pair of groups to see if there is a sig diff bt that pair of groups. a) Compare and repeat for all groups/pairs. b) Control for Type I error build-up with Bonferroni corrections (but with many groups, your sig criteria will be too conservative and you lose power). ii) Roy-Bose simultaneous CIs a) Build a CI around each group on the IV (e.g., types of training) and each outcome variable to control Type I error for all possible comparisons (each training group gets a CI on each DV). b) If the CIs overlap then there is no diff, and vice versa. c) This is used instead of Hotelling's t-test, but it is very conservative and tends to have low power with small Ns. IN PRACTICE NOT TOO MANY PEOPLE USE THE ABOVE 2 METHODS AND MOST USE UNIVARIATE POST HOCS. 2) Univariate post-hoc comparisons... after identifying a variable that differs across groups, focus on each outcome variable that shows a difference across groups separately. i) Plot/look at means to see where each group falls on the outcome variable; where are the differences? *MOST IMPORTANT THING TO DO* ii) There is a variety of post-hoc tests that give you more info on which groups differ... t-tests on pairs of groups (watch for Type I error build-up). iii) Adjustments for Type I error build-up = Bonferroni, Scheffé (more conservative), Tukey, Duncan (more power).
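A minimal numpy sketch of the computational steps above on made-up data (3 groups, 2 DVs): build the between (H) and within (E) SSCP matrices, take the eigenvalues of E^-1 H, and combine them into the four summary statistics listed.

```python
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, scale=1.0, size=(20, 2))        # 3 groups, 2 DVs
          for m in ([0.0, 0.0], [0.6, 0.2], [1.0, 0.8])]

grand_mean = np.vstack(groups).mean(axis=0)
H = np.zeros((2, 2))   # between-group SSCP
E = np.zeros((2, 2))   # within-group SSCP
for g in groups:
    d = g.mean(axis=0) - grand_mean
    H += len(g) * np.outer(d, d)
    centered = g - g.mean(axis=0)
    E += centered.T @ centered

eig = np.linalg.eigvals(np.linalg.inv(E) @ H).real
eigvals = np.sort(eig)[::-1]                    # largest first

wilks = np.prod(1.0 / (1.0 + eigvals))          # Wilks' lambda
pillai = np.sum(eigvals / (1.0 + eigvals))      # Pillai's trace
hotelling = np.sum(eigvals)                     # Hotelling's trace
roy = eigvals[0]                                # Roy's largest root

print("eigenvalues:", eigvals.round(3))
print("Wilks:", round(wilks, 3), "Pillai:", round(pillai, 3),
      "Hotelling:", round(hotelling, 3), "Roy:", round(roy, 3))
```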

PSYCHOMETRICS - SEM Previous comps question: "A primary distinction between CTT and IRT lies in their approaches to characterizing and managing measurement precision. Describe one method of measuring CTT reliability and the IRT information function. Explain the difference in how these two approaches characterize measurement precision. Explain at least one difference in test development that arises from these diff approaches"

SEM (E&R, 2001) a) The SEM describes expected score fluctuation due to error. SEM is basic to describing the psychometric quality of a test and critical to individual score interpretations. The CIs defined by the SEM can guide score interpretations in several ways; for example, the difference between two scores may be interpreted as sig. if their CI bands do not overlap. b) SEM differs between CTT and IRT in several ways: 1) CTT specifies SEM constancy, whereas IRT specifies variability across individuals. 2) CTT specifies that SEM is population-specific, whereas IRT specifies that SEM is population-general. In IRT, you want to pick the items that have the maximum item information at a particular ability level, and then you won't need as many items in order to estimate someone's ability. Test length and reliability: CTT: Longer tests are more reliable than shorter tests (the Spearman-Brown prophecy formula represents this). If a test is lengthened by a factor of n parallel parts, true variance increases more rapidly than error variance (Guilford, 1954). IRT: Shorter tests can be more reliable than longer tests. In IRT, the claim that longer tests are more reliable is discounted. This is illustrated by comparing the SEM between traditional fixed-content tests and adaptive tests. The SEM from a 20-item CAT is lower for most trait levels than that from a 30-item fixed-content test. This is a typical pattern for CAT. The implication, of course, is that the shorter test yields less measurement error. Thus a more reliable "test" has been developed at a shorter length with items of the same quality (i.e., item discriminations are constant). Thus a composite reliability across trait levels would show the short CAT as more reliable than the longer normal-range test.
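A minimal numpy sketch contrasting the two precision summaries: CTT's single SEM = SD * sqrt(1 - r_xx) versus IRT's conditional SE(theta) = 1 / sqrt(I(theta)); the 20-item 2PL "test" below is entirely hypothetical.

```python
import numpy as np

def ctt_sem(sd: float, reliability: float) -> float:
    """CTT standard error of measurement: one value for everyone."""
    return sd * np.sqrt(1 - reliability)

def irt_se(theta, items):
    """IRT SE at theta from summed 2PL item information; items = list of (a, b)."""
    info = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        info += a ** 2 * p * (1 - p)
    return 1.0 / np.sqrt(info)

print("CTT SEM (SD = 10, r_xx = .91):", round(ctt_sem(10, 0.91), 2))

items = [(1.2, b) for b in np.linspace(-1.5, 1.5, 20)]   # hypothetical 20-item test
for theta in (-2.0, 0.0, 2.0):
    print(f"IRT SE at theta = {theta:+.1f}:", round(irt_se(theta, items), 2))
# The IRT SE varies across the trait continuum; the CTT SEM does not.
```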

PSYCHOMETRICS - DIF Previous comps question: "What are some statistical or other issues with DIF?"

STATISTICAL ISSUES 1) Typically the "c" parameters for 2 diff groups are not compared because they are unreliable estimates and can create large diffs between groups. 2) DIF methods accept the null: Hypothesis tests are designed to limit Type I errors when researchers want to demonstrate that there is an effect. DIF methods misuse hypothesis testing by assuming that items are not functioning differently; researchers often hope to demonstrate the absence of an effect. 3) Power analyses are generally unavailable: You can't determine whether items that do function differently were correctly identified, but if you don't have power, you can't identify those items. 4) Individual alpha levels ignore the familywise Type I error rate: This occurs when several comparisons are made among groups. If each comparison is made at a 95% CI, the alpha level is .05. This means that the probability of making a Type I error, or incorrectly rejecting the null, is .05. Over a series of several comparisons, this is inflated to a much larger level than .05, and therefore the probability of incorrectly rejecting the null is greater than .05. E.g., if making 6 comparisons among groups, the prob of making at least one error is given by the formula 1-(.95)^c: 1-(.95)^6 = .265. Can use Bonferroni-adjusted alpha levels, but this will decrease power. 5) Double-, triple-, or even quadruple-jeopardy issues: There are instances when a person belongs to two or more demographic groups, and that can become an issue when comparing individuals from diff groups. E.g., a middle-aged Caucasian woman can belong to at least 3 diff demographics. It is expected that each group membership may affect the functioning of an item in small ways, so you want to remove items that have large DIF. 6) Effect size vs sig test: A sig. test tells whether the results are diff from the null. The effect size tells how diff the results are: how important the results are, and how they should be interpreted. Effect sizes are also more important in determining the practical sig of the results. OTHER ISSUES 1) DIF can be viewed as circular: DIF assumes that most items are not functioning differentially, and based on this assumption, the method identifies items that are. 2) DIF explicitly ignores impact (group mean diffs). 3) You can have many DIF items but no DTF. 4) DIF methods never explain why an item has diff fxning: methods only flag items that have DIF, but don't give reasons why an item functions diff for the focal and reference groups. 5) Test misuse may well be a greater issue than DIF/DTF.
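A minimal sketch of the familywise arithmetic in point 4 (the .265 figure) and the corresponding Bonferroni-adjusted per-comparison alpha:

```python
def familywise_alpha(alpha: float, c: int) -> float:
    """Probability of at least one Type I error across c independent comparisons."""
    return 1 - (1 - alpha) ** c

comparisons = 6
print(round(familywise_alpha(0.05, comparisons), 3))   # 1 - .95^6 = .265
print(round(0.05 / comparisons, 4))                    # Bonferroni per-test alpha = .0083
```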

METHODS - Subjectivity of Factor Analysis Previous Comps Question 1) It has been said that factor analysis is the psychometrician's Rorschach test, because the method relies heavily on the judgment of the researcher. Describe each stage of the factor analysis process where judgment plays a role. Describe methods that can be used to limit the subjectivity at each stage.

Steps of factor analysis: Step 1) Correlation matrix - Need high enough correlations to continue with FA. Step 2) Determine the number of factors based on criteria: a) Scree plot. The Cattell scree test plots the components on the X axis and the corresponding eigenvalues on the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. This rule is sometimes criticized for being amenable to researcher-controlled "fudging." That is, because picking the "elbow" can be subjective (the curve may have multiple elbows or be a smooth curve), the researcher may be tempted to set the cut-off at the number of factors desired by his or her research agenda. b) Valid imputation of factor labels. Factor analysis is notorious for the subjectivity involved in imputing factor labels from factor loadings. For the same set of factor loadings, one researcher may label a factor "work satisfaction" and another may label the same factor "personal efficacy," for instance. The researcher may wish to involve a panel of neutral experts in the imputation process, though ultimately there is no "correct" solution to this process. Step 3) Extraction method. Step 4) Rotation. Step 5) Interpretation.
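A minimal sketch of the scree logic with invented eigenvalues: list them in order and look at the successive drops; the "elbow" is where the drops level off, and the subjectivity lies in deciding exactly where that is.

```python
# Eigenvalues below are made up for illustration only.
eigenvalues = [4.2, 2.1, 1.1, 0.6, 0.5, 0.45, 0.4, 0.35]

drops = [round(eigenvalues[i] - eigenvalues[i + 1], 2)
         for i in range(len(eigenvalues) - 1)]
print("eigenvalues:     ", eigenvalues)
print("successive drops:", drops)   # large drops early, then they level off (the elbow)
```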

PSYCHOMETRICS - Validity SUPER IMPORTANT KNOW THIS COLD Previous comps question: "Describe all types of validity in detail"

The MOST important aspect of any measurement activity. TESTS DO NOT "HAVE" VALIDITY; INFERENCES HAVE/LACK DEGREES OF VALIDITY. Most psychometric theory deals with reliability, rather than validity. Traditionally considered to have at least 3 facets: 1) Content: Does a test cover the domain of KSAOs comprehensively? 2) Criterion-related: Do the test scores correlate highly with important criteria? 3) Construct: Can you show that the test is measuring the KSAOs that it is designed to measure? 4) Face: Does the test seem valid to test-takers and test users? CONTENT VALIDITY: Often evidence comes from the test-development process itself. ***Lawshe (1975) developed the CVR empirical method (using some subjective judgments). ***Can definitely be used to defend selection tests but it is not very empirical. ***Is much more a property of the test (in relation to the job/domain). CRITERION-RELATED VALIDITY: Most "empirical" of the facets; most quantitative methods rely upon this kind of validity. ***Some CTT results relate criterion-related validity to reliability. ***Quite common for one test to be correlated with multiple criteria. ***A correction for criterion unreliability may be desired/needed (Rothstein, 1990): correcting for unreliability in both variables, rho_TxTy = rho_xy / sqrt(rho_xx' * rho_yy'); correcting for criterion unreliability only, rho_xTy = rho_xy / sqrt(rho_yy'). CONSTRUCT VALIDITY: Importance varies; usually look for multiple sources of evidence of construct validity such as theoretical predictions or patterns of correlations. FACE VALIDITY: Very frequently the most important aspect of a test; user acceptance can be an overriding consideration; often related to content validity.
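A minimal sketch of the attenuation corrections noted above, with invented reliabilities and an invented observed validity:

```python
from math import sqrt

def correct_both(r_xy, r_xx, r_yy):
    """Correlation between true scores: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / sqrt(r_xx * r_yy)

def correct_criterion_only(r_xy, r_yy):
    """Correct only for criterion unreliability: r_xy / sqrt(r_yy)."""
    return r_xy / sqrt(r_yy)

print(round(correct_both(0.30, 0.85, 0.60), 3))        # ~ .42
print(round(correct_criterion_only(0.30, 0.60), 3))    # ~ .39
```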

METHODS - What are the mathematical model and computational procedure of PCA?

fill in AND CHECK Steps in running a principal components analysis (PCA): *Step 1) Look at the correlation matrix (is there enough correlation to justify looking for PCs?)* a) The KMO test for sampling adequacy is an indicator of how well the variables are correlated (greater than .60 is acceptable, closer to .80 is good; less than .60 is poor). b) Bartlett's test of sphericity is an indicator of whether or not the correlation matrix differs sig. from an identity matrix. It has increased sensitivity with increased N (so KMO is more useful). *Step 2) Determine the # of factors* a) Kaiser criterion: Keep all factors w/ eigenvalues > 1. Can sometimes lead to retaining too many factors. b) Proportion of variance accounted for by each PC: take the eigenvalue and write it as a percentage, i.e., variance of the PC / sum of all variance (total); increased variance is increased explanatory power. Requires a judgment call about which PCs to keep (suggested that if less than 5%, it's trivial). c) Total variance accounted for: keep adding PCs until you can explain a sufficient amount of the data. d) Scree plot (Cattell) is a graph of the eigenvalues; look to see where the elbow (point of inflection) occurs. Good when you have clear-cut data. Sometimes no natural break occurs. e) Interpretation: Look at the solution and try to interpret meaningful factors. Clean pattern of loadings (no cross-loadings!). All variables should be represented; if not, add more factors/PCs. No single-variable PCs (otherwise you are not reducing the data). Do the PCs make sense? *Step 3) Compute loadings.* In SPSS, the extraction sums of squared loadings for each PC: square the loadings & add them up. In PCA, the ESSL = the eigenvalue. *Step 4) Rotate loadings for increased interpretability* (shift the loadings so the data are represented in a more interpretable way). Rotation takes variance from the first PCs; after rotation, variance tends to be more evenly distributed across PCs. Basically, Step 4 introduces judgment. *Step 5) Choose the BEST solution.* a) When to interpret loadings (greater than .3): in order to be represented by a PC, a variable should have a loading higher than .30. b) Stevens' suggestion: When looking at loadings, you should consider sample size. It is easier to get sig loadings w/ increased N, so compute a statistic that gives you a criterion to judge which loadings should be represented. c) Comrey & Lee (fill in) d) Communality: Proportion of a variable's variance accounted for by the PCs. Increased communality means being represented well; decreased communality means not being represented well. Though, this doesn't mean a worthless item, it just means it is not measuring what the other variables are measuring. Not useful for determining the general PCs.
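A minimal numpy sketch of Steps 2-3 on a made-up correlation matrix: eigendecompose R, apply the Kaiser and proportion-of-variance rules, and compute unrotated loadings as eigenvector * sqrt(eigenvalue).

```python
import numpy as np

R = np.array([[1.00, 0.55, 0.45, 0.10, 0.12],
              [0.55, 1.00, 0.50, 0.08, 0.15],
              [0.45, 0.50, 1.00, 0.12, 0.10],
              [0.10, 0.08, 0.12, 1.00, 0.60],
              [0.12, 0.15, 0.10, 0.60, 1.00]])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("eigenvalues:", eigvals.round(3))
print("Kaiser rule keeps:", int(np.sum(eigvals > 1)), "components")
print("proportion of variance:", (eigvals / eigvals.sum()).round(3))

n_keep = int(np.sum(eigvals > 1))
loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])   # unrotated loadings
print("unrotated loadings:\n", loadings.round(2))
print("sum of squared loadings per PC (= eigenvalue):", (loadings ** 2).sum(axis=0).round(3))
```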

METHODS - Support for the No Difference between Groups Hypothesis Previous comps question: 1) Based on a review of the literature, you hypothesize that men and women do not differ in leadership style. Why is this hypothesis difficult to test using traditional null hypothesis testing? What alternatives can you suggest that would allow the researcher to conclude that the hypothesis is supported?

Want to test: H1 = No difference bt men and women; H0 = Men and women differ. But you can't, because in traditional NHST the null is that there is no difference and the research hypothesis is that there are differences between groups. So even if you FTR the null... you can't say that there are no differences. The most common statistical procedure is to pit a NULL hyp. (H0) against an alternative hyp. (H1). 1) The null hyp. usually refers to "no differences" or "no effect." 2) The null hyp. is a specific statement about results in a population that can be tested (and therefore nullified). 3) One reason that null hypotheses are often framed in terms of "no effect" is that the alternative that is implied by the hypothesis is easy to interpret. Thus, if researchers test and reject the hyp. that treatments have no effect, they are left with the alternative that treatments have at least some effect. 4) Another reason for testing the hyp. that treatments have no effect is that probabilities, test statistics, etc. are easy to calculate when the effect of treatments is assumed to be nil. 5) In contrast, if researchers test and reject the hyp. that the difference between treatments is a certain amount (e.g., 6 points), they are left with a wide range of alternatives (e.g., the diff is 5 points, 10 points, etc.), including the possibility that there is NO difference. ALTERNATIVES *CIs:* Does the CI include 0? If so, then you can assume that there is no difference. What does the interval tell you about your results? (See the sketch after this answer.) *READILY PROVIDES SIGNIFICANCE TESTING INFORMATION along with a range of plausible population values for a parameter.* If the null-hypothesized value of the parameter (often, zero) does not fall within the calculated CI, the null hypothesis can be rejected at the specified alpha level. *Therefore, CIs provide more info than the conventional null hypothesis sig. test procedure.* *1) Definition of CI:* The interval within which the population value of the statistic is expected to lie with a specified confidence (1 - alpha). If repeated samples were taken from the population and a CI for the mean were constructed from each sample, (1 - alpha) of them would contain the population mean (Cohen et al., 2003). The CI also provides a rough and easily computed index of power, with narrow intervals indicative of high power and wide intervals of low power (Cohen, 1994). Effect size: Is the effect size of leadership style the same for men and women? If so, then you can show that there is no difference between men and women on this variable. *What is the relationship between "effect size" and "significance"?* 1) Effect size quantifies the size of the difference between two groups and may therefore be said to be a true measure of the significance of the difference. 2) The statistical significance is usually calculated as a 'p-value': the probability that a difference of at least the same size would have arisen by chance, even if there really were no difference between the two populations. 3) There are a # of problems with using 'significance tests' in this way (see, for example, Cohen, 1994; Harlow et al., 1997; Thompson, 1999). The main one is that the p-value depends essentially on two things: the size of the effect and the size of the sample. One would get a 'significant' result either if the effect were very big (despite having only a small sample) or if the sample were very big (even if the actual effect size were tiny).
4) It is important to know the statistical significance of a result, since without it there is a danger of drawing firm conclusions from studies where the sample is too small to justify such confidence. However, stat. sig. does not tell you the most important thing: THE SIZE OF THE EFFECT. One way to overcome this confusion is to report the effect size, together with an estimate of its likely 'margin for error' or 'confidence interval.'
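A minimal sketch of the CI alternative, using invented summary statistics for the two groups: report the interval for the mean difference, whether it includes zero, and its width (narrow intervals indicating high power, per the Cohen 1994 point above).

```python
from math import sqrt

# Hypothetical summary statistics for a leadership-style scale
mean_m, sd_m, n_m = 3.52, 0.80, 120
mean_f, sd_f, n_f = 3.48, 0.85, 130

diff = mean_m - mean_f
se = sqrt(sd_m ** 2 / n_m + sd_f ** 2 / n_f)       # SE of the mean difference
ci = (diff - 1.96 * se, diff + 1.96 * se)          # 95% CI (normal approximation)

print("95% CI for the male-female difference:", tuple(round(x, 3) for x in ci))
print("includes zero:", ci[0] <= 0 <= ci[1])
print("interval width (narrow = high power):", round(ci[1] - ci[0], 3))
```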

METHODS - Quasi-Experimental Design: Appropriateness of ANCOVA Previous comps question: "Training evaluation research often utilizes quasi-experimental designs, where it is not possible to randomly assign participants to treatment conditions. In these cases, researchers often utilize Analysis of Covariance (ANCOVA) to control for pre-existing group diffs. Describe how you would perform this analysis using hierarchical multiple regression. Why do some critics warn against the use of ANCOVA to control for pre-existing group diffs?"

*Quasi-experimental design: Experiments that do not use random assignment to create the comparisons from which treatment-caused change is inferred.* FIRST, WHAT IS ANCOVA? ANCOVA is used to test the effects of a categorical predictor on a continuous outcome variable, controlling for the effects of other continuous variables (covariates). ANCOVA uses built-in regression, using the covariates to predict the DV, then does an ANOVA on the residuals to see if the factors are still sig. related to the DV after the variation due to the covariates has been removed. Criticism of using ANCOVA to control for pre-existing differences in quasi-experiments: Controlling for pre-existing diffs: Can ANCOVA be modeled using regression? 1) YES, if dummy variables are used for the categorical IVs. When creating dummy variables, one must use one fewer dummy than there are categories of each IV. For full ANCOVA, one would also add the interaction crossproduct terms for each pair of IVs included in the equation, including the dummies. Then one computes the multiple regression. The resulting F tests will be the same as in classical ANCOVA. The ANCOVA statistical model seemed at first glance to have all of the right components to correctly model data from the nonequivalent groups design (NEGD). But we found that it didn't work correctly - the estimate of the treatment effect was biased. When we examined why, we saw that the bias was due to two major factors: the attenuation of slope that results from pretest measurement error, coupled with the initial nonequivalence between the groups. The problem is not caused by posttest measurement error, because of the criterion that is used in regression analysis to fit the line. It does not occur in randomized experiments because there is no pretest non-equivalence. We might also guess from these arguments that the bias will be greater with greater nonequivalence between groups: the less similar the groups, the bigger the problem. In real-life research, as opposed to simulations, you can count on measurement error on all measurements - we never measure perfectly. So in nonequivalent groups designs we now see that the ANCOVA analysis that seemed intuitively sensible can be expected to yield incorrect results.
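A minimal numpy sketch of the hierarchical-regression version of ANCOVA described above, on simulated data: Step 1 enters the pretest covariate, Step 2 adds a treatment dummy, and the F for the R-squared change tests the adjusted treatment effect. (In a real NEGD the pretest would contain measurement error, which is exactly what biases this adjustment.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
group = np.repeat([0, 1], n // 2)                     # treatment dummy
pretest = rng.normal(size=n) + 0.5 * group            # pre-existing group difference
posttest = 0.6 * pretest + 0.4 * group + rng.normal(size=n)

def r2(X, y):
    """R-squared from OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - (y - A @ b).var() / y.var()

r2_step1 = r2(pretest.reshape(-1, 1), posttest)               # covariate only
r2_step2 = r2(np.column_stack([pretest, group]), posttest)    # covariate + dummy

df_num, df_den = 1, n - 2 - 1
F_change = ((r2_step2 - r2_step1) / df_num) / ((1 - r2_step2) / df_den)
print("R^2 change for the treatment dummy:", round(r2_step2 - r2_step1, 3))
print("F for the R^2 change:", round(F_change, 2))
```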

METHODS - Factor Loadings - (Definition) Prior Comps Questions (1 question split up) 1) What is a factor loading? 2) How are loadings used in the interpretation of a factor analysis? 3) With an oblique rotation, loadings can be obtained from either the pattern matrix or the structure matrix. Which is more useful for interpretation? Why?

*RESPONSE TO Q1* *Factor loadings* (also called component loadings in PCA) are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that variable explained by the factor. To get the percent of variance in ALL the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables. (Note the number of variables equals the sum of their variances, as the variance of a standardized variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables (which gives the proportion of variance). *RESPONSE TO Q2* *How are loadings used?* The sum of the squared factor loadings for all factors for a given variable (row) is the variance in that variable accounted for by all the factors, and this is called the communality. In a complete PCA, with no factors dropped, this will be 1.0, or 100% of the variance. The ratio of the squared factor loadings for a given variable (row in the factor matrix) shows the relative importance of the different factors in explaining the variance of the given variable. Factor loadings are the basis for imputing a label to the different factors. *RESPONSE TO Q3* *With an oblique rotation, loadings can be obtained from either the pattern matrix or the structure matrix; which is more useful for interpretation?* If an oblique rotation is used (so that the factors themselves are correlated), several additional matrices are produced. 1) The factor correlation matrix contains the correlations among the factors. 2) The loading matrix from orthogonal rotation splits into two matrices for oblique rotation: a) A structure matrix of correlations between factors and variables. b) A pattern matrix of unique relationships (uncontaminated by overlap among factors) between each factor and each observed variable. 1) The meaning of the factors is ascertained from the pattern matrix. 3) There is some debate as to whether one should interpret the pattern matrix or the structure matrix following oblique rotation: a) The structure matrix is appealing because it is readily understood and shows the total % variance due to each factor (the sum of PV is greater than the total variance explained). However, the correlations between variables and factors are inflated by any overlap between factors. The problem becomes more severe as the correlations among factors increase, and it may be hard to determine which variables are related to a factor (T & F, 2007). b) The pattern matrix contains values representing the *unique contributions of each factor to the variance in the variables* (aka shows the "unique" % variance due to each factor; the sum of PV is less than the total variance explained). Shared variance is omitted (as it is with standard multiple regression), but the set of variables that composes a factor is usually easier to see. If factors are very highly correlated, it may appear that no variables are related to them because there is almost no unique variance once overlap is omitted. c) Most researchers interpret and report the pattern matrix rather than the structure matrix.
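A minimal numpy sketch of the loading arithmetic in the first two responses, with an invented loading matrix: squared loadings summed down a column give the variance explained by that factor; summed across a row they give the variable's communality.

```python
import numpy as np

loadings = np.array([[0.80, 0.10],
                     [0.75, 0.05],
                     [0.12, 0.70],
                     [0.08, 0.65]])

var_by_factor = (loadings ** 2).sum(axis=0)          # variance explained by each factor
pct_of_total = var_by_factor / loadings.shape[0]     # divide by the number of variables
communalities = (loadings ** 2).sum(axis=1)          # communality of each variable

print("variance explained by each factor:", var_by_factor.round(3))
print("proportion of total variance:     ", pct_of_total.round(3))
print("communalities:                    ", communalities.round(3))
```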

NOT STUDYING METHODS - Hierarchical Cluster Analysis Previous comps question: "How would you use hierarchical (agglomerative) cluster analysis to identify job families based on responses to an organization-wide job analysis questionnaire? Describe the basic data used, the agglomeration procedure, and the interpretation of results."

*NOT STUDYING HIERARCHICAL CLUSTER ANALYSIS*

METHODS - *Hierarchical* and *Stepwise Regression* Previous comps question: "Both hierarchical and stepwise regression can be used to build regression models. Describe and distinguish between these two approaches. What are the advantages and disadvantages of each approach? For what situations would each approach be most appropriate?"

Cohen et al. (2003) *Stepwise Multiple Regression* STEPS Also called statistical regression, this is a way of computing OLS regression in stages. 1) In stage one, the IV best correlated with the DV is included in the equation. 2) In the second stage, the remaining IV with the highest partial correlation with the DV, controlling for the first IV, is entered. 3) The process is repeated, at each stage partialling for previously entered IVs, until the addition of a remaining IV does not increase R-squared by a significant amount (or until all variables are entered, of course). Alternatively, the process can work backward, starting with all variables and eliminating IVs one at a time until the elimination of one makes a significant difference in R-squared. In SPSS, select Analyze, Regression, Linear; set the Method: box to Stepwise. (A numeric sketch of the forward procedure appears after this card.) *Stepwise Multiple Regression* THEORY Stepwise regression is used in the exploratory phase of research or for purposes of pure prediction, not theory testing. In the theory testing stage, the researcher should base selection of the variables and their order on theory, not on a computer algorithm. Menard (1995) writes, "there appears to be general agreement that the use of computer-controlled stepwise procedures to select variables is inappropriate for theory testing because it capitalizes on random variations in the data and produces results that tend to be idiosyncratic and difficult to replicate in any sample other than the sample in which they were originally obtained." Likewise, the nominal .05 significance level used at each step in stepwise regression is subject to inflation, such that the real significance level by the last step may be much worse, perhaps as high as .50, dramatically increasing the chances of Type I errors (Draper et al., 1979). Fox (1991) strongly recommends any stepwise model be subjected to cross-validation. *Stepwise Multiple Regression* PROBLEMS By brute-force fitting of regression models to the current data, stepwise methods can overfit the data, making generalization across data sets unreliable. A corollary is that stepwise methods can yield R-squared estimates which are substantially too high, significance tests which are too lenient (allowing Type I errors), and confidence intervals that are too narrow. Also, stepwise methods are even more affected by multicollinearity than regular methods. *Stepwise Multiple Regression* DUMMY VARIABLES Note that if one is using sets of dummy variables, the stepwise procedure should be performed by specifying blocks of variables to add. However, there is no automatic way to add/remove blocks of dummy variables; rather, SPSS will treat each dummy as if it were an ordinary variable. That is, if using dummy variables one must run a series of manually-created equations which add/remove sets of dummy variables as a block. ******************************************************** *Hierarchical Multiple Regression* STEPS Not to be confused with hierarchical linear models, hierarchical multiple regression is similar to stepwise regression, but the researcher, not the computer, determines the order of entry of the variables. F-tests are used to compute the significance of each added variable (or set of variables) to the explanation reflected in R-squared. This hierarchical procedure is an alternative to comparing betas for purposes of assessing the importance of the IVs.
1) How to compute: The IVs are entered cumulatively in a prespecified sequence and the R-squared and partial regression and correlation coefficients are determined as each IV joins the others. 2) Basic principles underlying the hierarchical order for entry are: Causal priority and the removal of confounding or spurious relationships, research relevance, and structural properties of the research factors being studied. 3) In practice, it is almost always preferable for the researcher to control the order of entry of the predictor variables. The double advantage of hierarchical methods over stepwise methods is that there is less capitalization on chance, and the careful researcher will be assured that results such as R-squared added are interpretable. Stepwise methods should be reserved for exploration of data and hypothesis generation, but results should be interpreted with proper caution. For any particular set of variables, multiple R and the final regression equation do not depend on the order of entry. Thus the regression weights in the final equation will be identical for hierarchical and stepwise analyses after all of the variables are entered. At intermediate steps, the B and beta values as well as the R-squared added, partial and semipartial correlations can be greatly affected by variables that have already entered the analysis.
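A minimal numpy sketch of the forward stepwise algorithm described in the STEPS section, on simulated data; the F-entry cutoff of 3.9 is only a rough stand-in for the nominal .05 criterion discussed above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 150, 5
X = rng.normal(size=(n, k))
y = 0.5 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(size=n)

def r2(cols):
    """R-squared of y regressed on the listed predictor columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - (y - A @ b).var() / y.var()

selected, remaining = [], list(range(k))
while remaining:
    base = r2(selected)
    best = max(remaining, key=lambda c: r2(selected + [c]))   # biggest R^2 gain
    new = r2(selected + [best])
    F = (new - base) / ((1 - new) / (n - len(selected) - 2))  # F for the increment
    if F < 3.9:            # roughly the .05 cutoff for F(1, large df)
        break
    selected.append(best)
    remaining.remove(best)

print("entry order:", selected, "final R^2:", round(r2(selected), 3))
```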

METHODS - Correlation Does Not Equal Causation Previous comps question: "A researcher has conducted a study on the relationship between self-reported stress and typing performance. The study reports a sig. neg. corr. for men (p < .01) but that the corr. for women is nonsig. (p = .3). From these results, can you conclude that the r-ship is stronger for men than for women? Explain." "Explain the statement 'correlation does not imply causation.' Discuss how both experimental procedures and statistical control can be used to strengthen causal inferences."

Correlation does not imply causation: if X corr. w/ Y, it could be that X causes Y, that Y causes X, or that X corr. w/ Z which corr. w/ Y (a spurious relationship). *X corr. with Y; we want to say x causes y, but can't, bc z may be causing both x and y. If you want to say x causes y, you must first rule out that y causes x and that there is not a common cause.* *Use experimental design to strengthen inferences via: control group, matching, random assignment, manipulation, control of environment (procedures are standardized; control the environment to minimize extraneous variables).* Ruling out a common cause: 1) Experimental control - control the environment in which the research happens. 2) Random assignment - control for extraneous variables. 3) Include z in the model - statistical control: measure z, include it in the model, remove its effect. (Options 1 and 3 assume z is known; a problem with 3 is that you have to include all relevant variables, which may be hard to do.) Statistical control: Measure confounding variables. 1) Need to know ahead of time what the important factors are based on theory and previous research. 2) Statistically partial out the effects of those variables and see what's left after removal of these influences. Statistical control and how to do it: 1) Partial correlation a) r_xy.z = the partial correlation bt x and y while controlling for z. b) What is the r-ship bt 2 variables before and after removing the effects of the other (holding the 3rd constant)? c) Ex. Interested in the r-ship bt ice cream sales and violent crimes: if we remove the effect of temp, what would the r-ship bt ice cream and violent crimes be? d) Another way: 1) Say ice cream sales = b0 + b1*temp + e1 (predict ice cream as a function of temp) 2) Crime = b0 + b1*temp + e2 (how well will temp explain crime); e2 is the residual variance in crime unrelated to temp (residual scores are a numerical way of expressing the Venn diagram; the error variance should be smaller than the total variance). e) Computing the partial correlation removes shared covariance and leaves you with unique covariance (the numerator reduces covariance, the denominator reduces variance). Partial corr changes both variance and covariance. Disadvantage of partial corr: it only gives the ability to predict residual variance, not total, so it may not reflect the r-ship with the outcome variable. 2) SEMIpartial correlation a) The r-ship bt 2 variables after the influence of a 3rd has been removed from only ONE of them. b) The corr bt x and y after the influence of z has been removed from x. c) The ability of x to predict y above the prediction gotten from z. d) Useful for incremental prediction, looking at actual rather than residual outcomes. e) r_y(x.z): x controlling for z, taking z out of x only. f) Represents the corr of a variable with the outcome above the other predictors. g) Full model: y' = b0 + b1x + b2z, with R²_y.xz h) Reduced: y' = b0 + b2z, with R²_y.z i) The difference bt these two R²s represents the unique contribution of x. *What is the difference bt partial and semipartial; how do you go about redefining variance?* 1) In partial, redefine the variance of both x and y (what the variance would be if z were held constant for both); the variance should shrink (USE WHEN INTERESTED IN STATISTICAL CONTROL). 2) In SEMIpartial, only redefine the variance of x (USE WHEN YOU WANT INCREMENTAL PREDICTION). Standardized regression coefficients (betas) are more similar to the SEMIpartial correlation (the same idea, just not taking the square root); another way to statistically control for z. *Statistical control* 1) Relative importance of predictors - which predictors are most important?
2) E.g., Y = .2*Introversion - .4*Tolerance - .4*Exercise. 3) Then look at the correlation between each variable and the criterion; you may find that exercise is the best predictor. 4) A regression coefficient does not represent relative importance; it tells us the unique contribution of a variable above the other variables in the equation. 5) Regression coefficients are interpreted as unique contributions, not relative importance. 6) In order to understand importance, you must look at both the reg coefficient and the zero-order correlation. *Significance of the difference bt two correlations from two independent samples* To compute the significance of the difference bt 2 correlations from ind. samples, such as a corr. for males vs a corr. for females, follow these steps: 1) Use a table of z-score conversions or convert the two correlations to Fisher z-scores. Note that if a correlation is negative, its z-value should be negative. 2) Estimate the standard error of the difference bt the two correlations as: SE = SQRT[1/(n1 - 3) + 1/(n2 - 3)], where n1 and n2 are the sample sizes of the two independent samples. 3) Divide the difference bt the two z-scores by the standard error. 4) If the z value for the difference computed in step 3 is 1.96 or higher, the difference in the correlations is significant at the .05 level. Use a 2.58 cutoff for significance at the .01 level. EXAMPLE: Let a sample of 15 males have a correlation of income and education of .60, and let a sample of 20 females have a correlation of .50. We wish to test if this is a significant difference. The z-score conversions of the two correlations are .6931 and .5493, respectively, for a difference of .1438. The SE estimate is SQRT[(1/12) + (1/17)] = SQRT[.1422] = .3770. The z value of the difference is therefore .1438/.3770 = .381, much smaller than 1.96 and thus not significant at the .05 level (Blalock, 1972).
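A minimal sketch reproducing the worked example above (r = .60 with n = 15 vs. r = .50 with n = 20) using the Fisher z procedure:

```python
from math import atanh, sqrt

def compare_correlations(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    z1, z2 = atanh(r1), atanh(r2)                 # Fisher z transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))        # SE of the difference
    return (z1 - z2) / se

z = compare_correlations(0.60, 15, 0.50, 20)
print(round(z, 3))   # about 0.38, well under 1.96, so not significant at .05
```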

PSYCHOMETRICS - IRT Models 1) What types of IRT models are there?

Dichotomous vs. polytomous 1) Poly -Typically model each response -Models exist for nominal responses and ordinal responses 2) Di -Only the correct response is modeled -All wrong responses are lumped together -Not appropriate for Likert items Unidimensional vs. multidimensional: 1) Multi -Are basically non-linear factor analysis models -Assume two or more underlying dimensions -Rarely used because of large data requirements 2) Uni -Are basically non-linear factor analysis models that assume a single latent factor -Are far more common

METHODS - Multicollinearity Previous Comps Question 1) What is multicollinearity? Describe in detail how multicollinearity affects the results and interpretation of a multiple regression analysis. 2) What problems does multicollinearity create for MR analyses? Is it a problem in other types of analyses (e.g., ANOVAs or Factor Analysis)? What can be done to overcome these problems?

Definition of multicollinearity:
1) With multicollinearity, the variables are very highly correlated (say, .90 and above)
a) For example, scores on the Wechsler Adult Intelligence Scale and scores on the Stanford-Binet Intelligence Scale are likely to be multicollinear because they are two similar measures of the same thing (T&F, 2007)
b) When variables are multicollinear, they contain redundant information and they are not all needed in the same analysis (T&F, 2007)
2) This is a problem that occurs in datasets in which one or more of the IVs is highly correlated with other IVs in the regression equation (T&F, 2007)
3) Logical problem: Unless you are doing an analysis of structure (factor analysis), it is not a good idea to include redundant variables in the same analysis. They are not needed and, because they inflate the size of error terms, they actually weaken an analysis (T&F, 2007)
4) Statistical problems of multicollinearity occur at much higher correlations (.90 and higher). The problem is that multicollinearity makes matrix inversion unstable. Matrix inversion is the logical equivalent of division; calculations requiring division cannot be performed on singular matrices because they produce determinants equal to zero that cannot be used as divisors. Multicollinearity often occurs when you form cross-products or powers of variables and include them in the analysis along with the original variables, unless steps are taken to reduce the multicollinearity (T&F, 2007).
*What problems does it create in analyses?*
1) FACTOR ANALYSIS. Very high intercorrelations may indicate a multicollinearity problem, and collinear terms should be combined or otherwise eliminated prior to factor analysis. Singularity in the input matrix, also called an ill-conditioned matrix, arises when 2 or more variables are perfectly redundant. Singularity prevents the matrix from being inverted and prevents a solution.
a) KMO statistics may be used to address multicollinearity in a factor analysis, or the data may first be screened using VIF or tolerance in regression. Some researchers require at least some correlations greater than .30 before conducting a factor analysis.
2) ANOVA. Unless you are dealing with repeated measures of the same variable (as in various forms of ANOVA), think carefully before including two variables with a bivariate correlation of, say, .70 or more in the same analysis (T&F, 2007). You might omit one of the variables or you might create a composite score from the redundant variables.
a) In most ANOVA designs, it is assumed the independents are orthogonal (uncorrelated, independent). This corresponds to the absence of multicollinearity in regression models. If there is such a lack of independence, then the ratio of the between to within variances will not follow the F distribution assumed for significance testing. If all cells in a factorial design have approximately equal numbers of cases, orthogonality is assured because there will be no association in the design matrix table. In factorial designs, orthogonality is assured by equalizing the number of cases in each cell of the design matrix, either through original sampling or by post-hoc sampling of cells with larger frequencies. Note, however, that there are other designs for correlated independents, including repeated measures designs, using different computations.
3) ANCOVA. Is sensitive to multicollinearity among the covariates and also loses statistical power as redundant covariates are added to the model.
Some researchers recommend dropping from the analysis any added covariate whose squared correlation with prior covariates is .50 or higher.
4) REGRESSION. Multicollinearity often occurs when you form cross-products or powers of variables and include them in the analysis along with the original variables, unless steps are taken to reduce the multicollinearity (T&F, 2007). MC occurs when highly related IVs are included in the same regression model (Cohen et al., 2003).
a) With MC, the determinant is not exactly zero, but is zero to several decimal places. Division by a near-zero determinant produces very large and unstable numbers in the inverted matrix. The sizes of the numbers in the inverted matrix fluctuate wildly with only minor changes in the sizes of the correlations in R. The portions of the multivariate solution that flow from an unstable inverted matrix are also unstable. In regression, for instance, error terms get so large that none of the coefficients are significant (Berry, 1993). For example, when r is .90, the precision of estimation of regression coefficients is halved (Fox, 1991).
b) MC can also lead to complexities in interpreting regression coefficients (Cohen et al., 2003)
1) In regression, MC is signaled by very large (relative to the scale of the variables) standard errors for regression coefficients (T&F, 2007). Berry (1993) reports that when r is .90, the standard errors for the regression coefficients are doubled; when MC is present, none of the regression coefficients may be significant because of the large size of the standard errors. Even tolerances as high as .5 or .6 may pose difficulties in testing and interpreting regression coefficients (T&F, 2007).
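A minimal sketch of the tolerance/VIF screening mentioned above, using invented predictors; VIF is computed directly from its definition, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictors; x3 is nearly a linear combination of x1,
# so it should show low tolerance / high VIF.
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.95 * x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared(y, X):
    """R-squared from an OLS regression of y on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    return 1 - resid.var() / y.var()

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2_j = r_squared(X[:, j], others)   # how well the other IVs predict IV j
    tol = 1 - r2_j                      # tolerance
    vif = 1 / tol                       # variance inflation factor
    print(f"x{j+1}: tolerance = {tol:.3f}, VIF = {vif:.1f}")
# Rules of thumb from these notes: very low tolerance or VIF > 10 flags MC.
```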

METHODS - Discriminant Analysis and Cluster Analysis Previous comps question: "Both discriminant analysis and cluster analysis are methods for understanding differences among groups. Compare and contrast these two types of analyses. When should each be used? Give a concrete example of how each could be used in the context of I/O Psychology"

Discriminant Analysis Model Overview: Discriminant Analysis (DA) is used to classify cases into the values of a categorical dependent, usually a dichotomy. If DA is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. There are EIGHT purposes for DA:
1) To classify cases into groups using a discriminant prediction equation.
2) To test theory by observing whether cases are classified as predicted.
3) To investigate differences between or among groups.
4) To determine the most parsimonious way to distinguish among groups.
5) To determine the percent of variance in the DV explained by the IVs.
6) To determine the percent of variance in the DV explained by the IVs over and above the variance accounted for by control variables, using sequential DA.
7) To assess the relative importance of the IVs in classifying the DV.
8) To discard variables which are little related to group distinctions.
DA has TWO steps: 1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and 2) if the F test shows significance, then the individual IVs are assessed to see which differ significantly in mean by group, and these are used to classify the DV.
DA shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships and untruncated interval or near-interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important IVs and exclusion of extraneous variables). DA also assumes the DV is a true dichotomy, since data which are forced into dichotomous coding are truncated, attenuating correlation.
DA is an earlier alternative to logistic regression, which is now frequently used in place of DA because it usually involves fewer violations of assumptions (IVs needn't be normally distributed, linearly related, or have equal within-group variances), is robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when the data are not normally distributed or group sizes are very unequal. However, DA is preferred when the assumptions of linear regression are met, since DA then has more statistical power than logistic regression (less chance of a Type II error - accepting a false null hypothesis). See also the separate topic on multiple discriminant function analysis (MDA) for DVs with more than 2 categories.
*"Isn't DA the same as cluster analysis?"* NO. In DA, the groups (clusters) are determined beforehand and the object is to determine the linear combination of IVs which best discriminates among the groups. In cluster analysis (CA), the groups (clusters) are not predetermined and in fact the object is to determine the best way in which cases may be clustered into groups.
Cluster Analysis Model Overview: Cluster Analysis (CA) seeks to identify homogeneous subgroups of cases in a population. That is, CA seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Hierarchical clustering allows users to select a definition of distance, then select a linking method for forming clusters, then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters in advance, then calculates how to assign cases to the k clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (e.g., > 1,000 cases).
Finally, two-step clustering creates pre-clusters, then it clusters the pre-clusters.
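A minimal sketch of the DA-vs-CA contrast, assuming scikit-learn is available and using a hypothetical I/O example (two selection-test scores for high vs. low performers): LDA is given the group labels and finds the discriminating combination of predictors, while k-means is given no labels and searches for clusters on its own.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical selection battery: two test scores for two performance groups.
n_per_group = 100
low = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_group, 2))
high = rng.normal(loc=[1.5, 1.0], scale=1.0, size=(n_per_group, 2))
X = np.vstack([low, high])
y = np.array([0] * n_per_group + [1] * n_per_group)  # known groups (DA needs these)

# Discriminant analysis: groups are known in advance; find the linear
# combination of predictors that best separates them, then classify.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("DA classification accuracy:", lda.score(X, y))
print("Discriminant function weights:", lda.coef_)

# Cluster analysis: no labels supplied; k-means searches for k homogeneous
# subgroups that minimize within-cluster variation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(kmeans.labels_))
```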

METHODS - Type I Error Rate If you conduct a single hypothesis test with an alpha of .05 your Type I error rate is .05. Conceptually, what is the family-wise Type I error rate over 10 such tests? To control the family-wise error rate would you: 1) increase alpha, 2) decrease alpha, or 3) do something else? Explain.

Family-wise Type I Error Rate (FWER) is the probability of making one or more false discoveries (Type I errors) among the hypotheses. A Type I error is the kind of error where you reject the null hypothesis when in fact it is true, and its probability is designated alpha. In multiple comparison procedures, the FWER is the probability that, even if all samples come from the same population, you will wrongly conclude that at least one pair of populations differs. If alpha is the probability of a comparison-wise Type I error, then the FWER is usually calculated as: alpha(FW) = 1 - (1 - alpha)^c, where alpha(FW) is the FWER, alpha is the alpha rate for an individual test (almost always .05), and c is the number of comparisons. c is an exponent, so the parenthetical value is raised to the cth power.
ANSWER TO QUESTION: *If we conducted 10 tests, there is about a 40% chance of at least one of the tests being significant even when the null is true.* To control the FWER, we:
*1)* WOULD NOT increase alpha; this would increase our family-wise alpha even more!
*2)* COULD decrease alpha, as that reduces the FWER. If we decrease alpha to .01, there is about a 9-10% chance of at least one of the tests being significant even when the null is true - still high, but much smaller than the original FWER of about 40% with 10 comparisons.
*3)* COULD do something else (e.g., apply a correction such as the Bonferroni). The Bonferroni simply calculates a new per-test alpha to keep the family-wise alpha at .05 (or another specified value). The formula is: alpha(Bonferroni) = alpha(desired FW)/c, in which *alpha(Bonferroni)* is the new alpha that should be used to evaluate each comparison or significance test, *alpha(desired FW)* is the family-wise alpha you want to maintain (usually .05), and *c* is the number of comparisons (statistical tests).
a) ADVANTAGE: The Bonferroni is probably the most commonly used correction, because it is highly flexible, very simple to compute, and can be used with any type of statistical test (e.g., correlations) - not just post-hoc tests with ANOVA.
b) DISADVANTAGE: The traditional Bonferroni, however, tends to lack power. The loss of power occurs for several reasons: 1) the FWER calculation depends on the assumption that the null hypothesis is true for all tests, which is unlikely to be the case, especially after a significant omnibus test; 2) all tests are assumed to be orthogonal (i.e., independent or non-overlapping) when calculating the FWER, and this is usually not the case when all pairwise comparisons are made; 3) the test does not take into account whether the findings are consistent with theory and past research - if consistent with previous findings and theory, an individual result should be less likely to be a Type I error; and 4) Type II error rates are too high for individual tests. In other words, the Bonferroni overcorrects for Type I error, which is its major weakness.
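The percentages above can be checked directly; a quick sketch:

```python
# Family-wise error rate for c independent tests at per-test alpha,
# and the Bonferroni-corrected per-test alpha, as defined above.
def fwer(alpha, c):
    return 1 - (1 - alpha) ** c

def bonferroni_alpha(familywise_alpha, c):
    return familywise_alpha / c

print(fwer(0.05, 10))              # ~0.401 -> the "about 40%" figure for 10 tests
print(fwer(0.01, 10))              # ~0.096 -> the "about 9-10%" figure
print(bonferroni_alpha(0.05, 10))  # 0.005 per test keeps family-wise alpha near .05
```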

PSYCHOMETRICS - IRT (parameters and models) Previous comps question: "Describe the c parameter of the 3 parameter logistic model, including the definition, interpretation, and the effect on information. Briefly discuss how item writers can avoid a high c parameter"

Pi(θ) = ci + (1 - ci) / (1 + e^(-D·ai(θ - bi)))  (MARISA: have me write it out)
"Theta" is ability, generally expressed as a standard normal deviate, roughly -3 to 3.
"a" is the slope of the ascending portion of the ICC, usually 0 < a < 3. ANALOGOUS to the CTT biserial item-total correlation and to a loading in factor analysis.
"b" is the threshold or difficulty (or location), on the same metric as theta. ANALOGOUS to the proportion correct in CTT.
"c" is the "pseudo-guessing" lower (left) asymptote, strictly 0 <= c < 1.
"D" is a constant of about 1.7 which is used to scale the logistic model so that it is comparable to the normal CDF.
DEEP DIVE INTO THE C PARAMETER: WHAT IS C?
-The parameter c is called the pseudo-chance-level parameter. This parameter provides a (possibly) non-zero lower asymptote for the item characteristic curve and represents the probability of examinees with low ability answering the item correctly (Hambleton et al., 1991).
-The parameter c is incorporated into the model to take into account performance at the low end of the ability continuum, where guessing is a factor in test performance on multiple-choice items.
-Typically c assumes values that are smaller than the value that would result if examinees guessed randomly on the item.
-As Lord (1974) noted, this phenomenon can be attributed to the ingenuity of item writers in developing attractive but incorrect choices. For this reason, c should not be called the "guessing parameter."
-Estimates of the lower asymptote from the 3PL model often differ from the random-guessing probability. For example, if examinees can systematically eliminate implausible distracters, selecting the correct answer from the remaining alternatives will have a higher probability than random guessing (Embretson & Reise, 2000).
-As the ci parameter becomes larger, the information function becomes lower (i.e., guessing makes test scores less precise) and more asymmetric (i.e., relatively lower information for lower abilities). So... as c goes up, the IIF (item information function) goes down and becomes more asymmetric (test scores are less precise).
WHY 3PL?
-The 3PL model adds a parameter to represent an item characteristic curve that does not fall to zero; e.g., when an item can be solved by guessing, as with MC items, the probability of success is substantially greater than zero even for low trait levels. The 3PL model accommodates guessing by adding a lower-asymptote parameter, c.
-The 3PL model with unique lower asymptotes for each item can lead to estimation problems. To avoid such estimation problems, a common lower asymptote is estimated for all items or for groups of similar items (E&R, 2000).
-In the 1PL and 2PL models, the IIF is maximized at the point of inflection of the ICC (where the probability of answering the item correctly is .50; this point corresponds to the difficulty parameter). However, in the 3PL model the IIF is maximized at a level slightly above the item-difficulty parameter, because the point of inflection of the ICC no longer corresponds to a probability of .50 - the inflection point is shifted by the lower asymptote.
-In the 3PL model, the maximum amount of item information therefore occurs at a trait level slightly above the item-difficulty parameter. The effect of the guessing (c) parameter is to lower the amount of psychometric information an item provides.
HOW CAN WE AVOID A HIGH C PARAMETER?
-The goal is to minimize the effect of guessing; we want the c parameter to be minimized. In order to accomplish this, we want our c parameter to be lower than chance.
This means that one of the incorrect options included in the item should be highly attractive, so that individuals with low ability are more likely to choose this option than the correct answer (c < chance).
-On the other hand, if the alternatives included in the item are obviously incorrect choices and individuals with low ability can easily eliminate them, the guessing parameter will be above chance (c > chance). This should be avoided in order to decrease the impact of the guessing parameter.
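A small numerical sketch of the points above, with arbitrary item parameters: it evaluates P(θ) from the 3PL formula at the top of this section, obtains item information from the general dichotomous-model identity I(θ) = [P'(θ)]² / [P(θ)(1 - P(θ))], and shows that raising c lowers the information an item provides.

```python
import numpy as np

D = 1.7  # scaling constant from the 3PL formula above

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    logistic = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic

def info_3pl(theta, a, b, c):
    """Item information via I = P'^2 / (P * (1 - P)) for a dichotomous item."""
    p = p_3pl(theta, a, b, c)
    logistic = (p - c) / (1.0 - c)                         # logistic core of P
    dp = (1.0 - c) * D * a * logistic * (1.0 - logistic)   # dP/dtheta
    return dp ** 2 / (p * (1.0 - p))

theta = np.linspace(-3, 3, 601)
a, b = 1.2, 0.0
for c in (0.0, 0.10, 0.25):
    info = info_3pl(theta, a, b, c)
    peak = theta[np.argmax(info)]
    print(f"c = {c:.2f}: max info = {info.max():.3f} at theta = {peak:+.2f}")
# Output pattern: as c increases, maximum information drops and the peak
# sits slightly above the difficulty parameter b.
```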

PSYCHOMETRICS - How to choose which IRT model? 1) Name the 3 criteria

1) Several criteria can be applied to determine which model is best (E&R, 2000), including: 1) the weights of items for scoring (equal vs. unequal); 2) the desired scale properties for the measure; 3) fit to the data.
**IRT provides models of the data. Models are always abstractions of reality. If the abstractions are too far removed from the reality, then any conclusions drawn from the models are, essentially, invalid.
**Thus, it is critical to assess the fit between the model and the data.
**Fit is generally assessed through some sort of residual analysis or by testing an assumption:
a) Examine the estimation iteration history and assess the reasonableness of the parameter estimates (when IRT estimation "goes badly" there is some tendency for really ridiculous results to occur).
b) Root Mean Square Error: RMSE = sqrt( sum((observed - expected)^2) / n ). Extremely general and handy for quantifying residuals; sometimes difficult to interpret because it is in the metric of the parameter.
c) Chi-square goodness of fit. Only good for medium to long tests (n >= 20 items); dependent on sample size.
d) "Fit" plots.
IF YOU CAN'T REMEMBER ALL THE ABOVE, this is the summary of how to choose a model:
1) You might simply prefer one model, perhaps due to a modeling issue (e.g., you want to model guessing).
2) You may statistically test the incremental fit of additional parameters and thereby determine which model fits best, by evaluating the significance of the change in the log-likelihood function as a chi-square with df equal to the difference in the number of parameters (generally the # of items).
3) Look at residual plots; if residuals decrease as you include additional parameters, this supports the higher model (2PL or 3PL). The point is... LOOK AT FIT PLOTS. The difference between predicted and observed should be small.
4) Small samples may force you to adopt a simpler model or to use very strong Bayesian priors to "fix" some parameters rather than estimate them.
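A minimal sketch of the likelihood-ratio comparison in point 2; the log-likelihood values are invented, standing in for the results of fitting nested IRT models (e.g., 1PL vs. 2PL) to the same responses.

```python
from scipy.stats import chi2

# Hypothetical log-likelihoods from fitting nested models to the same data.
loglik_1pl = -10450.0   # 1PL: one common slope + 20 difficulty parameters
loglik_2pl = -10410.0   # 2PL: adds 19 extra slope parameters (one per item)
extra_params = 19

lr_stat = -2 * (loglik_1pl - loglik_2pl)        # change in -2 log-likelihood
p_value = chi2.sf(lr_stat, df=extra_params)     # upper-tail chi-square probability

print(f"LR chi-square = {lr_stat:.1f}, df = {extra_params}, p = {p_value:.4f}")
# A small p-value favors the more complex model (here, the 2PL).
```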

PSYCHOMETRICS - SEM 1) In IRT, SEM is _________ across range of theta. 2) SEM gets _________ as I(theta) is larger. 3) If I(theta) is bell-shaped, SEM will be ____________.

1) not uniform 2) smaller 3) u-shaped
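A tiny sketch of the identity behind these answers, SEM(θ) = 1/√I(θ), using a made-up bell-shaped information function:

```python
import numpy as np

theta = np.linspace(-3, 3, 7)
info = 8 * np.exp(-theta ** 2)      # made-up, bell-shaped test information
sem = 1 / np.sqrt(info)             # SEM(theta) = 1 / sqrt(I(theta))

for t, s in zip(theta, sem):
    print(f"theta = {t:+.0f}  SEM = {s:.2f}")
# SEM is smallest where information peaks (theta = 0) and grows toward the
# extremes -- i.e., not uniform, smaller where I(theta) is larger, U-shaped.
```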

METHODS - Factor Analysis of Job Analysis Ratings (Cranny & Doherty, 1988)

A common practice in JA involves having SMEs provide importance ratings for the behaviors identified for a given job, and then grouping those behaviors by PFA - this is inappropriate. PFA has been inappropriately used to assess the differences between individuals on JA data, and results involving the importance of tasks have tended to be spurious. PFA should be used to reduce data; FA used to assess differences bt individuals based on JA data is inappropriate.
1) Principal Components Model: By far the most common form of factor analysis, PCA seeks a linear combination of variables such that the maximum variance is extracted from the variables. It then removes this variance and seeks a second linear combination which explains the maximum proportion of the remaining variance, and so on. This is called the principal axis method and results in orthogonal (uncorrelated) factors. PCA analyzes total (common and unique) variance. Example: component 1 extracts the maximum variance, then component 2 extracts the maximum of the remaining variance. Results in uncorrelated factors.
2) Principal Factor Model: Also called principal axis factoring (PAF) and common factor analysis, PAF is a form of factor analysis which seeks the least number of factors which can account for the common variance (correlation) of a set of variables, whereas the more common PCA in its full form seeks the set of factors which can account for all the common and unique (specific plus error) variance in a set of variables. PAF/PFA uses a PCA strategy but applies it to a correlation matrix in which the diagonal elements are not 1's, as in PCA, but iteratively derived estimates of the communalities (the R-squared of a variable using all factors as predictors). Example: it considers only common variance and seeks the smallest number of factors which can explain that common variance.
Factor analysis of importance ratings is inappropriate for the following reasons:
1) Factors represent common patterns of DISAGREEMENT in assessments of importance; the only way there can be a correlation among importance ratings is if raters disagree about the importance of the items being rated. It tells us nothing about agreement.
2) Items with absolute agreement will exhibit little to no variance, cannot correlate with other behaviors, and will not load on any factor - yet these are the items which are of most concern. Variance in importance ratings means disagreement about importance and is undesirable.
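A small numerical illustration of point 2, with an invented rating matrix: items every SME rates identically have zero variance, so their correlations with the other items are undefined, and they drop out of any correlation-based grouping even though they are the behaviors the SMEs agree are most important.

```python
import numpy as np

# Rows = 6 SMEs, columns = 4 task-importance ratings (1-5 scale, hypothetical).
# Items 1 and 2 show disagreement; items 3 and 4 get a unanimous "5".
ratings = np.array([
    [2, 4, 5, 5],
    [3, 2, 5, 5],
    [5, 1, 5, 5],
    [1, 5, 5, 5],
    [4, 2, 5, 5],
    [2, 3, 5, 5],
], dtype=float)

print("Item SDs:", ratings.std(axis=0))          # items 3 and 4 have SD = 0
with np.errstate(invalid="ignore", divide="ignore"):
    R = np.corrcoef(ratings, rowvar=False)
print(R)  # rows/columns for the zero-variance items come out as nan --
          # correlation-based factoring can say nothing about them.
```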

METHODS - A consultant has conducted a job analysis survey in which task importance ratings were obtained from 500 parole officers from a large metropolitan police department. In order to make the results more manageable, the consultant conducted a principal components analysis (PCA) that identified 5 key job dimensions. These dimensions were then used to identify the key KSAOs required by the job. Cranny & Doherty (1988) argue that factor analysis is inappropriate with this kind of data. Why does this type of data create problems for factor/component analysis? Would you have the same problem if the survey assessed job satisfaction?

What Cranny & Doherty say (straight from the 1988 article): There are *four sources of variance in importance ratings*:
1) Differences in the job content with which different SMEs are familiar
2) Differences in value systems among SMEs
3) Response bias differences among SMEs
4) Unreliability
Thus, correlations computed on such data can tell us nothing about similarities or differences among behaviors on any relevant basis, but only about the extent to which SMEs (who, for job description purposes, should agree) share similar patterns of disagreement. That is, the sources of variance necessary to support the desired interpretation of factors or clusters as important dimensions of job behavior are not present in data that consist of importance ratings by SMEs for a single job. It is this fact, and not merely the statistical properties of the correlation coefficient (discussed later), that is our major point.
*Items with no variance*: Of course, the SMEs can have uniform values in some areas and different value patterns in others. Those behaviors that SMEs agree are important will exhibit little or no variance in importance ratings, and cannot correlate highly with other behaviors, regardless of content. The same is true for items that SMEs agree are unimportant. By virtue of the fact that all raters agree that they are very important, items 3 and 4 of our hypothetical dataset have no variance, correlate zero with every other item, and will be dropped from further consideration because they will load on no factor. Items 4 and 5 ought to be dropped and will be dropped, but for the wrong reason. So those behaviors with which we should be concerned - those that our SMEs agree are very important - are certain to be lost if we depend on the results of correlational analyses of the kind cited and presented in their pure form here.
As Schmitt (1987) pointed out, there is a need for much more research on which to base recommendations about methods of linkage between job analysis and criterion measures and between job analysis and hypotheses about predictors. A number of methods of grouping job behaviors are available. The grouping could be done by the job analyst into functional categories or on the basis of hypothesized common skills or abilities. Or perhaps the important behaviors could be allocated and retranslated to categories by SMEs (Smith & Kendall, 1963). A number of other empirical approaches are also available. Similarity ratings could be obtained directly from SMEs (or with sorting procedures), and these data could be subjected to multidimensional scaling, cluster analysis, or covariance algorithms such as factor analysis. Undoubtedly other techniques might apply as well, but the use of factor analysis of importance ratings in job analysis is totally inappropriate, as are any techniques that rely on covariances among item importance ratings. The use of such techniques in this manner reflects an unfortunate tendency to use the most powerful statistical techniques available without careful consideration of whether they are appropriate for the purpose at hand. Whatever specific methods are used, the users should keep clearly in mind the elementary meaning of the numbers.
Johnson (2000): An article by Cranny and Doherty has inadvertently created a common perception among researchers that factor analysis should never be applied to importance ratings collected in a job analysis.
The author argues that factor analysis of job analysis importance ratings is appropriate ONLY under the following conditions:
1) Meaningful differences exist bt the jobs or positions analyzed
2) Each variable has some interrater agreement within jobs and some variance across jobs or positions
3) The research objective is to find a set of factors that reflect shared patterns in the extent to which the tasks, behaviors, or characteristics are relevant to different positions
*WHAT SHOULD BE DONE???*
1) Discard behaviors with low means
2) Items w/ HIGH means and LOW SDs represent job content SMEs agree about and should be grouped
3) Best to use logical rather than statistical groupings
4) Collect similarity ratings and do cluster analysis, FA, or multidimensional scaling on those

PSYCHOMETRICS - 2 PL Model Describe in detail and note what is different from 3 PL Model and 1 PL Model

The 2PL model is the same as the 3PL except that now c = 0, so there is no guessing parameter in the model (the assumption is that it is unimportant, undesirable, or impossible to model guessing behavior).
-The slope parameters are allowed to vary, and in this sense the 2PL is basically a non-linear factor analysis model in which the slopes are proportional to factor loadings. In contrast, the 1PL also sets c = 0 but constrains all slopes to be equal, so items differ only in difficulty.
-The 2PL model is appropriate for measures in which items are not equally related to the latent trait. Or, stated from another perspective, the items are not equally indicative of a person's standing on the latent trait (E&R, 2000).
-In the 1PL and 2PL models, the amount of information an item provides is maximized around the item-difficulty (b) parameter. Consequently, items that are matched in difficulty to examinee ability are maximally informative (E&R, 2000).

PSYCHOMETRICS - Reliability approaches Previous comps question: "Some texts define reliability as the correlation of a test given at T1 and T2. Is this a good or poor definition of reliability? Explain. Under what conditions would this correlation not provide a good index of reliability?"

Reliability was defined using CTT: ρxx' = σ²T / σ²X = 1 - σ²E / σ²X. What are those terms? Which terms are observable? Because not all of those terms are observable, reliability itself is not observable. Therefore, we must estimate reliability, and there are many different ways (even basic ways, let alone more advanced techniques) to estimate it. In general, there is no one best way to estimate it; each method has pros and cons.
METHODS FOR ESTIMATING RELIABILITY: There are generally 3-4 kinds of reliability estimates:
1) Interrater/interobserver agreement - we will only have a few words for this type of reliability
2) Retest methods - based on the idea of independent testings
3) Internal consistency methods - based on CTT definitions
4) Parallel forms
Deep dive into each:
1) Interrater: This method is applicable only to tests that have raters. Allen & Yen are mum about this type of reliability because it has only been studied as a psychometric topic in the past 20 years. The rise of interest in portfolio assessment and other so-called authentic assessments has led to a relatively recent psychometric focus on using human raters. Hopefully the addition of a constructed response improves an exam's validity considerably, because it apparently can lower the reliability.
*Rosa et al. (2001) report constructed-response reliabilities that vary from .29-.80.
*Page (2003) reports that the avg correlation between human graders of Praxis writing assessments was .65.
*Landauer et al. (2003) and Keith (2003) report higher correlations, in the range .75-.89.
2) Test-retest: This method involves administering a test twice with some period of time, the retest period, in between administrations. The retest period may be seconds or years, but weeks or months are probably most common. This method is not only common in practice, but some texts actually present it as THE method of estimating reliability, or even as the definition of reliability. Why would this be? The chief advantages of this method are its simplicity and obviousness: it's not hard to explain or implement. Test-retest has several causes for concern or possible pragmatic problems:
1) Maturation effects: If the measured trait changes during the retest period, the reliability can be under-estimated. This occurs when there is a ceiling/floor effect or when the rate of maturation varies across participants, resulting in a different ordering among the pps at T1 and T2.
2) Practice effects: Even if no actual maturation occurs, the scores may change at T2 simply because the pps had the experience of taking the test previously. E.g., they may be able to pace themselves better. Or, on an attitude scale, the introspection required to answer the measure may change their trait standing or the reliability of the scale (Knowles, 2000). Or they may be bored. Again, this has an effect only to the extent that there are floor/ceiling effects or the practice effects result in reordering the pps (but note that changes in the reliability at T2 will definitely affect the ordering).
3) Retention effects: If the pps recall the items or their answers at T2, this may cause practice effects. But even worse, the pps may present the same responses because they recall their previous responses, which will induce serious correlations in the items' error terms. What effect will this have? What are some reasons why people would provide the same responses on cognitive or noncog measures?
4) Dropouts and logistics problems: Getting all the pps to re-take the test strictly for research purposes may be difficult. If some test-takers are not present at both times, they must be dropped from the sample, with unknown effects on the estimate.
5) Choice of retest period: The choice of retest period is beyond the scope of psychometric theory, and yet this choice has a definite effect on the resulting estimate. Maybe the best advice is to gather data after different retest periods, or to match the retest period to an interval of importance according to some external considerations. E.g., if high school seniors take college entrance exams, then a retest period of 12-24 months might make most sense because the seniors will be entering college and completing (or failing to complete) their first year within that time frame.
4) Alternate/Parallel Forms: This method goes by both "alternate forms reliability" and "parallel forms reliability," although some people use "alternate forms" to imply that the two forms are not strictly parallel. As the name implies, pps complete two forms of a test that are parallel, or as parallel as possible. There may be an interval between the two administrations, in which case the same considerations as for test-retest apply. This method may address RETENTION EFFECTS, although alternate forms of some kinds of tests could contain similar items. E.g., one Neuroticism item might be "I anger easily." It's not hard to imagine that a person who agreed with this question and who wanted to appear consistent might feel pressure to disagree with a question on an alternate form like "I don't get upset easily." This method typically counterbalances the administration order to minimize practice effects. Note that this just distributes the practice effect randomly across the first or second admin, which might restrain mean differences but will tend to affect the ordering of test-takers across the 2 administrations (thus lowering the reliability). A major issue is the parallelism of the forms. To the extent that the forms have diff reliabilities, the avg reliability of the two forms will be estimated rather than the reliability of either test. To the extent that the content of the forms is different (i.e., they literally measure diff things), the reliability will be under-estimated - perhaps severely. Alternate forms does not address the issue of needing to administer a test twice and, in fact, can introduce new issues (e.g., fatigue, motivation) if the 2 forms are administered concurrently (in the same session).
Split Halves: This method creates two forms from a single form by splitting the actual form into 2 halves. Unfortunately, diff methods of splitting produce diff results, and any method of splitting may be used. A&Y describe 3 methods. Because you want these halves to be parallel forms, the best method is probably what A&Y call matched random subsets, but that method is probably least common in practice (e.g., SPSS just splits a test into first and second halves).
"DESCRIBE A PURE POWER TEST AND PURE SPEED TEST AND THE CONCEPT OF SPEEDEDNESS" "WHAT WOULD BE THE CORRELATION OF ODD-EVEN ITEMS FOR A PURE SPEED TEST? WHAT ABOUT A FIRST V. SECOND HALF SPLIT? WHY?"
If you simply correlate the 2 halves, you get an estimate of the reliability of half the test. To get an estimate of the reliability of the entire test, you must estimate the reliability of the composite of the two halves.
These methods (A&Y discuss Spearman-Brown and coefficient alpha - the two agree closely, or exactly if the halves are tau-equivalent) assume that the 2 halves are essentially tau-equivalent or parallel.
1) Coefficient alpha and related: Coefficient alpha is said to be the mean of all possible split-half estimates. It is a measure of error variability based on the consistency of (i.e., covariance among) the items. If the items covary positively to a high degree, then coefficient alpha will be close to 1. If the items do not covary positively, then alpha may take on a negative sign (and OF COURSE one would reverse-score any reversed items before applying alpha). Alpha is flexible: it can be applied to testlets or to items, and the terminology generally used is "components" or "parts." Whatever the nature of the parts, alpha assumes that they are all at least tau-equivalent, if not parallel. To the extent that the parts are not all tau-equivalent, alpha is a lower bound on true reliability (i.e., alpha generally under-estimates actual reliability).
A&Y describe the Kuder-Richardson formulae KR20 and KR21. KR20 is a special case of coefficient alpha where the items are dichotomous (i.e., most of the time). KR21 is a special case of KR20 where all the items have the same difficulty; KR21 is not very useful in practice. When can a test be composed of all items with the same difficulty? If tests/items do not have the same difficulty, can they be parallel or tau-equivalent?
The Spearman-Brown prophecy formula allows one to estimate the reliability of a composite from the reliability of its parts. The most common use of SB is to estimate the effect of lengthening or shortening a test. If you have an observed reliability estimate and a target reliability, you can rewrite the equation to solve for N (the lengthening factor) as a function of both. So let's say you have a 10-item test with a reliability of .50 and you wish to have a reliability of at least .75; how much longer must the test be? N = 3, i.e., the test must be tripled to 30 items, so 20 items must be added (see the sketch below).
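A minimal sketch of coefficient alpha and the Spearman-Brown calculations, using simulated item data; it reproduces the worked example in which a reliability of .50 requires a lengthening factor of N = 3 to reach .75.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: items is an (n_people x k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(rho, n):
    """Reliability of a test lengthened by a factor of n."""
    return n * rho / (1 + (n - 1) * rho)

def lengthening_factor(rho_obs, rho_target):
    """Solve Spearman-Brown for the factor n needed to reach rho_target."""
    return rho_target * (1 - rho_obs) / (rho_obs * (1 - rho_target))

# Simulated 10-item scale (hypothetical data) just to exercise alpha.
rng = np.random.default_rng(3)
true_score = rng.normal(size=(500, 1))
items = true_score + rng.normal(scale=1.5, size=(500, 10))
print(f"alpha = {cronbach_alpha(items):.2f}")

# Worked example from the notes: .50 -> .75 requires tripling the test.
n = lengthening_factor(0.50, 0.75)
print(n, spearman_brown(0.50, n))   # 3.0 and 0.75
```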

METHODS - Confidence Interval & Correlation Coefficient Previous comps questions: 1) What is a CI? 2) How would you build a CI from an observed correlation coefficient? What does it mean to say that the CI ranges from .1 to .5? 3) How is the CI similar to and different from a test of statistical significance?

*1) Definition of CI:* The interval within which the population value of the statistic is expected to lie with a specified confidence (1-alpha). If repeated samples were taken from the population and a CI for the mean were constructed from each sample, (1-alpha) of them would contain the population mean (Cohen et al., 2003).
*2) Building a CI from an observed correlation coefficient:*
a) Note: the lower and upper limits of a CI for r_xy do not fall at equal distances from the obtained sample value.
b) To find the CI for a sample r, transform r to z' (Fisher's transformation) and, using the SE of z' and the appropriate multiplier for the size of the CI desired, find the ME (margin of error) and then the lower and upper limits of the CI for z'. Then transform them back to r.
c) Using the multiplier 1.96 from the normal distribution for the 95% limits, we find 1.96(SE of z') = ME for z', so z' +/- ME gives the 95% limits for z'. But what we want are the 95% limits for r, so using a table you transform these z' values back to r and obtain the confidence limits for the observed r. For example, if you obtain confidence limits of .22 and .80, you can expect with 95% confidence that the population r is included in the approximate CI of .22 to .80. (See the sketch below.)
*3) How is the CI similar to and different from a test of statistical significance?*
a) Most behavioral scientists employ a hybrid of classical Fisherian and Neyman-Pearson null hypothesis testing, in which the probability of the sample result given that the null hypothesis is true, p, is compared to a prespecified significance criterion, alpha. If p < alpha, the null hypothesis is rejected and the sample result is deemed statistically significant at the alpha level of significance (Cohen et al., 2003).
1) A more informative way of testing hypotheses is through the use of CIs. Here an interval is developed around the sample result that would theoretically include the population value (1-alpha)% of the time in repeated samples (Cohen et al., 2003).
a) The lower and upper limits of the CI show explicitly just how small and how large the effect size in the population might be. If the population value specified by the null hypothesis is not contained in the CI, the null hypothesis is rejected (Cohen et al., 2003).
b) The CI also provides a rough and easily computed index of power, with narrow intervals indicative of high power and wide intervals of low power (Cohen, 1994).
b) Reichardt & Gollob (1997): Under conditions equally favorable to both statistical tests and CIs, statistical tests are shown generally to be more informative than CIs when assessing the probability that a parameter (a) equals a prespecified value, (b) falls above or below a prespecified value, or (c) falls inside or outside a prespecified range of values. In contrast, CIs are shown generally to be more informative than statistical tests when assessing the size of a parameter (a) without reference to a prespecified value or range of values or (b) with reference to many prespecified values or ranges of values.
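A minimal sketch of step 2, with a made-up sample r and n: Fisher's z' = arctanh(r), SE = 1/sqrt(n - 3), and a back-transform with tanh, i.e., the table-free version of the procedure described above.

```python
import numpy as np

def ci_for_r(r, n, z_mult=1.96):
    """95% (by default) confidence interval for a Pearson correlation."""
    z_prime = np.arctanh(r)          # Fisher r-to-z transformation
    se = 1 / np.sqrt(n - 3)          # standard error of z'
    lo, hi = z_prime - z_mult * se, z_prime + z_mult * se
    return np.tanh(lo), np.tanh(hi)  # back-transform the limits to the r metric

# Hypothetical example: r = .30 observed in a sample of n = 100.
lo, hi = ci_for_r(0.30, 100)
print(f"95% CI for r: [{lo:.2f}, {hi:.2f}]")
# Note the interval is not symmetric around .30, as stated in point 2a.
```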

CHOOSING NOT TO STUDY METHODS - Canonical Correlation & Redundancy Coefficient Previous comps question: "Employee selection decisions are often based on a battery of test scores, rather than a single test. Similarly, job performance is best characterized as multidimensional, preventing the use of single criterion measure. Discuss how you could use canonical correlation to model predictor-criterion relationships in this situation. Describe the steps in running and interpreting a canonical correlation analysis, included the specific statistics that you would interpret" "Explain the difference between the canonical correlation and the redundancy coefficient from a canonical correlation analysis. Under what conditions will they produce different results?"

*CHOOSING NOT TO STUDY*

METHODS - Measurement Error in Regression Previous comps question: "How does measurement error impact the results of multiple regression? Be sure to discuss the impact of measurement error in both the predictors and the criterion." "How does measurement error affect the results of both simple and multiple regression analysis? Discuss the effect of measurement error in the predictors and measurement error in the outcome variable."

*Measurement Error*. The true score theory is a good simple model for measurement, but it may not always be an accurate reflection of reality. In particular, it assumes that any observation is composed of the true value plus some random error value. But is that reasonable? What if all error is not random? Isn't it possible that some errors are systematic, that they hold across most or all of the members of a group? One way to deal with this notion is to revise the simple true score model by dividing the error component into two subcomponents, random error and systematic error. Here we will look at the differences between these two types of errors and try to diagnose their effect on our research.
1) What is random error, i.e., NOISE? Random error is caused by any factors that randomly affect measurement of the variable across the sample. E.g., each person's mood can inflate or deflate their performance on any occasion. In a particular testing, some children may be feeling in a good mood and others may be depressed. If mood affects their performance on the measure, it may artificially inflate the observed scores for some children and artificially deflate them for others. The important thing about random error is that it does not have any consistent effect across the entire sample. Instead, it pushes observed scores up or down randomly. This means that if we could see all of the random errors in a distribution they would have to sum to 0 - there would be as many negative errors as positive ones. The important property of random error is that it adds variability to the data but does not affect average performance for the group. Because of this, random error is sometimes considered noise.
2) What is systematic error, i.e., BIAS? Systematic error is caused by any factors that systematically affect measurement of the variable across the sample. E.g., if there is loud traffic going by just outside a classroom where students are taking a test, this noise is liable to affect all of the children's scores - in this case, systematically lowering them. Unlike random error, systematic errors tend to be consistently either positive or negative - because of this, systematic error is sometimes considered to be bias in measurement.
*REDUCING MEASUREMENT ERROR* So how can we reduce measurement error, random or systematic?
1) One thing you can do is pilot test your instruments, getting feedback from your respondents regarding how easy or hard the measure was and how the testing environment affected their performance.
2) Second, if you are gathering measures using people to collect the data (as interviewers or observers), you should make sure you train them thoroughly so that they aren't inadvertently introducing error.
3) Third, when you collect the data for your study you should double-check the data thoroughly. All data entry for computer analysis should be "double-punched" and verified. This means that you enter the data twice, the second time having your data-entry program check that you are typing the exact same data you did the first time.
4) Fourth, you can use statistical procedures to adjust for measurement error. These range from rather simple formulas you can apply directly to your data to very complex procedures for modeling the error and its effects.
5) Finally, one of the best things you can do to deal with measurement error, especially systematic error, is to use multiple measures of the same construct.
Especially if the different measures don't share the same systematic errors, you will be able to triangulate across the multiple measures and get a more accurate sense of what's going on.
*NO MEASUREMENT ERROR IN THE IV (PERFECT RELIABILITY)*
a) When the assumption of NO MEASUREMENT ERROR IN THE IV is violated, the regression estimates will be biased. The strength of the prediction, R-squared, will always be underestimated. Measurement error commonly leads to bias in the estimates of the regression coefficients and their standard errors, as well as incorrect significance tests and confidence intervals.
b) To the extent there is random error in the measurement of the variables, the regression coefficients will be attenuated. To the extent there is systematic error in the measurement of the variables, the regression coefficients will be simply wrong.
i) You can compute a corrected estimate of what the slope would be with a perfectly reliable measure:
1) Correction for attenuation: b(corrected) = b(observed coefficient) / r_xx (the predictor's reliability)
ii) SEM
1) In contrast to OLS regression, SEM involves explicit modeling of measurement error, resulting in coefficients which, unlike regression coefficients, are unbiased by measurement error.
****Errors are independent, errors are independent of predictors, normality of residuals, mean of errors = 0.****
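A small simulation of the attenuation point, with invented values: adding random error to the predictor shrinks the estimated slope by roughly the reliability of the predictor, and dividing by r_xx approximately recovers it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

true_x = rng.normal(size=n)
y = 0.6 * true_x + rng.normal(scale=0.8, size=n)      # true slope = 0.6

error = rng.normal(scale=1.0, size=n)                 # measurement error in x
observed_x = true_x + error
r_xx = true_x.var() / observed_x.var()                # reliability of observed x (~.5)

def slope(x, y):
    """OLS slope of y on a single predictor x."""
    return np.polyfit(x, y, 1)[0]

b_obs = slope(observed_x, y)          # attenuated: roughly 0.6 * r_xx
b_corrected = b_obs / r_xx            # correction for attenuation from the notes
print(f"r_xx = {r_xx:.2f}, observed b = {b_obs:.2f}, corrected b = {b_corrected:.2f}")
```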

METHODS - Assumptions of Regression Previous comps question: "Describe the assumptions required for ordinary least squares regression analysis. How can you know whether each assumption has been violated? What impact will violation of these assumptions have on the results of the regression analysis?"

*OLS* stands for ordinary least squares. It derives its name from the criterion used to draw the best-fit regression line: a line such that the sum of the squared deviations of the distances of all the points from the line is minimized. If the regression assumptions are met, the equation will be BLUE:
1) *B*est among all possible estimators / has the smallest sampling error (estimates are most consistent across replications of the study)
2) *L*inear - only looking at linear relationships
3) *U*nbiased - the expected value of the parameter (slope) equals the population slope; if you averaged over a lot of samples, you would get the population value
4) *E*stimates of the population regression equation
SUMMARY OF BELOW, FOLLOWED BY DETAILS (N.O.P.E.E.C.I.N.P.N - "nope-ee-cin-pn"):
*1) No measurement error in the IV (i.e., perfect reliability)*
*2) Outcome variable is measured on an interval or ratio scale, and unbounded*
*3) Predictor variables are interval level or dichotomous.*
*4) Relationship between variables is linear.*
*5) Expected value of the error is equal to zero.*
*6) Errors are independent of predictor scores.*
*7) Constant variance of residuals (homoscedasticity).*
*8) Independence of errors.*
*9) Normality of residuals.*
*10) Predictors are additive - no interactions!*
*11) No perfect multicollinearity (MC).*
DETAILS
*1) No measurement error in the IV (i.e., perfect reliability)*
a) When this assumption is violated, the regression estimates will be biased. The strength of the prediction, R-squared, will always be underestimated. Measurement error commonly leads to bias in the estimates of the regression coefficients and their standard errors, as well as incorrect significance tests and confidence intervals.
b) To the extent there is random error in the measurement of the variables, the regression coefficients will be attenuated. To the extent there is systematic error in the measurement of the variables, the regression coefficients will be simply wrong.
i) You can compute a corrected estimate of what the slope would be with a perfectly reliable measure:
1) Correction for attenuation: b(corrected) = b(observed coefficient) / r_xx (reliability)
ii) SEM
1) In contrast to OLS regression, SEM involves explicit modeling of measurement error, resulting in coefficients which, unlike regression coefficients, are unbiased by measurement error.
*2) Outcome variable is measured on an interval or ratio scale, and unbounded*
a) No way to judge empirically, only conceptually
b) As long as you have roughly equal intervals, you will be fine
c) If you think that the variable doesn't even come close to being interval, then you need to use other methods.
i) For an ordinal outcome, use ordinal regression analysis.
1) Conceptually, you compute correlations among the variables and, based on those correlations, estimate the regression outcomes
ii) For categorical (or dichotomous) outcome variables, use logistic regression, which is specifically designed for categorical outcome variables.
*3) Predictor variables are interval level or dichotomous.*
a) Use dummy coding to represent categories in a way that regression analyses can handle. The trick is defining dichotomous variables in terms of the categorical variables.
*4) Relationship between variables is linear.*
a) Regression analysis is a linear procedure. To the extent nonlinear r-ships are present, conventional regression analysis will underestimate the r-ship.
That is, R-squared will underestimate the variance explained overall and the betas will underestimate the importance of the variables involved in the non-linear r-ship. Substantial violation of linearity thus means the regression results may be more or less unusable. Minor departures from linearity will not substantially affect the interpretation of the regression output. Checking that the linearity assumption is met is an essential research task when use of regression models is contemplated.
b) Solution: Come up with a model that can accurately represent curved lines: polynomial regression analysis. This involves raising the predictor variable to higher powers to be able to better represent curved lines.
*5) Expected value of the error is equal to zero.*
a) There is no way to evaluate this with data; it is something that has to be assumed. Violating this assumption doesn't cause bias in the slope; it only causes distortion in the intercept, which may be too high or too low. Not a big concern, because we rarely care about interpreting the intercept.
*6) Errors are independent of predictor scores.*
a) This is the "assumption of mean independence": that the mean error is independent of the IVs. This is a critical regression assumption which, when violated, may lead to substantive misinterpretation of output.
b) The (population) error term, which is the difference between the actual values of the DV and those estimated by the population regression equation, should be uncorrelated with each of the IVs. Since the population regression line is not known for sample data, the assumption must be assessed by theory. Specifically, one must be confident that the DV is not also a cause of one or more of the IVs, and that the variables not included in the equation are not causes of Y that are correlated with the variables which are included. Either circumstance would violate the assumption of uncorrelated error. One common type of correlated error occurs due to selection bias with regard to membership in the IV "group" (representing membership in a treatment vs. a comparison group): measured factors such as gender, race, education, etc. may cause differential selection into the two groups and may also be correlated with the DV. When there is correlated error, conventional computation of SDs, t-tests, and significance is biased and cannot be used validly.
*7) Constant variance of residuals (homoscedasticity).*
a) The assumption that the variance of the residuals is constant across all levels of the predictors. To detect violations, look at residual plots. If the data are homoscedastic, the plot should look like a rectangle. If you are violating this assumption and have heteroscedastic errors, the data look fan-shaped. If this happens, the estimates of the regression slopes will still be unbiased but will no longer be the best estimates - not "BLUE"; there are more precise (more efficient) estimates to be found. Also, standard errors and significance tests are biased.
b) Solution: Use a statistical analysis that takes into account differences in error variance. Do not use OLS estimation; instead, use Weighted Least Squares (WLS) or Generalized Least Squares (GLS). These will give you more accurate estimates of the regression coefficients; these methods don't treat all observations equally.
*8) Independence of errors.*
a) The error associated with one observation tells you nothing about the error for another observation.
b) Independent observations (absence of autocorrelation) lead to uncorrelated error terms.
Current values should not be correlated with previous values in a data series. This is often a problem with time series data, where many variables tend to increment over time such that knowing the value of the current observation helps one estimate the value of the previous observation. Spatial autocorrelation can also be a problem when the units of analysis are geographic units and knowing the value for a given area helps one estimate the value of the adjacent area. That is, each observation should be independent of the others if the error terms are not to be correlated, which would in turn lead to biased estimates of SDs and significance.
c) Autocorrelation: correlation of residuals across time - a violation of this assumption.
d) Clustered data: data that can be described at multiple levels of aggregation, sampling both groups and individuals within groups (e.g., students within classrooms). We tend to find that individuals within groups are more similar to each other than individuals between groups; thus the residuals aren't independent. Group effect: the residual for one person in a group tells you something about another person in that group's residual.
i) Either kind of situation tends to create bias in standard errors and in significance tests. CIs will be smaller than they should be; significance tests suffer from inflated Type I error rates (results look more significant than they should).
e) Solution: include the source of the dependence as part of the model. If we can figure out what is causing the dependence and include it in the model instead of leaving it in the errors, then we can solve the problem and correct for the autocorrelation.
f) Solution: The Durbin-Watson coefficient, d, tests for autocorrelation. The value of d ranges from 0 to 4. Values close to 0 indicate extreme positive autocorrelation; close to 4 indicates extreme negative autocorrelation; and close to 2 indicates no serial autocorrelation. As a rule of thumb, d should be between 1.5 and 2.5 to indicate independence of observations. Positive autocorrelation means the standard errors of the b coefficients are too small; negative autocorrelation means the standard errors are too large.
*9) Normality of residuals.*
a) Typically, non-normal errors arise when variables in the model are also not normally distributed. You need to look at the residuals, though, to see whether they are normally distributed.
b) Use histograms to plot the frequency of the residuals.
c) Non-normality has no effect on the regression coefficients themselves.
d) The assumption is needed for CIs and significance tests.
e) Extreme non-normality can bias standard errors and significance tests, but if the sample size is large, the tests are fairly robust to non-normality.
f) Slight non-normality is not a problem. If you have a small sample size and non-normal data, you can use data transformations to make distributions more normal. Transformations usually don't make a big difference in the results. The disadvantage of transformations is that the regression analysis is then in terms of the transformed rather than the original variables.
*10) Predictors are additive - no interactions!*
a) We can look at the effect of each predictor in an additive way, adding up the effects of the different variables and looking at each effect somewhat independently.
b) When this assumption is not met, there is an interaction among variables: the effect of predictor 1 depends on the level of predictor 2.
c) Solution: moderated multiple regression, which allows us to model interaction effects.
*11) No perfect multicollinearity (MC).* a) Three types of multicollinearity: i) None: Predictors are independent/orthogonal. ii) Some: Correlation among predictors iii) Perfect: At least one predictor is redundant. b) Three different problems we have due to MC: i) Computational problems: Only a problem when we have perfect or near perfect MC. If you have perfect MC, it becomes impossible to compute regression coefficients. ii) Interpretation problems: MC creates difficulty in interpretation of results. Interpretation is ambiguous about what to do with the shared contribution between variables. The interpretation process is much more challenging when you have related measures. iii) Decreased precision: As MC increases, precision decreases and standard errors increase. The standard errors of the beta weights are inflated. 1) Less power for sig. tests 2) Wider CIs 3) Less ability to make inferences about results c) Collinearity diagnostics: Useful for detecting if you will have problems with MC (a computational sketch follows this outline) i) MC Index: Computed separately for each predictor, tells you how much one variable overlaps with the other predictors. Close to 1 indicates a problem with MC ii) Tolerance = 1 - MC index. If the tolerance value is less than some criterion (e.g., less than .001), then SPSS won't even try to compute regression coefficients. iii) Variance Inflation Factor (VIF): 1) Index of how much MC inflates sampling variance 2) People have suggested that a VIF greater than 10 indicates a problem with MC. d) Solutions: i) Get a larger N ii) Combine predictors into a scale; the overall score is used instead of the individual components (only makes sense when variables are conceptually related) 1) Sometimes, however, composites may not be meaningful, which increases the difficulty of interpreting results iii) Drop one variable that seems to be overlapping with others. Should only drop in cases with extreme MC 1) Drop the variable that is least theoretically meaningful 2) Drop the variable that is most expensive to administer iv) Advanced Estimation Methods 1) Ridge regression is an attempt to deal with MC through use of a form of biased estimation in place of OLS. The method requires setting an arbitrary "ridge constant" which is used to produce estimated regression coefficients with lower computed standard errors. However, because picking the ridge constant requires knowledge of the unknown population coefficients one is trying to estimate, Fox (1991) recommends against its use in most cases.
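A minimal sketch of the collinearity diagnostics listed in (c), computed from scratch with NumPy: each predictor is regressed on the remaining predictors, and tolerance and VIF follow from that R-squared. The data are simulated for illustration only.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (R^2_j, tolerance, VIF) for each column of the predictor matrix X."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])       # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        tol = 1 - r2
        out.append((r2, tol, 1 / tol))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=300)        # highly collinear with x1
x3 = rng.normal(size=300)                               # independent
for j, (r2, tol, vif) in enumerate(collinearity_diagnostics(np.column_stack([x1, x2, x3])), 1):
    print(f"x{j}: R2={r2:.2f}  tolerance={tol:.2f}  VIF={vif:.1f}")
```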

METHODS - Moderator & Mediator Effects Previous comps question: "Theoretical models often include moderator and mediator effects. Define and contrast these two types of effects. Give an example of a situation where each would be appropriate. How would you test for each type of effect using multiple regression?"

*Whereas a mediating relationship attempts to identify a variable or variables through which the IV tends to act to influence the DV, moderating relationships refer to situations in which the relationship bt the IV and DV changes as a function of the level of a 3rd variable (the moderator).* Wagner et al (1988) hypothesized that individuals who experience more stress, as assessed by a measure of daily hassles, will exhibit higher levels of symptoms than those who experience little stress. That is what, in analysis of variance terms, would be the main effect of hassles. However, they also expected that if a person had a high level of social support to help them deal with their stress, symptoms would increase slowly with increases in hassles. For those who had relatively little social support, their symptoms were expected to rise more quickly as hassles increased. In a mediating relationship, a third variable mediates the relationship bt two other variables. e.g., take a situation where high levels of maternal care from your mother lead to feelings of competence and self-esteem on your part, which in turn, lead to high confidence when you become a mother. Here we would say that your feelings of competence and self-esteem MEDIATE the relationship between how you were parented and how you feel about mothering your own kids. Baron and Kenny (1986) laid out several requirements that must be met before we can speak of a mediating relationship. The standard mediation diagram connects the IV to the mediator by path "a", the mediator to the DV by path "b", and the IV directly to the DV by path "c". The predominant r-ship that we want to explain is labeled "c" and is the path from the IV to the DV. The mediating path has two parts, comprised of "a" (the path connecting the IV to the potential mediator) and "b" (the path connecting the mediator to the DV). Baron & Kenny (1986) argued that for us to claim a mediating r-ship, we need to first show that there is a sig. r-ship bt the IV and the mediator (if the mediator is not associated with the IV, then it couldn't mediate anything). The next step is to show that there is a sig. r-ship bt the mediator and the DV, for reasons similar to those of the first requirement. Then we need to show that there is a sig. r-ship bt the IV and the DV. These 3 conditions require that the 3 paths (a, b, c) are all individually sig. The final step consists of demonstrating that when the mediator and the IV are used simultaneously to predict the DV, the previously sig. path bt the IV and DV is now greatly reduced, if not nonsig. Maximum evidence for mediation would occur if c drops to 0. *EXAMPLE* 1) Training related to motivation, Training = IV, Motivation = Mediator, Performance = DV. Motivation related to Perf, Training related to Perf. The r-ship bt Training and Perf is drastically reduced or goes down to zero when motivation is added (a regression sketch of both kinds of tests appears below). *Full vs. Partial Mediation* Full is when the predictor doesn't add to the explanation of the DV above the mediator. Partial is when the mediator partially explains the relationship bt the IV and DV but the IV still directly explains the DV.
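A minimal regression sketch of both effects using simulated data (the variable names and effect sizes are invented for illustration and are not taken from Wagner et al. or Baron & Kenny):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

def ols(y, *predictors):
    """OLS coefficients for y regressed on the given predictors (intercept first)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# --- Mediation (Baron & Kenny): training -> motivation -> performance ---
training   = rng.normal(size=n)
motivation = 0.7 * training + rng.normal(size=n)        # path a
perf       = 0.6 * motivation + rng.normal(size=n)      # path b (no direct effect built in)

print("c  (IV -> DV):          ", round(ols(perf, training)[1], 2))
print("a  (IV -> mediator):    ", round(ols(motivation, training)[1], 2))
print("c', b (both -> DV):     ", np.round(ols(perf, training, motivation)[1:], 2))
# c' should shrink toward zero once the mediator is in the equation.

# --- Moderation: hassles x social support interaction ---
hassles  = rng.normal(size=n)
support  = rng.normal(size=n)
symptoms = 0.5 * hassles - 0.4 * hassles * support + rng.normal(size=n)
print("moderated regression b's:", np.round(ols(symptoms, hassles, support, hassles * support)[1:], 2))
# A sizable coefficient on the product term indicates the hassles-symptom slope
# depends on the level of social support (moderation).
```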

PSYCHOMETRICS - Computer Adaptive Testing (CAT) Lecture Notes Previous comps question: "What are some other adaptive models?"

1) CATs are only useful when you want uniform, high information over the range of theta. Certification and other pass/fail testing does NOT require uniform, high reliabilities and it would be a waste to use a traditional CAT model for such tests. An alternative CAT model, formulated for such situations, administers items until the 95% CI for theta-hat no longer contains the cut-score, or until the maximum number of items has been administered. This kind of CAT saves administration time for candidates who are clearly above or below the cut-score. Borderline candidates tend to have long tests under this scenario. The NCLEX nursing cert is like this. 2) The AICPA adopted a diff kind of adaptive model for the Uniform CPA exam. Many of the hardest problems with item-level adaptive tests are in the realm of automated test assembly (ATA). That is, a CAT literally assembles a test form on the fly while the test-taker is sitting there. Sometimes this can cause problems. CAST (computer adaptive sequential testing) or MST (multistage testing) solves or ameliorates many of the operational and practical objections to adaptive testing. 3) As an alternative to CAST/MST, van der Linden has proposed "shadow tests" in which an entire test is built on each iteration and a single item is selected from that test. Shadow testing reduces or eliminates the possibility of being unable to solve the ATA problem during testing. Clinical measurement researchers have explored adaptive algorithms for the MMPI which bear little resemblance to these CATs.

PSYCHOMETRICS - SEM in IRT vs CTT 1) In IRT, SEM ___________________. 2) In CTT, SEM _________________________.

1) "changes as a fxn of examinee trait level" 2) "is constant, regardless of examinee raw score" Once a researcher knows a test's info fxn, how precise that test is @ various stages of the latent trait can be determined. (SEM=(1/sqrtTIF))

PSYCHOMETRICS - Development of a selection test Previous comps question: "You are a consultant asked to develop an assessment for an administrative assistant position. This position requires a moderate degree of quantitative ability. Discuss the steps in the development of a selection test to measure "quantitative business skills" Assume that a thorough job analysis has already been conducted. Explain each step or task briefly and include any needed rationale or justification. Do not discuss the validation of the test"

1) Define the construct and domain of testing (based on JA, client request, standards, etc.) 2) Create test specification -Answer questions such as: -What are the target population(s)? -What administration mode(s)? Paper-pencil? CBT? CAT? IVR? -How long is the test (in terms of items or administration time) -What types of items? MC, FITB, essay, short answer -Create specific standards for all important test/item characteristics -Percent in each content area (or a more complex content spec) -Avg difficulty and range of difficulties -Target reliability -Editorial or SME standards -Surface characteristics (e.g., at least 10% Latino-sounding names) -Target number of words (or other measure of required reading) -Factorial validity 3) Write items -Write 50-200% more items than required by the final test length -Use as many SME writers as practical -Use good item-writing techniques (e.g., Haladyna) -Adhere to the test specifications -Editorial review and refinement; possibly screening -Sensitivity screening 4) Pretest -Select a pilot sample that is as large and representative as possible (some disagreement on this, below N=50 or 75, item analysis is meaningless; N=200 is a common target) -As much as possible, pretest the items in a standardized setting that matches operational conditions -Allow sufficient time so that the pretest is not speeded -If possible, collect measures of validity criteria 5) Item analyses (which differ based on CTT or IRT) -Balance psychometric concerns with concerns of other stakeholders (e.g., SMEs) -But advocate for important psychometric issues; don't let SMEs include many psychometrically poor items a) General issues: The goal is to pick the best items *Most reliable *Most valid *Optimally fitting the test specs *Avoiding legal entanglements *Avoiding formatting issues (e.g., fitting items into booklets, balanced key) ***The phrase item analysis is misleading because there are many analyses that can be performed and there is no one right way to do an item analysis. Very often the specific circumstances of the test development dictate that some aspect of item analysis be emphasized or de-emphasized ***Also, item analysis is generally a cooperative effort of all stakeholders. 
Henrysson (1971): "it must be stressed that items cannot be selected only on the basis of their statistical properties; it should be agreed by all concerned that the item is...good" b) Item Statistics *Item mean and standard deviation (the item mean is the proportion correct, p, if the item is scored 0-1, in which case the item variance is p(1-p)) *Item-total correlations (e.g., correlation of the item with the scale or test number-right score) *Ideally corrected so that the item is not included in the score *Pick either the point-biserial or the biserial and stick with it *Consider plotting the item-total against the difficulty and selecting the most correlated in several difficulty ranges *Also "option totals" *Correlation of the item with other scales (in the case of a test having multiple, correlated scales) *Alpha-if-item-deleted (i.e., literally, alpha is re-computed leaving the item out) *If the alpha rises without this item, consider deleting the item to increase alpha *This statistic usually mirrors the item-total *Don't get carried away or take this too seriously; "alpha purification" overestimates reliability and ignores validity (so it could potentially make a scale less valid) c) Record-keeping *Because item analysis is a team effort, involving many hurdles, it can be critical that good records are kept documenting the work performed *Tip: Spreadsheets are ideal for this purpose *Tip: Do not allow SMEs to encode data into the item ID because data changes but you want the item ID to follow the item forever *Tip: Choose item IDs that you can use as a variable name in SPSS, etc. to unmistakably link the item statistics to the item ID *Record keeping merges into item banking for large pools of maintained items 6) Norming, Scaling, Setting Cut-scores *This step depends greatly on the intended use of the test (may include additional data gathering) *The Angoff method or some modification is the most common method of setting a passing score on a certification or licensure exam without objective external validity criteria *Empirical or statistical expectancy charts are often used in selection settings (less commonly, utility analysis) a) Norming *Norming is the process of providing norms for a test. Many types of tests do not have formal norms (e.g., selection and certification exams). If the test-taker will receive feedback on their test performance, then norms are probably required. The test developer should provide norms for each intended use or target population *Norms are generally just percentiles, so it is important that the norm group be meaningful to the test users and test-takers and well-defined and well-described by the test developer *Required sample sizes depend on the situation, but the sample sizes must authoritatively define the population (usually hundreds or thousands for widely-used published instruments) *Local norms (within your gender, ethnicity, geographic area, school, employer, etc.) can be a useful contrast to more global norms b) Angoff Method of setting a passing score *A panel of SMEs is convened *They should be well-qualified to judge the performance of acceptable and unacceptable test-takers.
Size of panel is debatable, but ideally larger than "a few" *The panel defines a minimally acceptable candidate (MAC)-a test-taker who is just barely qualified to pass the exam *For each item, each SME estimates the probability that the MAC would correctly answer the item *This may be followed by group discussion with or without a consensus requirement *SMEs may be given information about the difficulty of the items (for the whole group) *The avg probability estimate is recorded for each item *The sum of these estimates indicates the expected test score of someone just barely acceptable. Scores below this passing score indicate unacceptable candidates *The validity of the standard setting procedure depends on the training, acceptability, and performance of the SMEs *The evidence of validity of the standard depends upon the thorough documentation of the standard setting process (as well as the qualifications and performance of the SMEs) *The process is subjective and thus of imperfect reliability c) Expectancy Charts *ECs show the r-ship between a test score (or, most commonly, ranges of test scores) and some organizationally-important criterion such as job performance, sales, safety, likelihood of stealing, likelihood of terminating, etc. *May be based on empirical data (e.g., counting the number of test-takers in the score range with "outstanding" supervisory ratings) or on a statistical model (e.g., regression of supervisory ratings onto test scores) *May be used to make a pass/fail decision, to set multiple levels (e.g., "stoplight" results), or to allot "points" *Simpler than a formal utility analysis and easier to fit into an "elevator description" d) Scaling *Refers to techniques that transform the raw scores of a test or scale to a standard score scale. e.g., GRE scores do not naturally have a mean of 500 and a st dev of 100 (there are far fewer than 500 items on the exam). Using scaling, any arbitrary standard score scale can be achieved. Scaling is not equating, although the two techniques are often used together. Scaling is used in many instances: 1) You have a pre-existing standard scale that must be used for reporting purposes 2) To aid examinees or others who will see the test scores and try to interpret them, especially when the raw score metric is "funny" 3) On multi-scale instruments so that scores on different scales can be compared directly There are many scaling techniques; one very common approach is based on linear transformations derived from standardization. Standardizing (or z-scoring) a variable changes its mean to zero and its SD to one: z = (x - xbar) / s Let x be the raw score, x* be the scaled score, xbar* be the desired mean, and s* be the desired st dev: x* = (x - xbar) * (s*/s) + xbar* If the scale is not predetermined, consider picking a metric that produces an easy-to-use SEM: SEM = s * sqrt(1 - rxx'), therefore s = SEM / sqrt(1 - rxx')
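A minimal sketch of the linear scaling transformation just described (the raw scores are simulated; the target mean/SD of 500/100 echoes the GRE-style example in the notes):

```python
import numpy as np

def rescale(raw, target_mean, target_sd):
    """x* = (x - xbar) * (s*/s) + xbar*, i.e. standardize then stretch/shift."""
    return (raw - raw.mean()) * (target_sd / raw.std()) + target_mean

rng = np.random.default_rng(3)
raw = rng.binomial(n=60, p=0.6, size=1000).astype(float)    # simulated number-right scores
scaled = rescale(raw, target_mean=500, target_sd=100)
print(round(scaled.mean()), round(scaled.std()))            # ~500, ~100

# Picking a scale that yields a convenient SEM, per the last formula above:
rxx = 0.90
print("SEM on the scaled metric:", round(100 * np.sqrt(1 - rxx), 1))
```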

PSYCHOMETRICS - DIF Previous comps question: "What are the steps in DIF analysis?"

1) Verify the data are unidimensional (use scree plots, DIMTEST, modified parallel analysis) 2) Use a program like BILOG to estimate the parameters for each group 3) The item parameters of the focal and reference group must be placed on a common metric 4) Run a DIF analysis and identify items that have DIF 5) Remove the flagged items and relink the metrics without the flagged items 6) Repeat the DIF analysis for all items using the new relinked metrics and identify DIF items 7) Repeat steps 5 and 6 until the same DIF items are identified on consecutive trials

PSYCHOMETRICS - G Theory Outline Previous comps question: "Describe the Motivation for G Theory"

CTT gives us X=T+E, which is a simple model that really accomplishes a lot. But a lot of what we discussed about estimating reliability illustrates that the CTT model leaves a lot either implicit or unstated. We discussed: *How disparate the retest and internal consistency methods are *How reliability is sample-specific (an interaction between the person and the test) *G theory explicates these issues within a common framework. Shavelson et al. state: Test-retest reliability counts day-to-day variation in performance as error, but not variation due to item sampling. An internal consistency coefficient counts variation due to item sampling as error, but not day-to-day variation. Alternate-forms reliability treats both sources as error. CTT, then, sits precariously on shifting definitions of true-and-error scores. Later, the authors cite Cronbach: The question of reliability then resolves into a question of accuracy of generalization, or generalizability.

METHODS - SHRINKAGE p2 2) Explain the substantive / practical effects of shrinkage. Cohen et al (2003)

2) *Substantive/practical effects of shrinkage*: a) The most desirable outcome in this process is for minimal shrinkage, indicating that the prediction equation will generalize well to new samples or individuals from the population examined. b) *How much shrinkage is too much shrinkage?*: There are no clear guidelines concerning how to evaluate shrinkage, except the general agreement that less is always better. But is 3% acceptable? What about 5% or 10%? Or should it be a proportion of the original R-squared (so that 5% shrinkage on an R-squared of .50 would be fine, but 5% shrinkage on R-squared of .30 would not be)? No guidelines in the literature. c) The smaller the sample size, the greater the inflation of the sample R-squared. The more IVs there are, the more opportunity for the sample R-squared to be larger than the true population squared multiple correlation (Cohen et al, 2003)

METHODS - SHRINKAGE p3 3) Describe at least two ways to honestly estimate the *true multiple R* Osborne (2000) and Cohen et al (2003)

3) *Two ways to honestly estimate the true multiple R*: One way is *cross-validation* and another is *shrunken or adjusted R-squared*. 1) Cross-validation: To perform cross-validation, gather either two large samples or one very large sample which will be split into two samples via random selection procedures. The prediction equation is created in the first sample. The equation is then used to create predicted scores for the members of the second sample. The predicted scores are then correlated with the observed scores on the DV (ryy'). This is called the cross-validity coefficient. The difference between the original R-squared and ryy'^2 is the shrinkage. The smaller the shrinkage, the more confidence we can have in the generalizability of the equation (Osborne, 2000). *Basically, divide the sample into a main and a holdout sample, derive the equation in the main sample, and then apply it to the holdout sample to evaluate its accuracy. 2) Shrunken or adjusted R-squared: It is often desirable to have an estimate of the population squared multiple correlation (and of course we want one that is more accurate than the positively biased sample R-squared). Such an estimate of the population squared multiple correlation is given by: adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the sample size and k is the number of predictors. This estimate is appropriately smaller than the sample R-squared and is often referred to as the "shrunken" R-squared.
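A minimal sketch of both approaches with simulated data (the sample sizes and number of predictors are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 6
X = rng.normal(size=(n, k))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)      # only 2 of the 6 IVs matter

def fit(X, y):
    """OLS coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

main, hold = slice(0, 60), slice(60, None)

# (1) Cross-validation: derive the equation in the main sample, apply it to the holdout.
b = fit(X[main], y[main])
yhat_main = np.column_stack([np.ones(60), X[main]]) @ b
yhat_hold = np.column_stack([np.ones(60), X[hold]]) @ b
r2_main = np.corrcoef(y[main], yhat_main)[0, 1] ** 2        # sample R-squared
r_cv = np.corrcoef(y[hold], yhat_hold)[0, 1]                # cross-validity coefficient
print(f"R2 main = {r2_main:.3f}  r_cv^2 = {r_cv**2:.3f}  shrinkage = {r2_main - r_cv**2:.3f}")

# (2) Shrunken / adjusted R-squared from the formula above (n = 60, k = 6 here).
r2_adj = 1 - (1 - r2_main) * (60 - 1) / (60 - 6 - 1)
print(f"adjusted R2 = {r2_adj:.3f}")
```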

PSYCHOMETRICS - Correction for attenuation Previous comps question: "The standard correction for attenuation can be used to correct for predictor unreliability or criterion unreliability or both. Explain when each of these 3 variations would be appropriate. Give a concrete example of each type of situation."

Allen & Yen (figure out date - check book) If you have measurement error in both x and y, that will reduce your validity coefficient...this is why you would even care about correcting for predictor and criterion unreliability. *If both the test score and the criterion score are unreliable, the validity coefficient may be reduced in value relative to the validity coefficient that would be obtained if x and y did not contain measurement error. *Attenuation inequality: sqrt(rxx * ryy) >= rxy. Reliability puts an upper limit on validity. Therefore, predictor and criterion reliability put a limit on the amount of validity; only by achieving high reliability can you hope to achieve high validity. If we measured the criterion without error, what would the validity be? We are looking at the correlation between the observed predictor score and the true criterion score. Correcting for unreliability in the criterion only is acceptable; it basically assumes that there is no unreliability in the criterion. In practical settings, we can treat performance (the criterion construct) as if it could be measured without measurement error; tests, however, can never be without error. Therefore, to correct for unreliability in the criterion only: r(x,Ty) = rxy / sqrt(ryy). The only time you would correct for unreliability in the predictor is when you want to compare two tests that have different reliabilities. If you have more than one test with differing reliabilities and you want to decide which test will most likely produce the highest validity, you want to correct for the unreliability in the predictor to choose the most appropriate test.
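A minimal numerical sketch of the corrections (the reliability and validity values are invented for illustration):

```python
import numpy as np

r_xy, r_xx, r_yy = 0.30, 0.80, 0.60   # observed validity, predictor and criterion reliabilities

corrected_for_criterion = r_xy / np.sqrt(r_yy)            # criterion unreliability only
corrected_for_predictor = r_xy / np.sqrt(r_xx)            # predictor unreliability only
corrected_for_both      = r_xy / np.sqrt(r_xx * r_yy)     # double correction

print(round(corrected_for_criterion, 2),
      round(corrected_for_predictor, 2),
      round(corrected_for_both, 2))
# The attenuation inequality, r_xy <= sqrt(r_xx * r_yy), means reliability caps validity.
```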

PSYCHOMETRICS - 1 PL Model (what's another name for it also?) Describe in detail and note what is different from 3 PL Model and 2 PL Model

Also called the Rasch model -The a parameters are all constrained to be equal (=1) so only difficulty (b) and theta are left in the model -General features of the ICC in the Rasch model: -The probabilities gradually increase with trait level for each item. In other words, as your ability level increases, the prob. of getting the item correct increases -Items differ only in difficulty. The slopes of the curves are equal. The curves converge but they do not cross. -The point of inflection of the ICC, where the rate of change shifts from accelerating increases to decelerating increases, occurs when the prob of passing an item is .50 -This model has substantial elegance: -Lord's paradox cannot occur bc item curves do not cross. -Guessing does not cloud the estimation of theta -The number-right score is a sufficient statistic for theta, allowing estimation of the item parameters to be far easier and less controversial. In other words, bc there are no guessing or differing discrimination parameters, the number of items answered correctly is a sufficient statistic to estimate an individual's theta -There is less ambiguity comparing the difficulty of items and the ability of people How diff from 2PL and 3PL models? -In the 1PL and 2PL, the amount of info an item provides is maximized around the item-difficulty (b) parameter. Consequently, items that are matched in difficulty to examinee ability are maximally informative (E&R, 2000) -In the 1PL model it is assumed that item difficulty is the only item characteristic that influences examinee performance. There is no discrimination parameter - thus all items are equally discriminating. Also the lower asymptote of the ICC is zero...this specifies that examinees of very low ability have zero probability of correctly answering the item. Thus, no allowance is made for the possibility that low-ability examinees may guess, as they are likely to do on MC tests (Hambleton et al. 1991...this is a flaw in the 1PL model) -Also, the 1PL is based on restrictive assumptions. The appropriateness of these assumptions depends on the nature of the data and the importance of the intended application. e.g., the assumptions may be quite acceptable for relatively easy tests constructed from a homogenous bank of test items (Hambleton et al 1991)
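A minimal sketch of the Rasch response function with invented item difficulties, illustrating the equal-slope, non-crossing ICCs described above:

```python
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1 / (1 + np.exp(-(theta - b)))

difficulties = [-1.0, 0.0, 1.5]          # invented item difficulties
for theta in (-2.0, 0.0, 2.0):
    probs = [round(rasch_p(theta, b), 2) for b in difficulties]
    print(f"theta={theta:+.1f}  P(correct) per item: {probs}")
# At theta == b the probability is exactly .50 (the inflection point), and the
# lower asymptote is 0 -- there is no guessing parameter, unlike the 3PL.
```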

METHODS - Communality - (Definition) Comps questions related to Communalities AFTER PRINTING, ANSWER BELOW WITH FLASHCARD 1) You have conducted a factor analysis of a job attitude survey. One of the items on the survey has very small loadings on all of the factors (variable's low correlation with factor), and has a very low communality (proportion of variance accounted for by factor). What are the possible explanations for this and should the item be dropped? 2) What is communality? Explain the central role of communality estimates in factor analysis. How are initial communality estimates computed, and how do these differ from final communality estimates? 3) Define communality. Why are communality estimates needed for factor analysis? Describe how both initial and final communality estimates are computed. How are initial communality estimates computed? How does this differ from the final communality estimates? 4) Define and discuss the central role communality plays in factor analysis.

Communality or h^2 (h-squared) is the sum of squared loadings for a variable across the c factors (c = number of factors). Communality represents the percent of variance in a given variable explained by all the factors jointly and may be interpreted as the reliability of the indicator. *1) Low communality.* When an indicator variable has a low communality, the factor model is not working well for that indicator and possibly it should be removed from the model. Low communalities across the set of variables indicate the variables are unrelated to each other. However, communalities must be interpreted in relation to the interpretability of the factors. A communality of .75 seems high but is meaningless unless the factor on which the variable is loaded is interpretable, though it usually will be. A communality of .25 seems low but may be meaningful if the item is contributing to a well-defined factor. That is, what is critical is not the communality coefficient per se, but rather the extent to which the item plays a role in the interpretation of the factor, though often this role is greater when communality is high. *2) Spurious solutions.* If the communality exceeds 1.0, there is a spurious solution, which may reflect too small a sample or too many or too few factors. *3) Computation.* Communality for a variable is computed as the sum of squared factor loadings for the variable (row). Recall r-squared is the percent of variance explained, and since the factors are uncorrelated, the squared loadings may be added to get the total percent explained, which is what communality is. (a) For full orthogonal PCA, the initial communality will be 1.0 for all variables and all of the variance in the variables will be explained by all of the factors, which will be as many as there are variables. The "extracted" communality is the percent of variance in a given variable explained by the factors which are extracted, which will usually be fewer than all the possible factors, resulting in coefficients less than 1.0. (b) For PFA and other extraction methods, however, the communalities for the various factors will be less than 1 even initially. Estimates for obtaining initial communalities: 1) Squared multiple correlation a) Regress that variable on all the other variables to get the R-squared statistic (the default in SPSS) 2) Largest correlation with another variable 3) Reliability: a) Variance in x is divided into shared variance plus unique variance. Unique variance is separated into true score unique variance and random measurement error. Shared variance plus true score unique variance equals your reliability (reliability is an UPPER BOUND on communality)
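A minimal sketch of the computation in (3): communality as the row-wise sum of squared loadings under an orthogonal solution. The loading matrix is invented for illustration.

```python
import numpy as np

# rows = variables, columns = factors (orthogonal solution assumed)
loadings = np.array([
    [0.80, 0.10],
    [0.75, 0.05],
    [0.10, 0.70],
    [0.15, 0.20],    # weak item: low loadings on every factor
])
h2 = np.sum(loadings ** 2, axis=1)     # communality per variable
print(np.round(h2, 2))                 # the last variable has a low communality (~0.06)
```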

PSYCHOMETRICS - DIF Previous comps question: "What are the Non-IRT methods of DIF detection?"

NON IRT METHODS 1) Mantel-Haenszel technique A) Compares the item performance of focal and reference group examinees across score groups B) Score groups are made up of all examinees with the same raw score or all examinees with raw scores in a predetermined interval C) At the score group level, individuals from the focal and reference group are compared on the item and are expected to do equally well on it (In other words, if the raw score is the same, the proportion of people getting the item right in the focal group should be the same as in the reference group. Raw score is used as a proxy for theta. If the proportions are diff, then you have DIF.) D) A chi-square test is used to test whether the proportions are equal. Within each score range, the same proportion of the focal and reference groups should be getting the item correct. PROS: easy to use, does not require large sample sizes, most powerful method when DIF is uniform and the group mean abilities are the same CONS: Not sensitive to nonuniform DIF, known to confuse DIF w/ impact, non-robust when group mean abilities are very different (when the groups truly differ in ability, impact can masquerade as DIF) 2) Logistic regression A) Swaminathan & Rogers proposed a DIF technique using a logistic regression fxn, in which the prob of answering an item correctly is expressed as P(correct) = e^z / (1 + e^z), where z = b0 + b1*x + b2*g + b3*(x*g), and where: x is the observed score; g = 1 if the examinee belongs to the reference group and g = 0 if the examinee belongs to the focal group; x*g represents the interaction between the observed score and group membership; b0, b1, b2, b3 are the regression coefficients. b2 != 0 with b3 = 0 represents uniform DIF; b3 != 0 indicates nonuniform DIF (see the numerical sketch below). The 4 coefficients are estimated for each item and tested to determine if they're sig. diff from 0; based on this info, the type and extent of DIF on a given item can be assessed. A type of hierarchical regression method: need to understand the intercorrelation of the beta weights. PROS: Sensitive to uniform and nonuniform DIF, more flexible than the IRT models (allows different models to be specified, can handle multiple ability estimates), easy to use, may not require large samples CONS: Computationally intensive, and the effect size measures and set guidelines are not sensitive to the actual magnitude of DIF
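A minimal numerical sketch of the logistic DIF model above, evaluated at invented coefficient values (no model fitting is done here) to show how b2 and b3 produce uniform vs. nonuniform DIF:

```python
import numpy as np

def p_correct(x, g, b0, b1, b2, b3):
    """P(correct) = e^z / (1 + e^z), with z = b0 + b1*x + b2*g + b3*x*g."""
    z = b0 + b1 * x + b2 * g + b3 * x * g
    return 1 / (1 + np.exp(-z))

x = np.linspace(-2, 2, 5)                             # matched (standardized) observed scores
uniform    = dict(b0=0.0, b1=1.0, b2=0.8, b3=0.0)     # b2 != 0, b3 == 0  -> uniform DIF
nonuniform = dict(b0=0.0, b1=1.0, b2=0.0, b3=0.6)     # b3 != 0           -> nonuniform DIF

for label, coefs in (("uniform", uniform), ("nonuniform", nonuniform)):
    gap = p_correct(x, g=1, **coefs) - p_correct(x, g=0, **coefs)   # reference minus focal
    print(label, np.round(gap, 2))
# Uniform DIF: one group is favored in the same direction at every score level
# (a constant shift on the logit scale). Nonuniform DIF: the size and direction
# of the gap change across the score range.
```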

METHODS - Outliers Previous comps question: "Describe outliers"

Outliers are a form of violation of homoscedasticity. Detected in the analysis of residuals and leverage statistics, these are cases representing high residuals (errors) which are clear exceptions to the regression explanation. Outliers can affect regression coefficients substantially. The set of outliers may suggest/require a separate explanation. Some computer programs allow an option of listing outliers directly, or there may be a "casewise plot" option which shows cases more than 2 SD from the estimate. To deal with outliers, you may remove them from analysis and seek to explain them on a separate basis, or transformations may be used which tend to "pull in" outliers. These include the square root, logarithmic and inverse (x = 1/x) transforms.

METHODS - When should you use PCA vs PAF?

Situations/considerations under which each is recommended: *Generally, the choice between PCA and PAF depends on your assessment of the fit between the models, the dataset, and the goals of the research.* *PCA* 1) If you simply want an empirical summary of your dataset, PCA is the better choice! (T&F, 2007) 2) PCA is generally used when the research purpose is data reduction (to reduce the information in many measured variables into a smaller set of components) 3) PCA is also useful as an initial step in FA where it reveals a great deal about the maximum number and nature of factors (T&F, 2007). *PAF* 1) If you are interested in a theoretical solution uncontaminated by unique and error variability and have designed your study on the basis of underlying constructs that are expected to produce scores on your observed variables, FA is your choice! (T & F, 2007) 2) PAF is generally used when the research purpose is to identify latent variables which contribute to the common variance of the set of measured variables, excluding variable-specific (unique) variance 3) PAF can also be used as a tool for reducing the number of variables or examining patterns of correlations among variables (T & F, 2007). a) Under these circumstances, both the theoretical and practical limitations of PAF are relaxed in favor of a frank exploration of the data

METHODS - Confidence Interval & Regression Coefficient Previous comps questions: 1) Describe how you would build a confidence interval around a regression coefficient from a multiple regression analysis. What exactly does this interval tell you about your result? What factors influence the width of the confidence interval?

The confidence interval (CI) of the regression coefficient *Based on t-tests, the CI is the plus/minus range around the observed sample regression coefficient, within which we can be, say, 95% confident the real regression coefficient for the population regression lies.* Confidence limits are relevant only to random sample datasets. If the CI includes 0, then there is no significant linear relationship between x and y. We then do not reject the null hypothesis that x is independent of y. What does the interval tell you about your results? *READILY PROVIDES SIGNIFICANCE TESTING INFORMATION along with a range of plausible population values for a parameter. If the null-hypothesized value of the parameter (often, zero) does not fall within the calculated CI, the null hypothesis can be rejected at the specified alpha level. *Therefore, CIs provide more info than the conventional null hypothesis sig. test procedure.* What factors influence the width of a CI? *For a given level of confidence, the narrower the CI, the greater the precision of the sample mean as an estimate of the population mean. In a narrow interval, the mean has less "room" to vary. There are 3 factors that will influence the width of a CI at a given level of confidence. 1) First, the width of the CI is related to the variance of the sample scores on which it is calculated. If this sample variance can be reduced (e.g., by increasing the reliability of measurements), the CI will be narrower, reflecting the greater precision of the individual measurements. Selecting a sample that is more homogenous will reduce the variance of scores and thereby increase their precision. This factor, however, is often outside the researcher's influence. 2) Second, following the principles of sampling theory, sampling precision increases in a curvilinear fashion with increasing sample size. This increase in precision occurs because the variance of a statistic, as expressed by its standard error, decreases as sample size increases. Figure 1 shows 4 samples of progressively greater size drawn from a single population of physical therapists and the mean period of post-qualification experience for each sample. The mean is precisely the same in each case, but the CI becomes narrower as the sample size increases. As sampling precision is related to the square root of the sample size, doubling the sample size will only decrease the width of the CI by 25% (Hurlburt, 1994). 3) Third, the chosen level of confidence will influence the width of the CI. If the investigator wants to be 99% confident of having included the population mean, the interval will be wider. With a higher level of confidence, the interval needs to be wider in order to support the claim of having included the population parameter at the chosen level of confidence. Conversely, a 90% CI would be narrower than a 95% CI (Gore, 1981). ***IN SUM*** *3 factors affect width of CI* 1) Decreased sample variance leads to narrower CI (increased measurement reliability, increased homogeneity of sample, and decreased error variance). 2) Increased sample size leads to narrower CI (as N increases, SE goes down, which leads to a narrower CI). 3) Decreased confidence level leads to narrower CI (with a higher confidence level, e.g., 99%, you need a wider CI to be sure the population parameter is included).
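A minimal sketch of the construction itself, b_j +/- t(alpha/2, n-k-1) * SE(b_j), using simulated data and the usual OLS variance formula s^2 (X'X)^-1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
df = n - X.shape[1]
s2 = resid @ resid / df                                  # residual variance estimate
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))       # standard errors of the b's
t_crit = stats.t.ppf(0.975, df)                          # two-sided 95% critical value

for name, bj, sej in zip(["intercept", "x1", "x2"], b, se):
    print(f"{name}: b={bj:.3f}  95% CI = [{bj - t_crit*sej:.3f}, {bj + t_crit*sej:.3f}]")
# If a coefficient's CI excludes zero, the corresponding null hypothesis is
# rejected at alpha = .05; wider CIs reflect smaller n or larger error variance.
```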

PSYCHOMETRICS - How do all the reliability estimation methods relate to each other?

The retest methods are based directly on the concept of repeated, independent administrations. The internal consistency methods are based on the definition of reliability as a ratio of variances (or as the squared correlation of observed and true scores). That is, these methods rest on substantially different bases. What relationship can you draw between such different methods? That said, I think the internal consistency estimates produce a more "psychometric" measure of reliability (an estimate based on theory), but one that is at a single point in time. The retest methods all involve additional sources of error, and so they generally produce lower estimates. Also, putting aside the (abstract) theory of CTT, a real-world scale that was perfectly internally consistent would ask tiny variations of the same item. That would obviously not be a "good" measure.

PSYCHOMETRICS - Best practices for estimating reliability

There is no one best way to estimate reliability. Here are several best practices: 1) Because reliability is specific to a sample, use the most representative possible data when computing a reliability estimate. It is also important to users of your instrument that you thoroughly document the population from which you sampled your data. *Corollary: If a reliability estimate is not given for your population (or details regarding the reliability estimation are sketchy/missing), then you may experience significantly different reliability in your use of the instrument and you must be cautious in your use of the measure 2) Match the type of estimate to the problem. For a predictive validity study, you might be most interested in a test-retest estimate with a retest interval comparable to the period between predictor and criterion measurement 3) Compute and report more than one estimate. e.g., compute an internal consistency estimate and also a test-retest or alternate-forms estimate. You may wish to collect several test-retest estimates after different retest periods. Consider collecting data from different populations (or subpopulations) 4) Do not apply reliability methods blindly. e.g., avoid inducing a positive correlation among error terms (which will cause the error to be counted as true score variance): a) Do not use internal consistency methods for speeded tests b) Do not use such short test-retest periods that participants repeat their correct and incorrect answers

PSYCHOMETRICS - Key CTT Results per Allen & Yen *basic question, what are the symbols for standard deviation and variance? 1) What are 7 key conclusions of CTT? 2) When are tests considered parallel forms?

sigma = standard deviation and sigma squared = variance #1 answer 1) In a population, the variance of the observed scores is equal to the sum of the variances of the true and error scores: sigma squared sub X = sigma squared sub T + sigma squared sub E 2) Reliability can be defined as a ratio of variances 3) Reliability is equal to the square of the correlation between observed and true scores (which is usually interpreted as implying that the criterion-related validity of a test can never be higher than the square root of the reliability) 4) The standard deviation of the error term (called the standard error of measurement (SEM)) can be calculated as the observed-score standard deviation * sqrt(1 - reliability) 5) The correlation of the true scores of two variables can be estimated 6) CTT provides a means of estimating the reliability of a composite 7) And finally, CTT shows that as test length increases, reliability increases. The relationship is curvilinear, with diminishing returns: lengthening a short or unreliable test improves reliability dramatically, while lengthening an already-reliable test yields only small marginal gains. (Sadly, virtually all tests fall into the latter category, where tests have to be lengthened substantially to realize even small gains in reliability) #2 answer 1) Two tests are parallel if they have equal true scores and equal error variances, so they are expected to produce the same scores for any individual 2) Two tests are essentially tau-equivalent if their true scores are equal up to an additive constant (their error variances may differ)...which is almost, but not quite, the same thing as being parallel tests
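A minimal numerical sketch of results 4 and 7 (the SD and reliability values are invented; the test-length result is implemented here via the Spearman-Brown prophecy formula, which is the standard expression of that relationship):

```python
import numpy as np

s_x, rxx = 10.0, 0.80
sem = s_x * np.sqrt(1 - rxx)
print("SEM =", round(sem, 2))                       # ~4.47 score points

def spearman_brown(rxx, k):
    """Reliability of a test lengthened by a factor k (k < 1 means shortened)."""
    return k * rxx / (1 + (k - 1) * rxx)

for k in (0.5, 1, 2, 4):
    print(f"length x{k}: reliability = {spearman_brown(rxx, k):.2f}")
# Doubling an already-reliable test yields only a small gain (.80 -> .89),
# consistent with the diminishing returns described in result 7.
```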

METHODS - Significance Testing - Limitations & Alternatives Previous comps question: 1) Schmidt (1992) has argued that over-reliance on statistical significance testing has impeded progress in psychology. Describe the logic of significance testing. 2) What information about the data is provided from a significance test? 3) What are the limitations of traditional significance tests? 4) What alternatives exist to overcome these limitations?

*1) Schmidt (1992) has argued that over-reliance on statistical significance testing has impeded progress in psychology. Describe the logic of significance testing.* a) The most common statistical procedure is to pit a NULL hypothesis (H0) against an alternative hypothesis (H1). 1) H0 usually refers to "no difference" or "no effect". 2) H0 is a specific statement about results in a population that can be tested (and therefore nullified). 3) One reason that null hypotheses are often framed in terms of "no effect" is that the alternative that is implied by the hypothesis is easy to interpret. Thus, if researchers test and reject the hypothesis that treatments have no effect, they are left with the alternative that treatments have at least some effect. 4) Another reason for testing the hypothesis that treatments have no effect is that probabilities, test statistics, etc., are easy to calculate when the effect of tx is assumed to be nil. 5) In contrast, if researchers test and reject the hypothesis that the difference between tx is a certain amount (e.g., 6 points), they are left with a wide range of alternatives (e.g., diff is 5 points, 10 points, etc.), including the possibility that there is NO difference. *2) What information about the data is provided from a significance test?* a) Showing that a result is sig. at the .05 level does not necessarily imply that it is important or that it is especially likely to be replicated in a future study (Cohen, 1994). b) Tests of the traditional null hypothesis are more likely to tell researchers something about the SENSITIVITY of the study than about the phenomenon being studied. 1) With LARGE samples, statistical tests of the traditional null hypothesis become so sensitive that they can detect any difference between a sample result and the specific value that characterizes the null hyp., even if the difference is negligibly small. 2) With SMALL samples, it is difficult to establish that anything has a statistically significant effect. c) Regardless of the true strength of the effect, the likelihood of rejecting the traditional null hypothesis is very small when samples are small, and it is virtually certain when samples are large. d) At best, a test of the hypothesis of no difference can provide evidence against the null hypothesis of no difference, no effect, or no correlation. But it provides little evidence for any particular alternative hypothesis. *3) What are the limitations of traditional significance tests?* a) It impedes researchers from formulating and testing hypotheses about specific nonzero values for parameters based on theory, prior knowledge, and/or estimates of parameters based on accumulated data in situations where they have theory or knowledge to go on. b) The logic of nil hypothesis testing seems askew because if a researcher has a theory that a certain tx has an effect, his theory is supported by rejecting another hypothesis (that there is no effect) rather than by making a successful specific prediction that is within the bounds of measurement error of the observed value 1) It seems unreasonable to regard the rejection of some other hypothesis, in favor of an alternative that is so vague in its content (merely that there is a difference), as support for a theory *4) What alternatives exist to overcome these limitations?* a) *CIs and Effect Size Estimation* 1) Ordinary CIs provide more information than do p-values.
Knowing that a 95% CI includes zero tells one that, if a test of the hypothesis that the parameter equals zero is conducted, the resulting p-value will be greater than .05. A CI provides both an estimate of the effect size and a measure of its uncertainty. A 95% CI of, say, (-50, 300) suggests the parameter is less well estimated than would a CI of (120, 130). Perhaps surprisingly, CIs have a longer history than statistical hypothesis tests (Schmidt and Hunter, 1997). 2) With its advantages and longer history, why have CIs not been used more than they have? Steiger and Fouladi (1997) and Reichardt and Gollob (1997) posited several explanations: (1) Hypothesis testing has become a tradition (2) Advantages of CIs are not recognized (3) Ignorance of available procedures (4) CI estimates not available in statistical packages (5) Sizes of parameter estimates are often disappointingly small even though they may be very significantly different from zero (6) The wide CIs that often result from a study are embarrassing (7) Some hypothesis tests (e.g., the chi-square contingency table test) have no uniquely defined parameter associated with them (8) Recommendations to use CIs often are accompanied by recommendations to abandon statistical tests altogether, which is unwelcome advice. These reasons are not valid excuses for avoiding CIs in favor of hypothesis tests in situations for which parameter estimation is the objective.

METHODS - Shrinkage p1 1) Define *"Shrinkage"* in the context of *multiple regression*. Osborne (2000) and Cohen et al (2003)

1) *Shrinkage definition*: In a prediction analysis, the computer will produce a regression equation that is optimized for the sample. Because this process capitalizes on chance and error in the sample, the equation produced in one sample will not generally fare as well in another sample (i.e., R-squared in a subsequent sample using the same equation will not be as large as R-squared from original sample), a phenomenon called shrinkage (Osborne, 2000).

PSYCHOMETRICS - DIF Previous comps question: "Why does DIF occur?"

Domain sampling explanation: items were not randomly sampled from the domain of possible items. Multidimensionality explanation: assumes that most items are multidimensional, but are fitted to models that assume unidimensionality.

METHODS - Effects of Dichotomization on Correlation Previous comps question: "Some variables are naturally dichotomous, while others are dichotomized artificially. What effect does each type of dichotomization have on the correlation coefficient? How should you analyze the degree of association when one or both of the variables are naturally or artificially dichotomous?"

MacCallum et al. (2002) *Type of Dichotomization* 1) One common approach is to split the scale at the sample median, thereby defining high and low groups on the variable in question; this approach is referred to as a median split. 2) Alternatively, the scale may be split at some other point based on the data (e.g., 1 SD above the mean) or at a fixed point on the scale designated a priori. *EFFECTS OF DICHOTOMIZATION* Such dichotomization alters the nature of individual differences. After dichotomization, individuals A and B are defined as equal, as are individuals C and D. Individuals B and C are different, even though their difference prior to dichotomization was smaller than that between A and B, who are now considered equal. Following dichotomization, the diff bt A and D is considered to be the same as that between B and C. The median split alters the distribution of X. Most of the information about individual differences in the original distribution has been discarded, and the remaining information is quite different from the original. To justify such discarding of information, one would need to make one of two arguments. First, one might argue that the discarded information is essentially error and that it is beneficial to eliminate such error by dichotomization. The implication of such an argument would be that the true variable of interest is dichotomous and that the dichotomization of X produces a more precise measure of that true dichotomy. An alternative justification might involve recognition that the discarded information is not error but that there is some benefit to discarding it that compensates for the loss of information. REVIEW OF TYPES OF CORRELATION COEFFICIENTS AND THEIR RELATIONSHIPS Some variables that are measured as dichotomous variables are not true dichotomies. Consider, for example, performance on a single item on a multiple choice test of math skills. The measured variable is dichotomous (right vs. wrong) but the underlying variable is continuous (level of mathematical knowledge or ability). Special types of correlations, specifically biserial and tetrachoric correlations, are used to measure relationships involving such artificial dichotomies. Use of these correlations is based on the assumption that underlying a dichotomous measure is a normally distributed continuous variable. *For the case of one quantitative and one dichotomous variable, a biserial correlation provides an estimate of the relationship between the quantitative variable and the continuous variable underlying the dichotomy. *For the case of two dichotomous variables, the tetrachoric correlation estimates the relationship between the two continuous variables underlying the measured dichotomies. *Biserial correlations could be used to estimate the relationship between a quantitative measure, such as a measure of neuroticism as a personality attribute, and the continuous variable that underlies a dichotomous test item, such as an item on a mathematical skills test. *Tetrachoric correlations are commonly used to estimate relationships between continuous variables that underlie observed dichotomous variables, such as two test items. If both variables are continuous, then you can use the uncorrected Pearson correlation (rxy), as there is no corrected correlation available to use. 1) Pearson correlation a) The ubiquitous, "regular" correlation that you know and love; there is no corrected form, and it ranges from -1 to 1.
If one variable is dichotomous and the other is continuous, then you can use the uncorrected point-biserial correlation (rpbis), or the corrected biserial correlation (rbis). 1) Point-biserial correlation a) Occurs when you correlate a right/wrong item score with a relatively continuous variable like a scale score or criterion. b) Range of values depends on the underlying relationship and the proportion getting the item right/wrong; the range is maximized when p=.50; otherwise, it cannot reach -1 or 1. c) NUMERICALLY equivalent to the Pearson correlation. 2) Biserial correlation a) Similar to the point-biserial b) Corrects for the artificial dichotomy of the dichotomous item score c) If the correction works, values range from -1 to 1, although it is possible to obtain values beyond (-1, 1). d) NUMERICALLY ALWAYS GREATER (in absolute value) than the point-biserial/Pearson correlation (can be written as the point-biserial with a correction factor). If both variables are dichotomous, then you can use the uncorrected phi correlation (phixy) or the corrected tetrachoric correlation (rtetrachoric). 1) Phi correlation a) Occurs when two dichotomous variables, like two item right/wrong scores, are correlated b) Range of values depends on the underlying relationship AND the proportions getting the items right/wrong; the range is maximized when px=py=.50; otherwise, it cannot reach -1 or 1 c) NUMERICALLY equivalent to the Pearson correlation. 2) Tetrachoric correlation a) Similar to the phi correlation b) Corrects for the artificial dichotomy of both dichotomous item scores c) If the correction works, values range from -1 to 1, although it is possible to obtain values beyond (-1, 1) d) Useful in factor analysis of items (but sensitive to guessing) e) Difficult to calculate
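A minimal sketch contrasting the point-biserial and biserial correlations on simulated data (one continuous variable and one artificially dichotomized item); the biserial correction shown here uses the standard normal-ordinate formula:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 2000
ability = rng.normal(size=n)                                   # continuous variable
item = (ability + rng.normal(size=n) > 0.5).astype(float)      # artificial dichotomy

p = item.mean()
q = 1 - p
m1, m0 = ability[item == 1].mean(), ability[item == 0].mean()
s = ability.std()

r_pb = (m1 - m0) / s * np.sqrt(p * q)                 # point-biserial (= Pearson r)
h = stats.norm.pdf(stats.norm.ppf(p))                 # normal ordinate at the split point
r_bis = r_pb * np.sqrt(p * q) / h                     # biserial correction

print(round(r_pb, 2), round(np.corrcoef(ability, item)[0, 1], 2), round(r_bis, 2))
# The point-biserial equals the ordinary Pearson r; the biserial is larger in
# magnitude, estimating the correlation with the continuous variable underlying
# the dichotomy.
```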

METHODS - Regression Coefficients in Multiple Regression Previous comps question: "As part of a validation study, you find that a social skills test has a strong and significant correlation with job performance. However, when entered in a multiple regression along with other predictors (cognitive ability, personality traits, interview ratings), the regression coefficient for the social skills test is small and non-significant. Explain in detail how this can happen. What does this result tell you about the predictive validity of the social skills test?"

Multiple regression is employed to account for (predict) the variance in an interval dependent variable, based on linear combinations of interval, dichotomous, or dummy IVs. Multiple regression can establish that a set of IVs explains a proportion of the variance in a DV at a significant level (through a significance test of R-squared), and can establish the relative predictive importance of the IVs (by comparing beta weights). The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, representing the amount the DV y changes when the corresponding IV changes 1 unit. The c is the constant, where the regression line intercepts the y axis, representing the amount the DV y will be when all the IVs are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the IVs. Associated with multiple regression is R-squared, the squared multiple correlation, which is the percent of variance in the DV explained collectively by all of the IVs. The regression coefficient, b, is the average amount the DV increases when the IV increases one unit and the other IVs are held constant. Put another way, the b coefficient is the slope of the regression line: the larger the b, the steeper the slope, and the more the DV changes for each unit change in the IV. The b coefficient is the unstandardized simple regression coefficient for the case of one IV. When there are two or more IVs, the b coefficient is a partial regression coefficient, though it is common simply to call it a "regression coefficient" also. The beta weights are the regression (b) coefficients for standardized data. Beta is the average amount the DV increases when the IV increases one SD and the other IVs are held constant. If an IV has a beta weight of .5, this means that when the other IVs are held constant, the DV will increase by half a SD (.5 also). The ratio of the beta weights is the ratio of the estimated unique predictive importance of the IVs. Note that the betas will change if variables or interaction terms are added or deleted from the equation. Reordering the variables without adding or deleting will not affect the beta weights. That is, the beta weights help assess the unique importance of the IVs relative to the given model embodied in the regression equation. Note that adding or subtracting variables from the model can cause the b and beta weights to change markedly, possibly leading the researcher to conclude that an IV initially perceived as unimportant is actually an important variable.
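A minimal simulation of the phenomenon described in the question (variable names and effect sizes are invented): the social skills test correlates with performance on its own, but contributes almost nothing uniquely once a correlated predictor is in the equation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
cog = rng.normal(size=n)                               # cognitive ability
social = 0.8 * cog + rng.normal(scale=0.6, size=n)     # social skills test, correlated with cog
perf = 0.7 * cog + rng.normal(size=n)                  # performance driven mostly by cog

print("zero-order r(social, perf):", round(np.corrcoef(social, perf)[0, 1], 2))

X = np.column_stack([np.ones(n), cog, social])
b, *_ = np.linalg.lstsq(X, perf, rcond=None)
print("partial b for social skills:", round(b[2], 2))   # near zero once cog is controlled
# The social skills test still predicts performance on its own; what the
# multiple-regression coefficient reflects is its *unique* contribution
# beyond the correlated predictors.
```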

PSYCHOMETRICS - Explain CTT 1) Explain the basis of CTT.

To understand CTT, first ask what is the purpose of a psychometric theory? 1) To understand and control 2) To make tests better (more reliable and valid; sometimes other things) 3) A psychometrician's main job is to characterize (and hopefully curtail) measurement error ***Reliability and measurement error are opposing ends of the same concept...what makes one test more reliable than another? Imagine we could administer a test, wipe the test-taker's mind, and repeat infinitely. We would not expect the participant to obtain the exact same score on each occasion--particularly if the test-taker's condition (fatigue, health, stress, etc.) and testing conditions (quiet/noisy, early/late, hot/cold, etc.) were randomly sampled. CTT provides a simple model of this process. The main benefits are 1) understanding the process and 2) characterizing and controlling error BASIS FOR CTT 1) Each person has an observed score (X) - generally, the # of correct answers 2) Each person has a true score (T) for an exam which is the mean of an infinite # of independent administrations (see above). That is, T = E(X), the expected value of X over those administrations 3) The observed score X is the true score T obscured by some random error (E): X=T+E [Sketch in the original notes: a frequency distribution of observed scores X, centered at T] **Note that the true score is not an attribute of the person; it is an interaction of the person and the test. -A person's true score is only defined with respect to a specific test -A person has different true scores on each test, even on tests that measure the same trait -Two people with differing, extreme trait standings could have the same true score (but this is a weakness of the test) -Change even a single item and the entire test is different (along with different true and observed scores for the test-takers) ***It is critical that all errors be independent (E does not correlate with anything) --> this is an assumption of CTT
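A minimal simulation of the thought experiment above for a single examinee (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(8)
true_score = 32.0                       # T: the mean over infinite administrations
error_sd = 3.0                          # SD of E, i.e. the SEM for this test

observed = true_score + rng.normal(scale=error_sd, size=10_000)   # X = T + E, repeated
print("mean of X:", round(observed.mean(), 2))          # converges on T
print("SD of X (the SEM here):", round(observed.std(), 2))
# Across many examinees, var(X) = var(T) + var(E) because E is independent of T.
```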

