Research Stats 2: Exam 1


Correlation family (3 types)

Pearson correlation (r) - association between a quantitative variable and a quantitative variable
Point-biserial correlation (rpb) - association between a binary variable and a quantitative variable
Phi correlation - association between 2 binary variables
Two characteristics for each correlation type: 1) direction: positive, zero, or negative; 2) strength (effect size): |.00| to |1.00|
- Phi correlation has the same p-value as a 2 x 2 chi-square test - the effect size for a 2 x 2 chi-square is the phi correlation - e.g., Pearson correlation = .86, p < .05.
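A minimal sketch (not from the lecture; data and variable names invented) of how these three correlations could be computed in Python with scipy. Note that the phi coefficient for two 0/1-coded variables is numerically a Pearson r computed on those codes:

    # Sketch: the three members of the correlation family, with made-up data.
    import numpy as np
    from scipy import stats

    quant_x = np.array([2.0, 4.5, 3.1, 5.0, 4.2, 1.8])   # quantitative
    quant_y = np.array([1.9, 4.7, 2.8, 5.2, 4.0, 2.1])   # quantitative
    binary_a = np.array([0, 1, 0, 1, 1, 0])              # binary (e.g., group)
    binary_b = np.array([0, 1, 1, 1, 0, 0])              # binary

    r, p = stats.pearsonr(quant_x, quant_y)              # Pearson: quantitative with quantitative
    rpb, p_rpb = stats.pointbiserialr(binary_a, quant_y) # Point-biserial: binary with quantitative
    phi, p_phi = stats.pearsonr(binary_a, binary_b)      # Phi: binary with binary (Pearson on 0/1 codes)

    print(f"Pearson r = {r:.2f}, rpb = {rpb:.2f}, phi = {phi:.2f}")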

Thompson Article

- authors should always report one or more of the indices of "practical" or "clinical" significance, or both
Stat sig - estimates the probability of sample results deviating as much or more than the actual sample results do from those specified by the null hypothesis for the population, given the sample size
- does not evaluate the probability that sample results describe the population - tests assume the null exactly describes the population and then test the sample's probability
- a stat sig test does not tell us what we want to know - tests evaluate whether my results were due to chance - does not evaluate whether results are important
- ps cannot be blithely used to infer the value of research results
Practical significance - size of the difference between A and B and the error associated with our estimate; knowing A is greater than B is not enough
- evaluating the practical noteworthiness of results
- effect sizes particularly important because stat tests are so heavily influenced by sample sizes
Clinical significance - effect sizes indicate practical sig - whether the intervention makes a real difference in everyday life to clients - helps people cope with symptoms or improve quality of life
- stat sig not sufficiently useful to be invoked as the sole criterion for evaluating the noteworthiness of counseling research - by how much is therapy A better?
- large practical effects do not assure clinically sig effects, but large effects are more likely to be clinically sig than small ones
- two major classes of effect sizes: standardized differences and variance-accounted-for indices
- a mean diff is not a suitable index of intervention effect - the meaning of a 1.0 diff depends entirely on the scale of measurement - researchers presenting means should always provide the standard deviations of the scores
- the scaling problem can be addressed in interpreting mean differences by standardizing the difference
- Cohen's d invokes a standard deviation estimate that is pooled or averaged across both the intervention and the control groups - both groups provide info about score scaling, and a more stable estimate is achieved by using the larger sample derived from both groups
- a standardized difference of about .5 is medium; .2 and .8 are small and large
- a variance-accounted-for effect tells the researcher what percentage of the variability in individual differences of the participants on the outcome variable can be explained or predicted with knowledge of the scores on the predictor variables
- the effect (eta) tells researchers what percentage of variability in individual differences of participants on the outcome variable can be explained or predicted with knowledge of the group or cell membership of participants
- can correct effect sizes for these influences - when corrections are invoked, the shrunken estimates will always be equal to or less than the original uncorrected estimates
- The difficulty is that all GLM analyses (e.g., ANOVA, regression, descriptive discriminant analysis) capitalize on all the variances in our data, including variance that is unique to a particular given sample. This capitalization results in an inflated variance-accounted-for effect size that is positively biased (i.e., overestimates the true population effect or the effect in future samples).
- First, as would be expected, studies with smaller sample sizes involve more sampling error. Second, studies involving more measured variables have more sampling error; this is because there are more opportunities to create sample "flukiness" as we measure more variables. Third, there is more sampling error variance in studies conducted when the population effect size is smaller.
- Because we do not actually know the true population variance-accounted-for effect size, we typically use the actual sample value (e.g., η2, R2) as the estimated population effect in our corrections.
- Conversion of effects into each other's metrics. As noted previously, standardized differences are in an unsquared standardized metric, whereas variance-accounted-for relationship effect sizes are in a squared metric. These metric differences can be surmounted to convert these effects into each other's metrics.
- Proposed "corrected" standardized difference. The facts (a) that variance-accounted-for effect sizes can be computed in all parametric analyses, given the general linear model, and (b) that effects can be converted into squared or unsquared metrics, suggest the possibility proposed here of computing a "corrected" standardized-difference effect size.
- Bias in the "uncorrected" variance-accounted-for effect size also introduces biases in the standardized differences.
- A "corrected" standardized difference (d*) can be computed by (a) converting a standardized difference (e.g., d) into an r, (b) converting the r into an r2 by squaring r, (c) invoking a sampling error variance correction formula (e.g., Ezekiel, 1930) to estimate the "corrected" effect r2*, (d) converting this corrected r2* back into r*, and then (e) converting the "corrected" r* back into d*.
- This standardized difference has "shrunken" (from +.50 to +.42) once the sampling error influence has been removed from the original effect size estimate. The shrunken value is more conservative, but most importantly it is more likely to be accurate and replicable.
- The reporting of confidence intervals: indeed, the new APA (2001) Publication Manual noted that confidence intervals "are, in general, the best reporting strategy."
- This is because sample size directly affects p values, and thus "virtually any study can be made to show significant results if one uses enough subjects" (Hays, 1981, p. 293). The problem is that when different studies involve different sample sizes, p values will differ in each study, even if every study had exactly the same effect size.
- The calculated p values in a given study are a function of several study features, but are particularly influenced by the confounded, joint influence of study sample size and study effect sizes.
- What we seek in evaluating clinical interventions are indices characterizing (a) the typical effect and (b) the range of clinical effects across studies. Calculated p values are not sufficient for this purpose. Effect sizes, on the other hand, are useful quantifications of intervention impacts in a single study. And effect sizes are particularly valuable when we (a) formulate anticipated study effects prior to the intervention by consulting effects from previous related studies and (b) interpret actual study effects once the study has been conducted in the context of prior effects.
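A hedged worked sketch of the d-to-d* conversion chain described above (steps a through e). The conversion formulas assume two equal-sized groups, and the total N of 58 is an assumption chosen only for illustration; with it, a d of +.50 shrinks to roughly +.42, the same magnitude of shrinkage noted above:

    # Sketch of the d -> r -> r^2 -> corrected r^2 -> r* -> d* chain; N is invented.
    import math

    def corrected_d(d: float, n_total: int, k: int = 1) -> float:
        """Apply the Ezekiel (1930) shrinkage correction to a standardized difference."""
        r2 = d**2 / (d**2 + 4)                                      # (a)+(b) d -> r -> r^2 (equal n)
        r2_star = 1 - (1 - r2) * (n_total - 1) / (n_total - k - 1)  # (c) Ezekiel correction
        r2_star = max(r2_star, 0.0)                                 # guard against negative estimates
        r_star = math.sqrt(r2_star)                                 # (d) back to r*
        return 2 * r_star / math.sqrt(1 - r_star**2)                # (e) back to d*

    print(round(corrected_d(0.50, n_total=58), 2))  # ~0.42: the "shrunken", more replicable estimate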

CBL Chapter 15 - Questionnaire

- A questionnaire involves a single item to assess each construct, and typically is brief in length because participants are unwilling, unable, or unlikely to take part in a longer assessment.
- A rating scale involves multiple items to assess each construct and typically is used by researchers with access to participants more willing to take part in a longer assessment.
- From earlier chapters, remember that the benefit of assessing a target with multiple items reflects our acknowledgement that a single item is insufficient to capture or triangulate a construct meaningfully.
- The first rule is to determine what you want to know. The more direct the phrasing of the question, the more likely is its true meaning understood. Complicated questions are more likely to be misunderstood or misread and less likely to provide the sought-for information.
- A second rule of thumb is to use short, simple sentences containing a single grammatical clause, if possible and if needed, when developing queries. This rule helps avoid the use of a double-barreled question—a single item that asks more than one question at once.
- A distinct advantage of questionnaires over scales is the capacity to use items with an open-ended response format. An open-ended item is one that poses a question but does not constrain the answer - it does not force respondents to choose among a limited set of response options.
- In circumstances like these, question order may become crucial. In questionnaire development, the analogue of the rapport-building process requires that the least threatening items be presented first. Only after the respondent has become comfortable with the research, and somewhat committed to it by virtue of answering a number of questions, should more personal or threatening questions be presented.
- This is the reason why demographic characteristics, which involve rather sensitive personal information (e.g., income, race/ethnicity), should be posed near the end of a questionnaire. Sometimes the ordering of items can keep a respondent in a study, and this is not a trivial concern.
- Another issue related to question order has to do with the possibility that one's earlier answers may affect later ones. To combat it, some researchers recommend that questionnaire developers who fear the problem of influence from an early item counterbalance or randomize the order of questionable items across participants.
- A particularly difficult issue that affects the likelihood that a respondent will complete all items of a questionnaire is the inclusion or non-inclusion of a "no opinion" (or "don't know") option.
- Some suggest that including a "no-response" option may encourage respondents to avoid the cognitive work involved in answering survey items, and thus discourage its use. In general, then, the advantages of allowing a middle option seem to be outweighed by the negative possibilities. In cases in which a good understanding of participants' feelings about a given issue is particularly critical, it is advisable to provide measures that allow for a clear interpretation of the meaning of the nonresponse option, if it is used.

Video - sickle cell

- if inherit 1 copy of gene, unaffected by disease, protected from malaria
- H0: no assoc btw. malaria and HbS
- H1: assoc between malaria and HbS

CBL Chapter 3 - Internal Consistency

- Internal consistency is concerned with the extent to which the components (e.g., individual items, observations) of a measuring instrument are interrelated. The idea of internal consistency is usually applied to a measure—such as an ability test or attitude scale—that consists of a set of individual items. It is assumed that all the items of the scale measure the same underlying construct.
- The same logic is applied when the "measuring instruments" are human observers, or judges. In this case, the question is, "Have the judges seen the same thing (as inferred from their giving more or less identical scores to the observations)?" The answer to the question is assessed by the extent to which the observers' observations overlap or correlate.
- If the items that purportedly constitute a scale assess a variety of different constructs (i.e., if the scale is multidimensional), then there is little to justify their being combined as a representation of a single construct.
- As Nunnally (1967, p. 251) has observed, "a test should 'hang together' in the sense that the items all correlate with one another. Otherwise, it makes little sense to add scores over items and speak of total scores as measuring any attribute." To justify the combination of items in deriving an individual's overall score on such a test, the internal consistency of the item set must be established.
- One of the earliest techniques to assess the internal consistency of a scale is known as split-half reliability. Split-half reliability is assessed by randomly dividing a scale into two sets containing an equal number of items, both administered to the same respondents, with a test of relatedness calculated between these two summed scores.
- If there is a high degree of interrelatedness among items, then the relation between total scores from the two halves of the scale should be strong, thus indicating that the items are focused on the same underlying attitude or aptitude. If the two halves of the measure do not "hang together," this suggests that the scale items might not all be measuring the same underlying construct.
- Cronbach's alpha is an index of the hypothetical value that would be obtained if all of the items that could constitute a given scale were available and randomly put together into a very large number of tests of equal size (Cronbach, 1951). The average correlation between all possible pairs of these "split-half" tests is approximated by coefficient alpha.
- Cronbach's alpha can range from .00 to 1.00, with the degree of internal consistency usually considered acceptable if this coefficient is .75 or better, though the actual value depends on the extent of error that the investigator is willing to tolerate.
- From the internal consistency computational formula presented, we can infer that the number of items in a scale plays an important role in the scale's (internal consistency) reliability, as do the interrelationships that obtain among the items. If the items are highly interrelated, alpha will be high.
- The more items, the greater the scale's coefficient alpha will be. Thus, one simple tactic of enhancing alpha is to "lengthen" the scale—that is, to add items to it. Adding a good item to a 5-item scale will have a much greater effect on internal consistency than adding a good item to a 15-item scale. If the average correlation between items is reasonable (say, greater than .25), adding an item to a scale already containing 9 or 10 items will have relatively little effect on coefficient alpha. Furthermore, the higher the existing inter-item correlations, the less the effect of added items.
- If a specific item is measuring something very different from that of the others in the item set, its relation with the total score will be weak. This information will alert the scale developer that this particular item can be deleted and substituted with one that (hopefully) better represents the concept under investigation.
- Ideally, the items of a scale should share a common focus—but they should be entirely different in all other aspects that are irrelevant to this focus. Such heterogeneity of item content will produce some inconsistency of response (and hence, lower alpha), but as long as many such items are used, all focused on one common construct (though, perhaps, with numerous non-common determinants as well), the total set of items will provide a better measure of the central attitude than any single item.
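A minimal Python sketch of coefficient alpha computed from a respondents-by-items score matrix, using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score). The score matrix below is invented for illustration:

    # Sketch: Cronbach's alpha from a respondents x items matrix (data made up).
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """items: rows = respondents, columns = items, all scored in the same direction."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    scores = np.array([[4, 5, 4, 3],
                       [2, 2, 3, 2],
                       [5, 4, 5, 4],
                       [3, 3, 2, 3],
                       [1, 2, 1, 2]])
    print(round(cronbach_alpha(scores), 2))   # high interrelatedness -> alpha near 1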

3 Types of reliability: 3

3) Interrater Reliability - degree that different coders (judges/researchers) are consistent (agree with each other) in rating a measure
- Cohen's kappa (for a categorical variable rated by 2 coders) - range: -1.00 (perfect disagreement) to 0 (chance agreement) to 1.00 (perfect agreement)
- aim for high interrater reliability - different coders agree on the observation
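A small sketch (ratings invented) of Cohen's kappa for two coders, using scikit-learn:

    # Sketch: Cohen's kappa for two coders rating the same cases on a categorical variable.
    from sklearn.metrics import cohen_kappa_score

    coder_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
    coder_2 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no"]

    kappa = cohen_kappa_score(coder_1, coder_2)
    print(round(kappa, 2))  # 1.00 = perfect agreement, 0 = chance-level, negative = worse than chance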

SPSS: Recode Values

Purpose: recode values of an original variable to create a new variable with fewer levels for statistical analysis
- why? the researcher might be interested in a new variable of clinical diagnosis classification (satisfies minimum # of symptoms for disorder vs. not) or want to compare alcohol users vs. nonusers

Standardizing (z scoring) variables

compare a person's score vs. the sample mean in standardized units
z score = (X - M) / s, where X = person's raw score, M = mean, s = standard deviation
mean of z scores = 0.00, SD = 1.00
- negative z score: score below the mean
- z score of 0: score at the mean
- check if Pearson r is sig diff from zero
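A tiny sketch (scores invented) showing z = (X - M) / s computed by hand and with scipy, confirming the resulting z scores have mean 0 and SD 1:

    # Sketch: standardizing (z scoring) a variable.
    import numpy as np
    from scipy import stats

    scores = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
    z_manual = (scores - scores.mean()) / scores.std(ddof=1)   # z = (X - M) / s with the sample SD
    z_scipy = stats.zscore(scores, ddof=1)                     # same result via scipy

    print(np.round(z_manual, 2))
    print(f"mean = {z_manual.mean():.2f}, SD = {z_manual.std(ddof=1):.2f}")   # 0.00 and 1.00
    print(np.allclose(z_manual, z_scipy))                                     # True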

CBL Chapter 4 - Types of validity

- Measurement construct validity is concerned with the validity of the measure with respect to the theoretical construct of interest.
- Face validity is based on superficial impressions regarding the extent that a measure appears to capture a construct.
- Content validity is the extent that a measure adequately represents (or samples) the complete range or breadth of the construct under consideration.
- With factual materials (i.e., when developing tests of knowledge or ability), constructing scales with adequate content validity is not overly difficult. The domain of interest is relatively well specified, and a representative sample of items can be drawn from this pool of potential questions. When the researcher is dealing with other psychological or social variables, however, the situation typically is not so clear-cut.
- The first approach is to perform a thorough literature review of the topic to acquire insights about the appropriate types of items or factors that would ensure representative coverage of the construct.
- A second means of promoting content validity is through the use of experts, perhaps assembled in focus groups.
- The most empirical way to evaluate content validity is to use factor analysis to assess the adequacy with which various features of the construct are represented in the measure as a whole (Bryant, 2000). Doing so will reveal not only the number of factors or content areas emerging from the scale, but also the number of items that tap each factor.
- Criterion validity is concerned with the extent that a measure is related to, or explains, a target outcome or criterion, usually a behavior.
- Concurrent validity is the extent that a measure is related to, or explains, a relevant criterion behavior, with both variables assessed at the same occasion. The problem with concurrent validity is that it is impossible to unravel the temporal precedence of events.
- Predictive validity is the extent that a measure is related to, or explains, a relevant criterion behavior assessed at a subsequent occasion. It is considered the more desirable and rigorous of the two forms of criterion validity.
- Predictive validity is of major concern when the purpose of a measure is to anticipate either the likelihood, or the extremity, of some behavior or outcome of interest at a later point in time. The behavior itself serves as the criterion, and the strength with which the scale predicts the level of the criterion is taken as an indication of the scale's predictive validity.
- First, as other variables may influence the magnitude of a relationship, predictive validity in and of itself is not sufficient to confirm or refute the validity of a measure.
- Whereas a strong predictive relationship is certainly encouraging, it does not explain why such a relationship occurred. In the absence of theory relating the measure to children's performance, the validation process is much less informative, and much less useful, than it needs to be.
- A third, more practical limitation on the usefulness of prediction in establishing the validity of a scale has to do with the relative absence of useful criteria in social science.
- For reasons like these, predictive validation approaches are most widely used when dealing with scales of fact—that is, for issues on which there are consensually agreed-upon answers.
- Testing whether this is the case is termed convergent validity, or the extent that measures of constructs that are theoretically related are actually related.
- Discriminant validity is the extent that measures of constructs that are theoretically unrelated are actually unrelated.
- In general, construct validation is an approach whose aim is to establish the reality of a psychological concept—it is a test of whether or not the hypothesized construction plausibly exists. If it does, then it should enter into predictable patterns of relationships with other constructs.
- Operations that assess the relationships between a new measure and other established measures with which the new test was thought to relate were termed convergent validation techniques by Campbell and Fiske (1959), because in essence the measures converge upon, or define, a hypothesized network of interconnected traits, processes, dispositions, and/or behaviors. A successful convergent validation operation not only suggests that the critical scale is an adequate measure of the construct in question, but also bolsters the theoretical position that was used to develop the hypothesized interrelationships that formed the basis of the validation process.

Advanced Reliability Analysis: Confirmatory Factor Analysis (CFA)

CFA: to determine the factor structure of a measure (must have hypotheses about the number of factors, which items should correlate with which factors, and whether the factors are correlated)
- relevance to reliability analysis: assess the reliability of items (how strongly items "hang" together) in each factor/subscale of the measure - how well each item hangs together with the other items
- e.g., a loading of .31 suggests the item does not hang together well with the other items - reflects the unreliability of that item (not hanging together with the other items)

Effect Sizes for Social Sciences

Cohen's d - small = 0.20 - medium = 0.50 - large = 0.80 Eta-squared - small = .01 - medium = .06 - large = .14 Correlation (r) - small = .10 - medium = .30 - large = .50 - interpret absolute value of effect size regardless of sign

Lac - Criterion Validity

Criterion validity is the degree that the scale is correlated with or is able to explain variance on a pertinent outcome. Concurrent validity is evidenced if the scale is statistically related to an outcome, with the caveat that both factors are administered cross-sectionally (at the same time). This is considered the weaker form of criterion validity because the temporal direction of processes or events is ambiguous. Predictive validity, in contrast, is tested by determining whether the scale assessed in an earlier round is statistically related to a behavioral outcome at a later round. Thus, a longitudinal design is obligatory to evaluate predictive validity.

Warner Chapter 10 - Correlation and causal inference

- A statistical relationship between X and Y by itself does not imply causation
- Evidence that X and Y co-occur or are statistically related is a necessary condition for any claim that X might cause or influence Y
- Statistical association is a necessary but not sufficient condition for causal inference

CBL Chapter 3 - Item Response Theory

- Item response theory (IRT) is a technique to determine how each item operates in terms of difficulty, discrimination, and guessing with respect to the overall scale.
- The basic logic of IRT is to estimate how item responses in a test are related to performance on the test as a whole. The relationship between the probability of item endorsement and performance or skill on its latent factor is evaluated.
- This is represented by an item characteristic curve or item response function, an S-shaped probabilistic curve that best approximates a model of the data.
- A response probability, ranging from 0.00 to 1.00, reflects the likelihood that an item will be correctly answered (endorsed), given any person's location on the latent factor. Probabilistic language is used in IRT, as there is no guarantee that a particular person will respond a predetermined way, but across participants, an item characteristic curve estimates the probability of giving an affirmative response depending on where participants are situated on the underlying attribute.
- An item with a .25 probability suggests that 25% of participants at that point on the latent factor endorsed or correctly answered this particular item.
- Unlike factor analysis, IRT is inappropriate for analyzing a multidimensional scale. Only a single ability factor may be modeled in item response theory.
- The assumption of local independence requires that all items be statistically independent.
- The relation between the ability factor and the probability of an item response is described using up to three parameters or characteristics.
- The difficulty parameter is the first property of an item characteristic curve, and provides information on how likely people are to endorse or correctly answer an item relative to the other items. An item response analysis containing the difficulty parameter only is known as a Rasch model. An item deemed to be easy, or one that has a high likelihood of a correct or positive response, is represented by an item characteristic curve situated in a location more to the left of the latent continuum. On either extreme of the latent factor, we find individuals who scored the lowest or highest on the total test, where each item has near 0 or 1.00 probability of being endorsed.
- The discrimination parameter is the second property of the item characteristic curve and offers information on how well an item successfully separates people of a particular ability on the construct. Discrimination is gauged by inspecting the slope of the item characteristic curve. An item possessing a steep slope that points more vertically is more effective in distinguishing participants' performance on the factor. An item possessing a slope resembling a horizontal line, however, is uninformative in discerning performance levels on the construct.
- The guessing parameter is the third and final property of an item characteristic curve, and offers information about the likelihood that an item is vulnerable to being endorsed or answered correctly. Guessing may be understood as the probability of endorsing a particular item for people who had the lowest total performance on the scale.
- A practical advantage of item response analysis is that test developers are able to estimate the ability level of participants who have completed only a few items selected from the entire inventory (Reid et al., 2007). This time-saving application is found in computerized adaptive testing, whereby new test takers are administered only a small subset from a large pool of questions.
- A person deemed to possess low ability, due to poor performance early on the test, will continue to receive "easy" questions determined in previous test administrations to have a high rate of being answered correctly by most people. A person showing high ability early on will continue receiving more "difficult" questions that have a low probability of being answered correctly by most people.
- Because adaptive testing does not require the administration of the entire pool of items, it is an efficient approach for isolating and narrowing any person's performance during the exam. A person's overall score is calculated by locating or linking where he or she falls on the ability continuum in comparison to others tested, whose scores were used to establish the response curve norms. Conversely, just from knowing a person's theta level, it is possible to make inferences about the probability of correctly answering an item a respondent never received.
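A short sketch of a three-parameter logistic item characteristic curve, P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))), with discrimination a, difficulty b, and guessing c. The parameter values below are invented to contrast an easy, highly discriminating item with a harder item that has a guessing floor:

    # Sketch: 3-parameter logistic item characteristic curve (parameters invented).
    import math

    def icc(theta: float, a: float, b: float, c: float = 0.0) -> float:
        """Probability of endorsing/answering the item correctly at ability level theta."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    for theta in (-2, -1, 0, 1, 2):
        easy = icc(theta, a=2.0, b=-1.0)          # curve shifted left (easier), steep slope
        hard = icc(theta, a=1.0, b=1.5, c=0.20)   # curve shifted right, 20% guessing floor
        print(f"theta={theta:+}: P(easy)={easy:.2f}  P(hard)={hard:.2f}")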

CBL Chapter 15 - Rating Scale - Guttman and Likert

- The hallmark of Guttman's method is that it presents participants with items of increasing extremity with regard to the issue under investigation. If the scale is of high quality, an individual who endorses an item at a given level of extremity (or favorability) also is expected to endorse all less extreme items. Under ideal conditions, knowledge of a participant's total score would enable the investigator to reproduce exactly the individual's pattern of responses.
- Likert's model proves not only more efficient in terms of time and resource expenditure, but also more effective in developing scales of high reliability.
- For each item, in Likert's method respondents indicate the degree or extent of agreement or disagreement to each item using a "multiple-choice" format. On each item, respondents pick one of (usually) five options indicating the extent to which they agree with the position espoused in the item. Response options commonly presented are "strongly agree," "agree," "neutral or undecided," "disagree," and "strongly disagree."
- The summation or averaging process across items is used to calculate a composite scale score, which is an implicit recognition that any single item is at best a fallible representative of the underlying construct or attitude it is intended to represent. By combining a participant's responses over many such items, however, we hope to minimize the "noise" or measurement error that the imperfections of each item contribute to the overall score (especially if the items have different sources of measurement error), thereby arriving at a more internally reliable measure of the construct.
- We assume that all of the "operations" (i.e., items) will miss the mark to some extent (that is, no single item will perfectly capture the construct it is intended to represent), but we attempt to design items so that they miss the mark in different ways; thus, the resulting scale (i.e., the total or averaged score across all items) should provide a more sure identification of the construct of interest than any single item.
- Then the complete matrix of intercorrelations between all pairs of items and between each item and the total score is calculated. Statistical software is used to compute a correlation matrix, which provides auxiliary information to allow for the calculation of coefficient alpha as an index of internal consistency (Chapter 3). Higher correlations generally yield a higher Cronbach's alpha. However, it is important to realize that Cronbach's alpha produces a higher coefficient as the correlation across items becomes stronger and as the number of items increases. Thus, using a large number of items in this initial scale construction phase will result in a strong alpha coefficient if the choice of items was at all reasonable.
- The investigator's primary job at this point is to retain the items that form the best scale, and to discard items that correlate poorly with the rest of the scale. Coefficient alpha, an estimate of the internal consistency of the entire set of items, is not useful in an item-by-item analysis of this type. Of more practical utility is the investigation of each item's correlation with the total score, known as the item-total correlation.
- After having decided on the best items and discarded the worst ones, it is necessary to recalculate the item-total correlation using the "reduced set" of items, because the total composite score changes every time an item is discarded.
- An alpha coefficient of .75 or higher suggests that the scale is reasonably internally consistent. However, some "shrinkage" in the reliability coefficient must be expected when the item set is readministered to another group, because this scale construction process capitalizes on sample-specific variations (i.e., error). - If coefficient alpha is weak (for example, if it falls short of an arbitrary value of .70), the internal consistency can be improved by the addition of more items that correlate positively with the original set and with the total score.
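A minimal sketch (scores invented) of corrected item-total correlations, where each item is correlated with the total of the remaining items so it does not inflate its own correlation:

    # Sketch: corrected item-total correlations for a small item set.
    import numpy as np

    items = np.array([[4, 5, 4, 3],
                      [2, 2, 3, 2],
                      [5, 4, 5, 4],
                      [3, 3, 2, 3],
                      [1, 2, 1, 2]], dtype=float)

    for j in range(items.shape[1]):
        rest_total = items.sum(axis=1) - items[:, j]          # total score without item j
        r = np.corrcoef(items[:, j], rest_total)[0, 1]
        print(f"item {j + 1}: corrected item-total r = {r:.2f}")   # weak r -> candidate for deletion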

5 Types of Measurement Validity: 4

4) Convergent validity - extent measures of different constructs that are theoretically related are statistically related
- your new measure is correlated with established measures - demonstrate your measure is correlated with similar other measures
- e.g., sensation seeking scale with measures of openness to experience, impulsivity, hyperactivity
- aim: moderate (but not high) correlation - a very high correlation renders the new measure obsolete (calling something by a different name, but it is the same measure)

Confidence Intervals (CI) Application 2

Application 2. Determine if a sample mean is sig. different (p < .05) from a mean score of exactly zero
- Not significant: CI of the mean overlaps with 0 - M = 4.00 [95% CI: -2.00 to 10.00]
- p < .05: CI of the mean does not overlap with 0 - M = 4.00 [95% CI: 2.00 to 6.00]
Example: the mean number of times of marijuana use (M = 4.00) in the sample for School B was significantly different from a mean of zero times of marijuana use [95% CI: 2.00 to 6.00]
100% probability minus 95% CI probability = 5% - equivalent to significance testing at p < .05
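A small sketch (data invented) of Application 2: compute the 95% CI for a sample mean with the t distribution and check whether the interval excludes zero:

    # Sketch: 95% CI for a sample mean; if the interval excludes 0, p < .05 (two-tailed).
    import numpy as np
    from scipy import stats

    use_counts = np.array([3, 5, 4, 2, 6, 4])     # e.g., times of marijuana use
    m = use_counts.mean()
    sem = use_counts.std(ddof=1) / np.sqrt(len(use_counts))
    t_crit = stats.t.ppf(0.975, df=len(use_counts) - 1)

    lower, upper = m - t_crit * sem, m + t_crit * sem
    print(f"M = {m:.2f} [95% CI: {lower:.2f} to {upper:.2f}]")
    print("significantly different from 0 (p < .05):", not (lower <= 0 <= upper))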

Lac - Convergent Validity

Convergent validity concerns the degree the scale is statistically associated with preexisting scales that are theoretically related. The objective when testing convergent validity is to establish moderate positive statistical associations with other similar, yet conceptually distinct, scales that have been previously validated in the literature. Discriminant validity, in contrast, concerns the extent that the scale is uncorrelated with preexisting scales that are theoretically unrelated. If statistical analyses reveal extremely high or near-perfect correlations between the new scale and preexisting scales, this indicates that the emerging scale would be a redundant measurement instrument if introduced into the current literature.

4 Components of Statistical Power

Whether any statistical analysis is stat. sig (or not) is determined by 4 components (SSSE):
1) Statistical power - range .00 to 1.00 - e.g., .80: this analysis will have an 80% likelihood of attaining statistical sig. - determined by a, N, and ES - greater power --> more likely analysis will be stat sig.
2) Significance criterion (a) - critical rejection cut-off used to judge whether an analysis is statistically significant beyond chance - a = .05 (more likely sig) - a = .01 - a = .001 (less likely sig) - larger a critical region --> more likely analysis will be stat. sig
3) Sample size (N) - total # of participants - larger N --> standard/sampling error decreases --> more likely analysis will be stat sig.
4) Effect Size (ES) - practical importance (magnitude) of a mean difference or relation - e.g., Cohen's d, eta-squared, correlation - larger ES --> more likely analysis will be stat. sig.
Statistical power is a function of a, N, and ES - if an analysis is stat sig, consider the likelihood it is actually sig - Type I error: result should not have been sig - due to chance, flaws in design, confounds - Type II error: finding nonsig results when the results should have been sig
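A hedged sketch using statsmodels' power routines for an independent-samples t test (numbers invented): fixing any three of power, alpha, N, and effect size determines the fourth:

    # Sketch: the four SSSE components via statsmodels power analysis.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Power achieved with n = 64 per group, alpha = .05, medium effect (d = .50)
    power = analysis.solve_power(effect_size=0.50, nobs1=64, alpha=0.05)
    print(f"power = {power:.2f}")            # roughly .80

    # Per-group N needed for 80% power at the same alpha and effect size
    n_needed = analysis.solve_power(effect_size=0.50, power=0.80, alpha=0.05)
    print(f"n per group = {n_needed:.1f}")   # roughly 64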

Cohen's d

a measure of effect size that assesses the difference between two means in terms of standard deviation, not standard error
- SD pooled: aggregated SD across the two groups - if the 2 groups have exactly the same SD, the pooled SD is the average of the 2 - if different, apply the formula
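A small sketch (group statistics invented) of Cohen's d with the pooled SD, SD_pooled = sqrt(((n1 - 1)s1^2 + (n2 - 1)s2^2) / (n1 + n2 - 2)):

    # Sketch: Cohen's d from group means, SDs, and ns.
    import math

    def cohens_d(m1, s1, n1, m2, s2, n2):
        sd_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
        return (m1 - m2) / sd_pooled

    # If the two groups have exactly the same SD, the pooled SD equals that SD;
    # otherwise the formula weights each group's variance by its degrees of freedom.
    print(round(cohens_d(m1=24.0, s1=5.0, n1=30, m2=21.0, s2=7.0, n2=30), 2))   # ~0.49, a medium effect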

Warner Chapter 10 - Pearson's r and r^2

- Indexes of effect size - standardized, independent of sample size N
- r^2 is called the coefficient of determination - estimates the proportion of variance in Y that can be predicted from X
- Each circle represents the total variance of one variable; the area of overlap between the circles is proportional to r^2, the shared or predicted variance
- The remaining area of each circle corresponds to 1 - r^2, the proportion of variance in Y not predictable from X
- r of about .10 or less (r^2 ≈ .01) is small, r of about .30 (r^2 ≈ .09) is medium, r greater than .50 (r^2 > .25) is large
- r = .30 is common in research areas such as personality and social psych
- error variance - variance that cannot be predicted from X - error: the collective influence of several kinds of factors, including other predictor or causal variables not included in the study, and randomness
- the r^2 obtained in a study depends on many features unique to the data set and on the kinds of cases included in the study
- spurious - false, fake, not what it looks like or pretends to be - occurs because of chance or coincidence, or because a third (confounding) variable is involved

Warner Chapter 10 - Effect of extreme bivariate outliers

- Unusual combo of values of X and Y
- Isolated data point that lies outside the cloud of the scatterplot
- Presence of a bivariate outlier can inflate the value of the correlation
- Decide what to do with outliers before computing correlations

Sample Size and Effect Size

- sample size and ES: independent of each other - a large sample size or a large ES (or both) could be responsible for a sig. p-value
- haystack metaphor - sample size = # of people searching for the needle; the more people, the more quickly the needle is found - effect size = size of the needle; a larger needle is easier to find

5 Types of Measurement Validity: 5

5) Discriminant validity - extent measures of different constructs that are theoretically unrelated are statistically unrelated - your new measure is not correlated with established measure - e.g., sensation seeking with measure of self-esteem - aim: zero to low correlation - greater number of the types of measurement validities satisfied, stronger evidence for measurement (construct) validity of new measure

Measurement validity

Measurement validity (aka construct validity) - degree of relation, or overlap, between a measure (or item, questionnaire, scale, device, etc.) and construct - measure "correct" (in assessing construct)? - extent that a measure is assessing accurately the underlying construct - assessing construct correctly

Reverse scoring

purpose: for items phrased in negative direction, necessary to reverse score - want higher response values of new variable to represent higher sensation seeking - only use coded/recoded variables in same meaningful direction (so that higher scores represent more of the underlying construct) to compute Cronbach's alpha and the mean composite
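A minimal pandas sketch (column name invented) of reverse scoring a negatively phrased 1-5 item via new = (scale max + scale min) - old:

    # Sketch: reverse scoring a negatively phrased Likert item so higher = more of the construct.
    import pandas as pd

    df = pd.DataFrame({"ss_item3": [1, 2, 5, 4, 3]})   # hypothetical negatively phrased item

    scale_min, scale_max = 1, 5
    df["ss_item3_r"] = (scale_max + scale_min) - df["ss_item3"]   # 1->5, 2->4, ..., 5->1

    print(df)
    # Use only items scored in the same direction (originals plus reversed *_r versions)
    # when computing Cronbach's alpha and the mean composite.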

SPSS: Filter

purpose: select a participant subsample for statistical analysis
- why? the researcher might be interested only in participants who are of a specific gender, race group, age range, or characteristic (e.g., nonuse of alcohol)
- instead of deleting participants (no need to), the original dataset stays intact

Pearson r Standardizing

standardizing (z scoring) - so that a variable's original metric is no longer a concern, z scoring is a way of transforming it into a new common (standardized) metric - the transformed variable is "freed" (released) from its original metric - after z scoring a variable, M = 0.00 and SD = 1.00 - this is the reason r always ranges from -1.00 to +1.00

CBL Chapter 15 - Factor Analysis

- Factor analytic techniques may be divided into two types. Exploratory factor analysis (EFA) is a statistical technique to assess the multidimensional structure of a scale if the researcher does not have hypotheses regarding the number and types of underlying factors that will emerge in the solution.
- Confirmatory factor analysis is a statistical technique to assess the multidimensional structure of a scale if the researcher has hypotheses regarding the number and types of underlying factors that will emerge in the solution. A hypothesis-based approach to factor analysis requires reviewing the relevant prior literature to make informed decisions about how to phrase the items so as to represent each of the dimensions.
- All three scaling models, for example, require relatively major expenditures of time and effort in the scale construction process, and all three techniques require the development of a new set of items each time attitudes toward a new person or object are to be assessed.

Confidence Intervals: Why Important?

- When encountering confidence interval information, ask yourself which of the 3 applications is being applied
- Confidence intervals give us additional information beyond the mean estimate - an interval of certainty/precision/confidence of the sample in estimating the pop
- Narrower confidence intervals suggest a larger sample size
- Election and other sampling polls tend to use Application 1
- Journal articles in medicine and the life sciences tend to report confidence intervals (Applications 2 and 3) instead of p values

Warner Chapter 13 - if scores measure same construct

- When we add together scores on a list of items or questions, we implicitly assume that these scores all measure the same underlying construct and that all questions or items are scored in the same direction. - To evaluate empirically whether these items can reasonably be interpreted as measures of the same underlying latent variable or construct, we usually look for at least moderate correlations among the scores on the items. - most widely reported method of evaluating reliability for summated scales, Cronbach's alpha, is based on the mean of the interitem correlations.

Why use syntax

- personal preference for typing syntax commands (instead of point and click) - probably better use of time to memorize commands for programs without a point-and-click interface - maintain history/record of analyses conducted previously - wise to save syntax for analyses that are highly sensitive to typing errors (e.g., reverse scoring, computing mean composite) - save time in performing many analyses - to batch perform the same statistical analysis many times (on different variables or datasets)

5 Types of Measurement Validity: 3

3) Criterion validity - extent a measure is statistically related to (or explains) a relevant outcome or behavior (criterion)
Concurrent validity - measure is related to a criterion (cross-sectionally) - statistically associated with the outcome of interest - your new measure (T1) is correlated with the outcome (T1) - e.g., sensation seeking scale with alcohol or marijuana use (all assessed at the same time point) - aim: moderate to high correlation
Predictive validity - measure is related to a criterion (longitudinally) - your new measure (T1) is correlated with the outcome (T2) - both assessed at two diff time points - e.g., sensation seeking scale with measures of alcohol and marijuana use one month later - aim: moderate to high correlation (although attenuated by time)

Measurement and Psychometrics

measurement reliability: consistency of a measure - is the measure "consistent" (in assessing a construct)? - repeatable; can the same scores occur when assessing a construct?
measurement validity: degree of relation, or overlap, between a measure and a construct - is the measure "correct" (in assessing a construct)? - to what extent is the measurement reflective of what we think we are measuring?
scale development and construction - methods of creating scales/questionnaires

Measurement reliability vs validity

reliability - consistency; validity - truth - the underlying truth of the construct
- center of the target = high validity; circles grouped together = high reliability
- reliability is a necessary but not sufficient condition for validity - if reliability is low, it is impossible to have high validity - if reliability is high, validity is potentially high - e.g., all circles can be grouped together but not close to the center of the target
- reliability is the upper limit on validity - your validity coefficient will never be higher than your obtained reliability coefficient - the validity value can never be higher than the reliability - if reliability is low, validity will be low - if items are not hanging together, validity is guaranteed to be low

CBL Chapter 12 - Interrater Reliability

- "The fewer the categories, the more precise their definition, and the less inference required in making classifications, the greater will be the reliability of the data." - First, the ratings of two persons observing the same event would be correlated, a measure that would rule out the errors of change in the person and the environment. - Next, the ratings of the same observer watching a similar event at two different times would be compared (this would rule out errors of content sampling.) - Then the agreement of two observers observing an event at two different times would be correlated. This measure . . . would be expected to yield the lowest reliability of the four comparisons. - Finally, the observations of a single observer watching a single event would be compared in a manner similar to odd-even item correlations in a test. This is a check on internal consistency or the extent to which the observer agrees with himself. - The most fundamental question of interrater reliability is, "Do the ratings of two or more observers who have witnessed the same event(s) coincide to an acceptable degree?" By acceptable degree, we mean beyond that which would occur by chance - Cohen's kappa is an index to assess the extent of agreement between two coders of a qualitative categorical variable while controlling for chance (Cohen, 1960, 1968). Kappa's value can range from .00 (no agreement whatsoever) to 1.0 (perfect agreement). - The diagonal entries in the table represent the number of times the coders have overlapped (i.e., agree) on a given category. The non-diagonal elements represent disagreements. The greater the number of entries off the diagonal, the lower the kappa and the lower the interrater reliability of the coding system.

CBL - 27-29

- "sampling error" = "standard error" (S.E. in SPSS statistical output) - Goal of eliminating or reducing plausibility of rival alternative hypotheses - Chance - Nonsystematic variation - variations from individual to individual - from one time to another withing individuals - uncontrolled variability among observations - potentially attributable to operation of chance - Stat analyses - assess possibility chance reasonable alternative explanation of any relational finding - Study's sample always susceptible to sampling error - outcome observed in particular sample by chance not perfectly consistent with what would be found if entire pop used - Uncertainty due to sampling error - partly responsible for use of probabilistic language and triple negatives - Inferential stats - assess likelihood obtained effect occurred by chance given null hypothesis true - Assign probability operation of chance due to sampling error as possible explanation of any relationship discovered - Results of stat inference test tell us probability of Type I error of inference - probability observed effect would have been obtained by chance if null hypothesis was true - Statistical significance = if probability of obtaining observed effect by chance so low render chance explanation implausible - Provide insight into likelihood chance is viable alternative explanation of findings - Type II error of inference - probability of failing to reject null hypothesis if it is false - really is a relationships between predictor and outcome variables - Reducing probability of Type II error - requires design studies with sufficient power to detect effect and beyond random variation - Statistical power = probability of obtaining a statistically significant effect if indeed that effect truly exists - Power of study depends on number of properties in study: number of participants (power increase as number participants increase), reliability of measures used (power increase when reliable measures used), strength (effect size) of relationships, Type I critical value used for statistical testing

CBL Chapter 3 - Common Factor Analysis

- Also known as common factor analysis, exploratory factor analysis is a technique to determine the underlying factor structure if the researcher does not have hypotheses regarding the number of underlying factors and the items that might constitute each factor. The technique is data driven, as the analysis offers a factor solution that optimally represents the underlying interrelations among items.
- Confirmatory factor analysis is a technique to determine the underlying factor structure if the researcher starts with hypotheses precisely stipulating the number of potential factors, which items should load on (or correlate with) which factors, and how the factors should be correlated. As the features of the factor structure are defined by the hypotheses, the resulting analysis reveals the degree to which items load on the prespecified factors.
- The latent factor produces or causes the variance of each item, and its measurement error (E1, E2, E3, etc.) explains the remaining variance of each item.
- Item 4 yields a poor factor loading that suggests inadequate representation of the factor. You could allow the item to load on the other factor and then redo the analysis, or delete the item if the modification does not yield a higher loading. The arrowheads of factor loadings always point from factor to items, because the unobservable phantom construct is thought to drive these various observable manifestations (items).
- A factor loading indicates the extent that the latent factor, statistically derived from the item commonalities, in turn explains or causes each item. An item sharing more of its variability with other items in the same factor is granted a larger factor weight, and is correspondingly contaminated by less measurement error.
- Also diagrammed in Figure 3.1, each measurement error term (E1 through E7) contains the variance portion of an item that is not explained by the latent factor. Measurement error may be decomposed into two elements (Bedian et al., 1997). Random error of each item has been separated from the latent factor, and therefore the factor has been corrected for unreliability attenuation. Furthermore, systematic error also has been estimated for each measurement error term—but only in instances when an extraneous bias affects a single item.
- Although latent factors are not directly observable (they are observed through their items), they can be estimated and subjected to hypothesis testing.

CBL Chapter 3 - Reliability and Validity

- Although all translations (that is, measures) are imperfect, individual measures vary in the adequacy with which they characterize the underlying conceptual variable of interest.
- Briefly, reliability is the consistency with which a measurement instrument assesses a given construct; validity is the degree of relationship, or the overlap, between a measurement instrument and the construct it is intended to assess.
- The concept of reliability derives from classical measurement theory, which assumes that the score obtained on any single measurement occasion represents a combination of the true score of the object being measured and random errors that lead to fluctuations in the measure obtained on the same object at different occasions.
- The standard formula usually lists only random error; it does not take account of systematic error, or it combines it with random error.

CBL Chapter 3 - Latent factor

- Although no consensus exists regarding the features that constitute a contemporary test theory, one definition is that it is characterized by the search for a latent factor (Borsboom, Mellenbergh, & van Heerden, 2003), a characteristic or construct that is not directly measured or observed but that underlies responses on a measurement scale.
- Latent factors are analogous to the "true" score component of classical test analysis. Measurement error becomes part of the scale score in classical testing, but its distorting effect is explicitly estimated and removed from a latent factor. The first type includes factor analysis and structural equation modeling.
- Classical testing places greater emphasis on the total scale; for example, the mean and reliability are calculated and reported to provide information about properties of the total scale. Contemporary testing, however, focuses on individual items, offering greater information into how each item functions in the context of the total scale.
- In classical testing, as discussed, the observed score will contain not only the underlying true score but also measurement error. Unfortunately, our inability to remove measurement error variance from the observed total score is the cost of using traditional classical testing. This contamination of an observed score is problematic when using the measure for hypothesis testing - it obscures detection of the real association between two variables.
- The correction for attenuation is an estimate of the correlation between the two variables if the measures were perfectly reliable. As the reliability coefficients are input into the denominator, a scale lower in reliability will show a larger improvement after the correction, as it contains a greater portion of random error.
- After accounting for the unreliability in the scale scores (i.e., correcting for attenuation), the disattenuated correlation jumps to .53. However, if the scales were perfectly reliable (i.e., rxx = ryy = 1.0), the "correction" for attenuation would not change the original result. A larger correlation to account for the effect of unreliability is obtained because the correction now represents the correlation between two perfectly reliable factors.
- First, in some instances it is possible to obtain out-of-bounds correlation coefficients (r > 1.0) after the adjustment. In other words, it is possible for the attenuation formula to overcorrect and inflate the estimate of the true relationship. A second issue concerns the most appropriate type of reliability to be entered into the formula. Although parallel form reliability has been suggested as the best estimator, others have argued for internal consistency and test-retest approaches (Muchinsky, 1996).
- Latent approaches statistically remove error—through iterative estimation procedures—by keeping the statistical variance consistent across items in the factor.
- Again, true score exclusively pertains to replicability of scores and implies absolutely nothing about whether the construct was validly assessed.
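A tiny sketch (observed r and reliabilities invented, not the chapter's own example) of the correction for attenuation, r_corrected = r_xy / sqrt(r_xx * r_yy), including a guard for the overcorrection issue noted above:

    # Sketch: disattenuating a correlation for the unreliability of both scales.
    import math

    def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
        r_star = r_xy / math.sqrt(r_xx * r_yy)
        return min(r_star, 1.0)   # the formula can overcorrect past 1.0; cap for safety

    print(round(disattenuate(r_xy=0.30, r_xx=0.60, r_yy=0.53), 2))  # larger than the observed .30
    print(round(disattenuate(r_xy=0.30, r_xx=1.00, r_yy=1.00), 2))  # perfectly reliable scales: no change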

Warner Chapter 10 - Correlation

- Before obtaining a correlation, need to examine an X, Y scatterplot to see if the association between X and Y is approx. linear
- Values of r range from -1.00 to +1.00
- Sign of r tells us the direction of the association; the absolute magnitude of r indicates the strength of the association - if near 0, little or no association
- If one variable is clearly the predictor or causal variable, place it on the X axis
- Values of r tend to be below .30 in absolute value
- Useful to think about the way average values of Y differ across selected values of X
- An r of 0 tells us there is no linear relationship between X and Y - this can occur if X and Y are completely unrelated, or if the relationship is nonlinear or curvilinear (a straight line is not a good description of the pattern)
- Need to examine whether the assumptions for linear correlation are satisfied

Warner Chapter 10 - Reporting many correlations and inflated risk for Type I error

- If set risk for Type I error at .05 (reject H0 if p < .05), 5% correlations judged stat sig even if data completely random - When we run many tests and violate other assumptions, actual risk for committing Type I error larger than .05 - When run large numbers of significance tests, should view decisions about stat sig skeptically - Report results as purely exploratory to avoid problem of inflated risk - Make clear evaluation of stat sig problematic - don't include p values - P values provided by SPSS not corrected for inflated risk of Type I error - Limit number of correlations - limit risk Type I error - Drawback - preclude discoveries - Replicate or cross-validate correlations - Obtain new samples of data and see if same X, Y correlations sig in new batch of data as first batch - If relations btw variables remain sig, less likely instance of Type I error - Bonferroni procedure - use more conservative alpha level for tests of individual correlations - Set per comparison alpha level lower for each individual test - EW / k - EW - experiment-wise alpha, k = number of sig tests
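A tiny sketch of the Bonferroni per-comparison alpha described above (the number of tests is invented):

    # Sketch: Bonferroni per-comparison alpha = experiment-wise alpha / number of tests.
    alpha_experimentwise = 0.05
    k_tests = 10                      # e.g., 10 correlations examined

    alpha_per_comparison = alpha_experimentwise / k_tests
    print(alpha_per_comparison)       # 0.005 -- each correlation must reach p < .005 to be called sig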

Warner Chapter 1 - Uses for effect sizes

- Standardized effect sizes (such as Cohen's d or r) provide a basis for labeling strength of relationships between variables as weak, moderate, or strong. - Standardized effect sizes can be compared with those found in other studies and in past research. Additional information, such as raw-score regression slopes and group means in original units of measurement, can help readers understand the real-world or clinical implications of findings (at least if the original units of measurement were meaningful). - Effect size estimates from past research can be used to do statistical power analysis to make sample-size decisions for future research. - Finally, effect size information can be combined and evaluated across studies using meta-analysis to summarize existing information.

CBL Chapter 15 - Semantic differential scale

- In Osgood's semantic differential scale, rather than asking respondents to respond to variations of statements concerning the concept under study (as in the Likert or Thurstone approaches, for example), the concept is presented directly and participants are instructed to react to it in the form of ratings on a number of bipolar adjectives.
- In semantic differential scales, a single stem indicating the construct to be judged is presented, followed by a set of adjective pairs to be used in the judging process. In Likert scaling, different stems are used to describe the item to be evaluated, but the same response options (strongly agree, agree, etc.) are consistently used to capture responses.
- Another obvious advantage concerns time savings for participants and scale constructors. Even with the same number of items, a semantic differential scale will be shorter in length, easier to read, more difficult to misinterpret, and therefore completed more quickly by respondents than a Likert scale. From the scale constructor's perspective, the Osgood approach is considerably more efficient.
- The semantic differential approach offers many practical advantages. Given the nature of the statistical process through which the various factors or "clusters" of items were developed, it is safe to assume that the internal consistency of such a measurement instrument will be high, an assumption that always should be verified. A strong coefficient of internal consistency and high item-total correlations indicate the items chosen to measure the attitude object are related to the overall composite.
- Another response format issue requiring consideration concerns the number of response categories appropriate for an item. Items consisting of five and seven response categories are the most commonly encountered, but some researchers use a greater number. A tradeoff in the inclusion of a greater number of response options is the greater amount of time required to contemplate which of the many categories best corresponds to one's attitude toward an issue, and there is some question regarding respondents' capacities to make fine-grained distinctions in responding to attitude items.
- Another consideration concerns the choice of using or excluding reversed items. Many scales contain a mixture of positively and negatively phrased (or reversed) items. Reversed items are used to guard against the threat to measurement validity known as acquiescence response bias, the penchant for some individuals to agree to item statements regardless of content. Acquiescent responding could occur because of a personality disposition to be agreeable or the desire to please the investigator.

CBL Chapter 4 - Requirement of the MTMMM

- In their classic paper, Campbell and Fiske (1959) suggest that four important requirements be met before we conclude that our measures are valid indicators of a construct. The first is that the correlations in the validity diagonals (the monotrait-heteromethod values) be statistically and practically significant. This requirement is concerned with convergent validity. It is reasonable to require these values be strong because they are meant to express the association between different measures of the (presumably) identical trait. For most theoretical purposes, measurement method is considered incidental, or theoretically vacuous, and as such should not affect the values of the traits or constructs of interest. We have convergent validity when there is a strong overlap among the various assessments of traits that are considered identical (and which differ, presumably, only in the manner in which they were measured). Thus in our example, all three of our measures of self-esteem should correlate strongly—this is a reasonable requirement if we are to infer that they measure the same underlying construct. If they do not, it is possible that the measurement method we used to assess self-esteem, which plays no role in our theory, is impinging on our results. In Table 4.2, the correlational results satisfy this requirement. - The second (critical) requirement of the MTMMM is that the validity values exceed the entries in relevant heterotrait-monomethod triangles (which are bound by solid lines). This "common sense desideratum" (Campbell & Fiske, 1959, p. 83) requires that the relationship between different measures of the same trait should exceed the correlation between different traits that merely happen to share the same method of measurement. If this requirement is not met, we are left to conclude that systematic (but theoretically irrelevant) measurement error may be controlling outcomes to an unacceptable degree. - In the language of the MTMMM, the monotrait- heteromethod values should exceed associated heterotrait-heteromethod associations. This too, is a reasonable requirement. It means that the relationship between different measures of the (presumably) same trait should be stronger than correlations between different traits assessed with different measures. Discriminant validity is called into question if the association of different traits determined by disparate measures exceeds that involving identical traits (also measured by means of dissimilar methods). - The final requisite of the MTMMM technique is that the same patterns of trait interrelations be observed in the heterotrait triangles irrespective of measurement overlap; that is, the pattern of trait interrelations should be the same in the monomethod and the heteromethod blocks. Such a requirement would be met if method were truly incidental, as required.

CBL Chapter 3 - Factor Analysis

- In this circumstance, it is important to determine whether the items, in fact, do focus upon a single central, underlying construct (unidimensional), or if the scale is multidimensional, that is, if it taps a number of different constructs. - To evaluate the possibility that the scale is multidimensional, the entire matrix of item intercorrelations can be entered into a factor analysis. This type of statistical analysis provides the researcher with information regarding the actual number of constructs or "subscales" that may exist in the instrument under construction, as evidenced by the respondent sample. - Based on this information, the investigator may decide to retain only a subset of the original items (e.g., those that form the most internally consistent subset, as indicated by a reliability analysis on the various subcomponents of the overall instrument) and to develop additional items to add to this subset in constructing an improved scale. The other items would be discarded or used to create a separate scale or scales, with their own internal consistency coefficients, etc. - If we were to perform factor analysis on the item set, we might find that one group of items that "hang together" (i.e., have high internal consistency) all have to do with participants' feelings of obligation to future generations. Another set of items that hang together might all have to do with the financial implications of environmental depredations. - Eventually, an iterative process of factor analysis and reliability analysis will enable the investigator to generate a scale—or subscales—of acceptable internal consistency.

Warner Chapter 1 - Effect Size

- Independent of sample size - Some have fixed range of possible values, but others do not - Many effect sizes are in unit-free terms - Can be presented in terms of original units of measurement - useful when original units of measurement were meaningful - Some can be directly converted into other effect sizes - Cohen's guidelines for verbal labeling widely used - Value of test statistic depends on effect size and sample size or df - Many journals now call for reporting effect size info - Judgments about clinical or practical importance of research results should be based on effect size info - Cohen's d - describes difference btw group means for treatment and control groups - Raw or standardized regression slope coefficients can also be treated as effect sizes in meta-analysis - CIs can be set up for effect size estimates - Judgments about theoretical sig sometimes made on basis of magnitude of standardized effect size indexes such as d or r - Given the effect size, how much does this variable add to our ability to predict some outcome of interest or explain variance - variance-accounted-for effect sizes tell us what proportion of variance a variable predicts - why not report effect size: SPSS does not provide info, CIs often embarrassingly large and effect size small - e.g., I have accounted for 1% of variance

Warner Chapter 1 - Practical and Clinical Significance

- practical significance - diff btw group means large enough to be valued - in regression study - prac sig = large and valuable increases in outcome variable as scores on IV increase - Cohen's d sometimes interpreted in terms of clinical sig - Examining diff btw group means in original units can be more useful to evaluate clinical or practical importance - Examples of criteria that could be used to judge whether results are clinically or practically sig: o Are group means so far apart that one mean is above and the other mean is below some diagnostic cutoff value? o How large does a difference have to be for most people to even notice or detect it? - should be noticeable in everyday life

CBL Chapter 3 - Parallel forms validity

- Known as parallel forms (or alternate, or equivalent forms) reliability, it is assessed by devising two separate item sets intended to assess the same underlying construct, administered to the same participants at the same time with degree of relatedness calculated. If a high relationship is obtained between scores on the two tests, it is interpreted as an indication of the reliability of the instrument(s). - One way to verify if parallel scales were obtained is to show that the means and standard deviations of the two tests are very similar, if not identical. The rationale here is the same as that of the split-half approach, except that the two ("equivalent") forms are considered whole tests. - However, it might also be the case that the parallel forms simply are not equivalent. In attempting to determine the reasons underlying a lack of interrelatedness between two theoretically identical measures, the investigator sometimes must devote more time than would be demanded in the development of an entirely new set of scales. - What's more, the information that a test is temporally stable usually is not considered sufficient evidence of a scale's reliability because it is possible that a scale could elicit stable responses across time and still not be internally consistent. - To satisfy the full set of criteria of scale reliability, it is desirable that the scale demonstrate both temporal stability and internal consistency. - Thus if a test is known to be temporally stable, then the explanation of changes between test administrations can be directed at external agents—e.g., a social intervention, a historical event, maturation, etc.

Warner Chapter 10 - Pearson r Scatterplot

- Mx line divides the range of X values into those below and above the mean of X - My line divides Y values into those below and above the mean of Y - Regions II and III - concordant pairs - X scores and Y scores both high, or X scores and Y scores both low - if most data falls in regions II and/or III - correlation tends to be large and positive - Regions I and IV - discordant pairs - high X with low Y, low X with high Y - r tends to be negative if most data points are in regions I and/or IV - If evenly distributed among four regions, pos and neg values equally common, cancel each other out, overall correlation close to zero - Pearson's r deflated if association between X and Y nonlinear or curvilinear - Pearson's r inflated or deflated bc of bivariate outliers - Pearson's r deflated if X and Y variables have diff distribution shapes - Pearson's r deflated if scores on X and Y have restricted ranges - Pearson's r overestimates if only groups of persons with extremely low and high scores on X and Y examined - Samples with members of diff groups can give misleading info - If X and Y have poor measurement reliability, correlations with other variables reduced - If duplicate questions in X and Y measures, they will be highly correlated
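A minimal Python sketch (not from the chapter; the data and names are made up) showing how counting points in the regions around Mx and My relates to the sign of Pearson's r:

```python
# Hedged sketch: classify points as concordant/discordant around the means of
# X and Y, then compare with Pearson's r (simulated, mostly positive relation).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 0.6 * x + rng.normal(0, 8, 200)

mx, my = x.mean(), y.mean()
concordant = np.sum(((x > mx) & (y > my)) | ((x < mx) & (y < my)))  # same side of both means
discordant = np.sum(((x > mx) & (y < my)) | ((x < mx) & (y > my)))  # opposite sides

r, p = pearsonr(x, y)
print(f"concordant = {concordant}, discordant = {discordant}, r = {r:.2f}, p = {p:.3f}")
```

When most points are concordant, r comes out positive; when most are discordant, it comes out negative; a roughly even split gives r near zero.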

Warner Chapter 10 - Research Situations Where Pearson's r is Used

- Pearson's r used when researchers want to know whether scores on two quant variables X and Y are linearly related - Pearson correlation obtained in sample = r - Correlation in population = ρ (rho) - Using r to estimate ρ - Reasons for selection of X and Y variables o Theory: X might cause or influence Y o Past research suggests X predicts Y o X and Y may be diff ways to measure same thing - Sometimes X and Y are diff measures of same thing - correlation tells us whether the two measures yield consistent, reliable results

CBL Chapter 15 - Rating Scale - Thurstone

- Rating scales are formalized versions of questionnaires. The difference is that whereas a single item usually is used to represent each concept in a questionnaire, multiple items are used in creating scales. Because scales use multiple items to triangulate on, or to help define a concept, they are more appropriately used to measure attitudes, values, or personality dispositions, reflecting the view that people's attitudes or beliefs are not singularly defined. - In the typical Thurstone scale, respondents are asked to endorse the scale item or items with which they agree. Items are designed so that a single item, or a highly restricted range of items, should be endorsed by the respondent, and those that are more extreme or less extreme than the chosen alternative should be rejected. Items of this type have been termed nonmonotone (Coombs, 1950) or noncumulative (Stouffer, 1950), because it makes little sense to sum a respondent's scores over all items of the scale of this type. Agreement with one item does not imply an increased probability of agreement with any other item on the scale - Thurstone's method of equal-appearing intervals is conducted in four phases. The first phase in the scale construction process requires the researcher to generate many potential items, all of which appear at least initially to relate to the construct, object, or attribute of interest - In the second phase of the Thurstone scale construction process, a group of judges is recruited to develop the psychometric properties of the scale. Each judge independently estimates the degree of favorability or unfavorability expressed by each item toward the critical attitude object. - The third phase involves the selection of items for the final scale. Based on judges' ratings, the investigator determines the mean favorability rating for each item and its standard deviation - Administering the final set of items to participants is the fourth and final phase. We now move from the scale construction phases to the scale administration phase. In administering the scale, the researcher instructs participants to "indicate whether you agree or disagree with each item." - A major common indicator of scale quality, internal consistency (Chapter 3), is not a meaningful concept in the context of the Thurstone scale, because such measures make use of participants' overall responses to all items of a scale. - With items of this type, the more favorable (or extreme) the respondent's attitude toward an object or issue, the higher (or more extreme) the individual's total attitude score. Results obtained from items from a monotone scale could be summed or averaged to derive a cumulative score of the construct or dimension being judged.

Warner Chapter 13 - reliability

- Reliability = consistency of measurement results - To assess reliability need to make at least 2 measurements for each participant and calc appropriate stat to assess stability or consistency of the score - Measure same set of participants at two points in time - test-retest reliability - quant X variable - Categorical X variable - reported by two diff observers or raters - percent of agreement btw pair of judges - Cohen's kappa provides assessment of level of agreement between observers corrected for levels of agreement expected to occur by chance - Multiple-item test - factor analysis and Cronbach's alpha internal consistency reliability coefficient - assessing agreement about level of depression measured by multiple test questions or items - Pearson's r can be computed to assess stability or consistency of scores across two times - High value of r tells us high degree of consistency at diff times
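A small hedged Python sketch (hypothetical ratings) of percent agreement and Cohen's kappa for two raters; kappa corrects the observed agreement for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e):

```python
# Hedged sketch: interrater agreement on a categorical rating (made-up data).
import numpy as np

rater1 = np.array(["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"])
rater2 = np.array(["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"])

p_o = np.mean(rater1 == rater2)  # observed percent agreement

# chance agreement: sum over categories of (rater1 marginal * rater2 marginal)
cats = np.union1d(rater1, rater2)
p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in cats)

kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```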

Warner Chapter 13 - reliability vs. validity

- Reliable - consistent results - Low reliability implies scores contain great deal of measurement error - Measure is valid if it measures what it purports to measure - Scores provide info about the underlying construct or theoretical variable that it is intended to measure - Scores distinguish among people who have diff characteristics - Factors that reduce measurement sensitivity - limited number of response alternatives and ceiling or floor effects - More alternatives, more sensitive

CBL Chapter 3 - Temporal Stability

- Temporal stability - Questions pertaining to this aspect of scale quality are concerned with the degree to which the observations obtained in a given test administration resemble those obtained in a second testing, which employs the same measure and the same respondent sample. - test-retest reliability, which is assessed by administering a scale to participants, and at a later time re-administering to the same participants, with the degree of relatedness calculated between two administrations. Respondents' scores on the first administration are compared to the scores they obtain on the second; a large positive correlation is taken as evidence of (temporal stability) reliability. - The major problem with the test-retest method is that the information it provides can prove ambiguous. - Chances are good that the correlation between participants' scores on the tests would be nearly perfect. This would not, however, necessarily indicate that the scale would be reliable (i.e., replicable) across a longer period of time. - Apparent temporal reliability will be enhanced artificially if participants can remember their previous responses and wish to appear consistent. Conversely, a very long delay between administrations can diminish temporal stability, because people do change over time. Thus, even a very good test can appear unreliable if the temporal separation between administrations is extreme.

Warner Chapter 13 - Cronbach's alpha

- The internal consistency reliability of a multiple-item scale tells us the degree to which the items on the scale measure the same thing. If the items on a test all measure the same underlying construct or variable, and if all items are scored in the same direction, then the correlations among all the items should be positive. - If we perform FA on a set of five items that are all supposed to measure the same latent construct, we would expect the solution to consist of one factor that has large correlations with all five items on the test - We can summarize information about positive intercorrelations between the items on a multiple-item test by calculating Cronbach's alpha reliability. Cronbach's alpha has become the most popular form of reliability assessment for multiple-item scales. - In theory, as the number of items (p) included in a scale increases, assuming other characteristics of the data remain the same, the reliability of the measure (the size of the p × T component compared with the size of the Σe component) also increases. Cronbach's alpha provides a reliability coefficient that tells us, in theory, how reliable our estimate of the "stable" entity that we are trying to measure is, when we combine scores from p test items (or behaviors or ratings by judges). Cronbach's alpha (in effect) uses the mean of all the interitem correlations (for all pairs of items or measures) to assess the stability or consistency of measurement. - Cronbach's alpha can be understood as a generalization of the Spearman-Brown prophecy formula; we calculate the mean interitem correlation to assess the degree of agreement among individual test items, and then, we predict the reliability coefficient for a p-item test from the correlations among all these single-item measures. Another possible interpretation of Cronbach's alpha is that it is, essentially, the average of all possible split-half reliabilities. - It follows that we can increase the reliability of a scale by adding more items (but only if doing so does not decrease the mean interitem correlation) or by modifying items to increase the mean interitem correlation (either by dropping items with low item-total correlations or by writing new items that correlate highly with existing items). - There is a trade-off: If the interitem correlation is high, we may be able to construct a reasonably reliable scale with few items, and of course, a brief scale is less costly to use and less cumbersome to administer than a long scale. - Note that when the items are all dichotomous (such as true/false), Cronbach's alpha may still be used to assess the homogeneity of response across items. In this situation, it is sometimes called a Kuder-Richardson 20 (KR-20) reliability coefficient. Cronbach's alpha is not appropriate for use with items that have categorical responses with more than two categories. - Other factors being equal, Cronbach's alpha reliability tends to increase as p, the number of items in the scale, increases.
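A minimal Python sketch of Cronbach's alpha from an items-by-participants matrix, assuming the standard variance form alpha = (p / (p - 1)) * (1 - sum of item variances / variance of the total score); the item data below are hypothetical:

```python
# Hedged sketch: Cronbach's alpha for a small made-up item matrix.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = participants, columns = scale items."""
    items = np.asarray(items, dtype=float)
    p = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (p / (p - 1)) * (1 - item_vars.sum() / total_var)

# five hypothetical Likert items for six participants
data = [[4, 5, 4, 4, 5],
        [2, 2, 3, 2, 2],
        [5, 4, 5, 5, 4],
        [3, 3, 3, 4, 3],
        [1, 2, 1, 2, 1],
        [4, 4, 5, 4, 4]]
print(f"alpha = {cronbach_alpha(data):.2f}")
```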

CBL Chapter 3 - True Score - random and systematic error

- The true score is the replicable feature of the concept being measured. It is not "true" in the sense that it is a necessarily perfect or valid representation of the underlying construct. "True" in the present context signifies replicability, the component part of the observed score that would recur across different measurement occasions in the absence of error. - In this sense, the true score actually represents the reliable portion of the observed measurement across infinite potential measurements - the formula contains two unknown elements (the true score and error), making this equation unsolvable. To estimate the error component, we must have replication of measurement. - The greater the proportion of error, the less the observed score reflects the underlying true score, and the more unreliable the measure is said to be. - Random error is attributed to unexpected events that tend to artificially widen the variability or spread of observed scores in a nonsystematic way. - Examples include inadvertent misrecording - Chance events, by definition, are nondirectional on average, that is, across a large number of replications negative and positive errors would cancel each other out. - According to classical test theory, if the group is large enough, random errors across participants will sum to zero. Random error increases variability of scores, and increased variability reduces the power of tests of statistical significance. When obscured by random error, larger mean differences between groups are needed before the true differences between groups can be judged real or trustworthy - Systematic error is attributed to unexpected events that tend to artificially inflate or deflate observed scores in a systematic way. - Systematic error should shift most, if not all, scores in the same direction. Because systematic error (bias) is not random, it does not cancel between groups; rather, it either exacerbates or mitigates differences that actually exist. - The situation that results depends on the direction of the bias - As we have illustrated above, systematic error (bias) can result in either Type I or Type II errors. Depending on the direction of the bias, it fosters conclusions of a stronger or weaker difference between groups. Random errors, on the other hand, affect the measure's reliability. Random error lessens the chances of finding a true difference between groups when, in fact, a true difference may exist. As such, random error fosters Type II errors.

Warner Chapter 13 - composite measures

- Using a score that corresponds to a sum or avg across multiple measurements provides a measure that is more reliable - Assess reliability or consistency of responses - Composite measures (scales) are generally more reliable than single scores, and the inclusion of multiple measures makes it possible to assess internal consistency reliability. Scores on composite measures generally have greater variance than scores based on single items. The distribution shape for scores on individual items that use common response formats, such as five-point Likert rating scales, is typically non-normal. The distribution shape for scores formed by summing multiple measures that are positively correlated with each other tends to resemble a somewhat flattened normal distribution; the higher the correlation among items, the flatter the distribution for the sum of the scores - Scores on composite or multiple-item measures may be more sensitive to individual differences

CBL Chapter 4 - score a person receives on a measurement scale is never a pure

- We must acknowledge that the score a person receives on a measurement scale is never a pure, totally accurate picture completely determined by the attribute or construct in question, but rather is the product of a complex interaction of many factors, only one of which is the attribute or construct of theoretical interest. - Perhaps not surprisingly, a respondent's mood may have a strong impact on the responses he or she gives to a social query. - In many measures of social variables, the respondent is asked to present a self-report concerning some more-or-less important belief, value, attitude, or behavior. There are some situations in which an individual's actual beliefs, values, attitudes, and/or behaviors are not aligned with those approved by common social norms. Under such conditions, the respondent might be tempted to respond in a "socially desirable" way by misrepresenting true feelings and responding in a manner that is consistent with social mores. - A more tractable problem arises when a verbal measure uses language that is different from that characteristically employed by the respondent sample. - There is some evidence in the attitude scaling literature suggesting that reliable differences exist among people in terms of their tendency to employ (or to avoid) the extreme response options of rating scales - People's tendency to acquiesce to, or to agree with, positively worded statements is the final stylistic response trait to be considered.

CBL Chapter 3 - Reliability

- When measures are taken on a large group of individuals with a given instrument, the variability in obtained scores is due partly to differences among those individuals in their true scores on the measure, and partly to random and systematic fluctuations. Technically, the reliability of a measure is defined as the proportion of the total variance in observed scores that is due to true score variability. A perfectly reliable instrument would be one in which this proportion was equal to 1.00, or in which true score equaled observed score - A perfectly unreliable score, on the other hand, would be one in which the observed score equaled the sum of the error components, and true score contributed nothing to the observed score - Reliability has referred to the degree to which participants' scores on a given administration of a measure resembled their scores on the same instrument administered at some later point in time—or the extent to which two judges, observing the same behavior, produced the same ratings of the behavior. If the test-retest scores tended to be very similar (i.e., highly interrelated), the measure (or the judges) was said to be reliable. Or, if parallel forms of a test—two forms of the test that are thought to measure the same construct—were highly correlated, the test was said to be reliable - However, reliability also has come to indicate the degree to which the set of items or questions within a particular multiple-item scale are interrelated.

CBL Chapter 4 - Validity

- Whereas reliability has to do with the internal qualities of measurement, the validation of operations relative to the hypothetical concepts under investigation is crucial from the standpoint of theory development. - It is easily conceivable that the procedures usually followed to generate a reliable scale of individual differences could lead to an internally consistent, temporally stable instrument that had no relationship whatever to the theoretical attributes that motivated the research in the first place. - Basically, the validity of a scale refers to the extent of correspondence between variations in the scores on the instrument and variation among respondents (or other objects of measurement) on the underlying construct being studied. - In classical test theory, the true score represents a measure of the replicable "shared variation," or the "common factor" that underlies participants' responses to all items. Whether this response factor adequately reflects the particular conceptualization that the investigator wants to measure, however, is still a matter for investigation. - Validation always requires empirical research beyond that used in the scale construction (reliability) phase of instrument development and refinement. This validation process will invariably focus on the relationship of the scale with some other indicators of the construct under investigation. - definition of validity as a "judgment of the degree to which evidence and theoretical rationales support the adequacy and appropriateness" of a construct. This definition suggests that validity is not a thing or a feature of a measure, but rather an aspect of the interpretation of a measure. As such, validity is always open to question, review, and revision - it is never a closed issue, but rather a continuous process - The fact that previous research demonstrated the validity of a scale does not necessarily imply that it will be valid in another setting, with different respondents, at different times, and so on - Because validity changes from time to time and from sample to sample, it should be reevaluated periodically to ensure that what once was a valid indicator of some theoretical construct remains so

Video - CI

- confidence interval - range of values and confidence for range - normal distribution, representative sample - e.g., variable life of battery, performance of toy - mean, SD, sample size Assumptions - independent observations - normal distribution or n > 30 - known standard deviation - correct results 95% of time

Chi-square Test SPSS

- do males and females significantly differ in marijuana use - percentage of IV in DV categories (e.g., 60% of female participants use marijuana) - IV and DV both categorical - df calculated using cross tab tables as (# columns - 1)(# rows - 1) - 2 x 2 will always have df of 1 - if a cross tab cell has n < 5, interpret the Fisher's Exact Test "A 2(male vs. female) x 2(marijuana user vs. non-user) chi-square test was performed. The percent of males who used marijuana (70%) was significantly higher than the percent of females who used marijuana (40%), χ²(1) = 5.01, p < .05." "A 2(male vs. female) x 2(marijuana user vs. non-user) chi-square test was performed. The proportion of males who used marijuana (.75) was significantly higher than the proportion of females who used marijuana (.40), χ²(1) = 5.01, p < .05."
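A hedged sketch of the same kind of 2 x 2 test in Python with scipy (the counts below are made up, not the marijuana example's actual data):

```python
# Hedged sketch: 2 x 2 chi-square test of association on hypothetical counts.
import numpy as np
from scipy.stats import chi2_contingency

#                 user  non-user
table = np.array([[21,    9],     # males
                  [12,   18]])    # females

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.3f}")
print("expected counts:\n", expected.round(1))   # check the n >= 5 per cell rule of thumb
```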

CBL Chapter 3 - Generalizability theory

- generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). This approach recognizes that irrelevancies (error) can be introduced into a test by many different factors, or facets, to use their terminology. These irrelevancies can reside in observers, items, contexts, occasions for measurements, respondents, etc. The more generalizable the instrument—that is, the more consistent it is across occasions, respondents, contexts, etc., the better or more trustworthy the instrument is.

Video - twins, environment vs. genetics

- how closely correlated for certain traits (e.g., personality, intelligence) - compare twins raised apart to those raised together - raised apart reflects genes, raised together - genes and environment - diff in size of correlation shows influence of family environment - pos correlation for personality when raised apart - comparing fraternal and identical twins - level of activity in variety of situations - impact of genes - identical - when activity measured at home, correlations approx the same - common environment - when measured in lab, correlation higher for identical twins

Cohen Article

- power = probability investigation would lead to stat sig results - mean power to detect medium effect sizes = .48 - statistical power analysis exploits relationships among the four variables involved in statistical inference: sample size (N), significance criterion (α), population effect size (ES), and statistical power Significance criterion - risk of mistakenly rejecting the null hypothesis and thus of committing a Type I error - maximum risk attending such a rejection Power - probability, given the population ES, α, and N, of rejecting the null hypothesis - a materially smaller value than .80 would incur too great a risk of Type II error - a materially larger value would result in a demand for N that is likely to exceed the investigator's resources Sample size - need to know N necessary to attain desired power for the specified sig criterion and hypothesized ES - N increases with an increase in the power desired, a decrease in the ES, and a decrease in the sig criterion (α) Effect size - having some idea about the degree to which the null hypothesis is believed to be false - degree to which null hypothesis is false indexed by discrepancy between H0 and H1 = ES - for the H0, ES = 0 - intent: medium ES represents an effect likely to be visible to the naked eye of a careful observer - set large ES to be same distance above medium as small was below it - for H0 d = 0, small, medium, and large ESs (or H1s) are d = .20, .50, and .80 - operationally defined medium difference between means is half a standard deviation - the ES posited by the investigator is what he or she believes holds for the population, and the sample size that is found is conditional on the ES - either r is smaller than .30 or the investigator has been the victim of the .20 beta risk of making a Type II error
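A rough Python sketch (my own normal approximation, not Cohen's exact tables) of the per-group N needed for a two-group comparison at a given d, alpha, and power, using n ≈ 2(z_alpha/2 + z_power)² / d²:

```python
# Hedged sketch: approximate per-group N for an independent-groups comparison.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed significance criterion
    z_power = norm.ppf(power)           # desired power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

for d in (0.20, 0.50, 0.80):            # Cohen's small, medium, large
    print(f"d = {d:.2f}: about {n_per_group(d):.0f} participants per group")
```

This makes the relationships in the notes concrete: smaller hypothesized effects and stricter alpha levels demand much larger N for the same power.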

Chi-square Test: Statistical Assumptions

1) Cell sample size - a minimum of 5 participants in every cell of a cross tab table - satisfied: perform "chi-square test" - not satisfied: instead, perform "Fisher's exact test" 2) Independence of participants - each participant falls under only one level of the IV (and only one level of the DV)
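A minimal scipy sketch of Fisher's exact test for a 2 x 2 table with a small cell (hypothetical counts):

```python
# Hedged sketch: Fisher's exact test when a 2 x 2 cell falls below n = 5.
from scipy.stats import fisher_exact

table = [[8, 2],    # group 1: yes / no
         [3, 7]]    # group 2: yes / no

odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")
```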

5 Types of Measurement Validity: 1

1) Face validity - superficial impressions regarding extent that measure appears to capture a construct - aim: created items in measure should "appear" to assess construct Criticisms - aim is not always for high face validity (some measures aim for low face validity to minimize malingering) - e.g., MMPI designed to have low face validity so respondents don't know its purpose - not based on research or data (unlike other types of measurement validities) - thus, some methodologists (including Campbell) do not consider it a type of measurement validity - design items that don't appear to measure construct so participants don't figure out true purpose of study - does content of items appear to be valid assessment of construct

3 Types of reliability: 1

1) Internal consistency: degree that items/observations are consistent (hang together) in assessing the measure Split-half reliability (not used anymore) 1) Devise one "whole" test/measure and administer to participants 2) Randomly divide items into two sets (each containing approximately equal # of items) 3) For each of the 2 sets, compute mean composites - correlate mean composites - evaluate extent highly correlated or hanging together 4) Compute correlation. Aim: high r Cronbach's alpha (advanced version of split-half reliability) - every combo of split-half reliability correlations - performs every combo used to compute Cronbach's alpha 1) calculate every possible combination of "split-half" reliability correlations 2) compute average of all these split-half reliability correlations to yield Cronbach's alpha. Aim: .75 or higher (.60 minimum) - suggests items hang together - range 0 (poor) to 1.00 (excellent) - higher value due to both: a) higher inter-item correlation magnitudes and b) greater # of items - stronger correlation of items produces higher Cronbach's alpha - greater # of items more likely to yield higher Cronbach's alpha - if Cronbach's alpha adequate, then can compute composite score - use reverse-scored items - if items don't hang together, cannot compute mean composite
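A hedged Python sketch (simulated items, arbitrary odd/even split) of one split-half correlation stepped up with the Spearman-Brown formula r_sb = 2r / (1 + r); Cronbach's alpha is roughly the average of all such split-halves:

```python
# Hedged sketch: one split-half reliability with the Spearman-Brown correction.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
true_score = rng.normal(0, 1, 100)
items = true_score[:, None] + rng.normal(0, 1, (100, 6))   # 6 simulated items

half_a = items[:, [0, 2, 4]].mean(axis=1)   # odd items
half_b = items[:, [1, 3, 5]].mean(axis=1)   # even items

r, _ = pearsonr(half_a, half_b)             # correlation between the two half scores
r_sb = 2 * r / (1 + r)                      # stepped up to full-length reliability
print(f"half-half r = {r:.2f}, Spearman-Brown corrected = {r_sb:.2f}")
```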

Statistical assumptions Pearson r

1) Normal distribution for each X and Y variable - skewness index indicates approx. normal distribution - no extreme outliers 2) Homoscedasticity - for each value of X, variances for Y values approx. the same (scatterplot) - approx same variance at each x axis - Variant of "homogeneity of variances assumption" - if violated, referred to as "heteroscedasticity" 3) X variable linearly related to Y variable - straight line roughly represents the relation (does not detect nonlinear relations) 4) Independence of participants - each participant provides only one pair of X and Y scores - rare exception: dyadic designs (e.g., in a twin pair, member A provides the X score and member B provides the Y score)

Semantic Differential (Osgood, 1957)

1) researcher generates many sets of response options to tap construct (no judges involved) - single concept is presented directly, but represented using different response options 2) participants choose from a "multiple choice" response (or place a tick mark) for each item - response option anchors: bipolar adjectives (antonyms) - examples: 1 (dirty) to 5 (clean), 1 (expensive) to 5 (affordable), 1 (bad) to 5 (good) - participant's score: compute mean (or total) across items (composite score) - diff response options - not diff item statements like Likert Factor analysis of semantic differential items across various research topics indicates 3 dimensions of response anchors - evaluation (positivity): 1 (bad) to 5 (good), 1 (negative) to 5 (positive), 1 (unfair) to 5 (fair), 1 (sad) to 5 (happy) - potency (power): 1 (weak) to 5 (strong), 1 (soft) to 5 (hard) - activity (movement): 1 (slow) to 5 (fast), 1 (passive) to 5 (active), 1 (indecisive) to 5 (decisive), 1 (lazy) to 5 (industrious) - thus, consider these 3 types of response anchors when creating a semantic differential scale

Method of Equal-Appearing Intervals (Thurstone, 1928)

1) researcher generates many initial items (e.g., 130) of different attitude extremities to tap construct 2) judges (other researchers in lab) rate the degree of extremity of each attitudinal item (ignoring own attitudes) by assigning a "scale value" (e.g., 1 = extremely favorable, 11 = extremely unfavorable) 3) compute average "judges' scale value" (across judges' ratings) per item. afterward, select final items (e.g., 11) to represent the entire continuum of scale values and to possess approximately "equal-appearing intervals" 4) administer final items to participants. participants indicate agreement or endorsement ("yes") to each item - participant's score: average of the "judges' scale values" of the items endorsed (said yes to) - indicates whether or not the participant is extremely favorable toward that construct - trying to establish content validity - tapping entire bandwidth of construct

Method of Summated Ratings (Likert, 1932)

1) researcher generates many item statements to tap construct (no judges involved) - concept is represented using different statements, but using same response options 2) participants choose from "multiple choice" response for each item - examples: 1 (strongly disagree) to 5 (strongly agree), 1 (mostly false) to 5 (mostly true), 1 (never) to 5 (always) - participant's score: compute mean (or total) across items - compute score for each item, sum them up, total score - represents composite which is used in analysis - most commonly used - diff item statements - hope tapping entire range of construct trying to measure - higher values should rep more of construct - advantage in terms of time savings
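A minimal Python sketch of Likert (summated ratings) scoring with one reverse-keyed item on a 1-5 format (hypothetical responses; on a 1-5 scale, reverse-keying is 6 minus the response):

```python
# Hedged sketch: mean composite for a short Likert scale with a reverse-keyed item.
import numpy as np

responses = np.array([[5, 4, 2, 5],    # columns = items; item 3 is reverse-keyed
                      [2, 1, 4, 2],
                      [4, 4, 1, 5]])

reverse_keyed = [2]                                              # index of reverse-keyed item
responses[:, reverse_keyed] = 6 - responses[:, reverse_keyed]    # 1<->5, 2<->4, 3 stays 3

composite = responses.mean(axis=1)    # mean composite per participant
print(composite)
```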

Scalogram (Guttman, 1944)

1) researcher generates many items of different attitude extremities to tap construct 2) next, arrange each item from low to high attitudinal extremity (using judges) 3) administer to participants. participants indicate agreement or endorsement ("yes") to each item - participant's score: # of items endorsed - Aim: high "coefficient of reproducibility" - total # of items endorsed by a participant should be sufficient for researcher to infer response pattern - e.g., if participant endorsed 3 items, likelihood is "yes" to items 1, 2, 3 (because the top three are least extreme and the bottom two most extreme) - if endorse more extreme items, likely to have endorsed less extreme items

5 Types of Measurement Validity: 2

2) Content validity - extent measure adequately represents (or samples) the complete breadth of the construct - easier to achieve with factual (e.g., SAT math) than psychological measures - psych measures have more disagreement about what items to include - don't want items narrowly focused on one aspect of construct - want entire range Methods - literature review - to find conceptual definitions to help generate appropriate items for measure - e.g., how do other people define that construct - experts and focus groups - to help generate appropriate items for measure - e.g., how do people affected by the construct define it - factor analysis (post hoc) - to determine if items and factors were adequately represented in measure Aim - generate diverse items to adequately cover entire definitional range of construct

3 Types of reliability: 2

2) Temporal stability: degree that items/observations in measure are consistent over time Test-retest reliability - same scale measures across two time points - if consistent across time - test-retest reliability 1) Administer measure to participants (T1) 2) Re-administer same measure to same participants at later time point (T2) 3) For each time point, compute mean (or total) composite 4) compute correlation. Aim: high r (although attenuated by time) Parallel forms reliability - e.g., two version of self-esteem questionnaire - one version developed at one time and other at diff time - intended to measure same construct 1) devise 2 separate parallel (equivalent/alternative) versions intended to assess same construct (versions created at different time points) E.g., Version A and B 2) administer both versions to same participants at same time point 3) for each version, compute mean (or total) score 4) compute correlation. Aim: high r

CBL Chapter 4 - multitrait-multimethod matrix

A matrix that includes information on correlations between the measure and traits that it should be related to and traits that it should not theoretically be related to. The matrix also includes correlations between the measure of interest and other same-methods measures and measures that use different assessment methods. - In the multitrait-multimethod matrix (MTMMM), multiple measures are used to assess the extent of association of theoretically related but different constructs, over and above the association that might come about simply from having shared the same method of measurement. - The technique involves computing a correlation matrix that reveals the relationships among a set of carefully selected measures. The measures are chosen to represent a combination of several different (theoretically relevant) constructs, each assessed by several different methods of measurement. - Analysis focuses on the interrelationships among measures theorized to assess the same construct. These relationships are compared with those involving measures of different constructs that happen to be measured by the same measurement technique. The pattern of interrelationships among traits sharing the same and similar and different methods of measurement helps us determine the construct validity of the measures under study. - Four components of matrix before attempting to form assessment of convergent and discriminant validity - The reliabilities of the three methods of measurement, contained in the main diagonal of the matrix (parenthesized values), are considered first - heterotrait-monomethod triangles. These entries reflect the correlation between two different traits that are assessed by the same measurement method - heterotrait-heteromethod triangles, which are enclosed in broken lines. These values reflect the relationship between different traits assessed by different methods of measurement - Of course, the same sample of participants would be required to complete all of these methods to build a MTMMM. Although not limited exclusively to three trait factors assessed with three method factors, extending the matrix makes it unwieldy, as each new trait must be measured by every method. - The final entries to be considered are the monotrait-heteromethod values, which lie on the diagonal that separates the heterotrait-heteromethod triangles. These validity diagonals, as they are termed, reflect the association of presumably identical traits assessed by different methods of measurement - We use these various entries to assess construct validity - In terms of classical test theory, the underlying theoretical traits represent true scores, and different assessment methods represent different sources of measurement error that influence the observed score - The MTMMM is an important theoretical device to help visualize and remind us to be mindful of the many methodological artifacts that can affect the validity of obtained scores.

Chi-square Test

AKA - Pearson chi-square - Chi-square test of association - Two-way (between-subjects) chi-square - Test of difference between 2 proportions - not comparing mean scores, comparing frequencies/proportions - comparing 2 proportions to determine if sig diff from one another - like t test but both IV and DV categorical - IV/predictor: categorical variable (2 or more levels) - DV/outcome: categorical variable (2 or more levels) - most common: 2 (IV) x 2 (DV) chi-square Conceptual/inferential - H0: IV not sig. related to DV (participants in the IV categories are not sig. different on the DV categories - due to chance) - H1: IV sig. related to DV (participants in the IV categories are sig. different on the DV categories)

Pearson r (correlation)

AKA - Pearson correlation - product-moment correlation coefficient - correlation (generic term that also refers to various correlation types: Pearson r, point biserial r, and phi coefficient) - Predictor (X): Quantitative - Outcome (Y): Quantitative - participants must provide scores for both variables Conceptual - H0: the correlation is not sig. diff. from a zero correlation - H1: the correlation is sig. diff. from a zero correlation Inferential (generalize from sample to pop.) - H0: ρxy = 0 - H1: ρxy ≠ 0 - why important: correlation is the foundation for understanding multiple regression Two characteristics of a correlation 1) Direction: positive, zero, or negative 2) Strength (effect size): |.00| to |1.00|
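A minimal scipy sketch testing H0: ρxy = 0 (the scores below are made up):

```python
# Hedged sketch: Pearson's r and its p-value for two quantitative variables.
from scipy.stats import pearsonr

hours_studied = [2, 5, 1, 3, 8, 6, 4, 7, 2, 5]
exam_score    = [60, 75, 55, 70, 90, 80, 72, 88, 58, 74]

r, p = pearsonr(hours_studied, exam_score)
print(f"r = {r:.2f}, p = {p:.4f}")   # reject H0 if p < .05
```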

Confidence Intervals (CI) Application 1

Application 1. Determine confidence of your sample in estimating the pop. mean - index of "confidence" (certainty or precision) that the sampling procedures used to obtain your sample (assuming participants were drawn randomly from pop. of interest) capture the pop. mean - Basically, recognizes that a single sample mean (based on data from your study) is imprecise (by providing an interval of certainty-uncertainty due to sampling error) in capturing pop mean - as sample size increases --> sampling error becomes smaller (therefore confidence intervals become narrower: closer to sample mean) --> greater confidence/certainty/precision in estimating the pop. mean - an extremely large sample containing many participants from the pop. will have a narrow CI - 3 components of a confidence interval M = 98 [95% CI: 94 to 102] - M = point estimate (e.g., mean or mean % of sample) - 94 = lower limit - 102 = upper limit Interpreting the 95% CI - correct: based on the sample, 95% likelihood (confidence or certainty) that this computed interval (range) will contain the true pop. mean - incorrect: this interval contains 95% of the values from the pop (instead the confidence interval is calculated based on the sample to estimate the pop mean) Example 1: your study's sample IQ M = 98 [95% CI: 94 to 102] - "95% certainty that the interval of 94 to 102 as calculated based on the sample will capture the true pop. mean" - 5% likelihood that this interval will not capture the true pop. mean Example 2: a sample of 1067 likely voters indicates that 52% will be voting for Biden, with a margin of error of +/- 3% - translation: 52% [95% CI: 49% to 55%] - based on the sample, 95% certainty that the interval of 49% to 55% will capture the true pop. mean of the percent of likely voters who will be voting for Biden - in election language, margin of error is the upper and lower bound for a 95% confidence interval - Confidence interval is an interval estimate for some unknown population characteristic or parameter based on information from a sample - Interval estimate with lower and upper boundaries - Steps to set up CI o Decide on C (level of confidence) - usually 95% o Assuming sample statistic has normally shaped sampling distribution, use critical values from a z or standard normal distrib that correspond to middle 95% of values - e.g., for standard normal distribution middle 95% corresponds to the interval between zlower = -1.96 and zupper = +1.96. (Rounding these z values to -2 and +2 is reasonable when thinking about estimates.) o Find standard error o Compute CI
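A minimal Python sketch of the z-based 95% CI, M +/- 1.96 * (SD / sqrt(n)), using made-up scores in the style of the IQ example:

```python
# Hedged sketch: z-based 95% confidence interval for a sample mean.
import numpy as np
from scipy.stats import norm

scores = np.array([95, 102, 98, 110, 91, 99, 105, 97, 100, 96])
m, sd, n = scores.mean(), scores.std(ddof=1), len(scores)

z = norm.ppf(0.975)            # ~1.96 for a 95% interval
se = sd / np.sqrt(n)           # standard error of the mean
lower, upper = m - z * se, m + z * se
print(f"M = {m:.1f} [95% CI: {lower:.1f} to {upper:.1f}]")
```

A larger n shrinks the standard error, which is why the interval narrows as sample size grows.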

Confidence Intervals (CI) Application 3

Application 3. Determine if 2 sample means are sig. different from each other (equivalent to t test) not significant: CI of the mean difference overlaps with a mean difference of 0 - Group 1: M = 70.00, Group 2: M = 75.00 - Mean difference = 5.00 [95% CI: -1.00 to 9.00] p < .05: CI of the mean difference does not overlap with a mean difference of 0. - Group 1: M = 70.00, Group 2: M = 75.00 - Mean difference = 5.00 [95% CI: 2.00 to 8.00] Example: Group 1 (M = 70.00) and Group 2 (M = 75.00) were significantly different in mean scores, Diff = 5.00 [95% CI: 2.00 to 8.00]. Example 1 (not sig): "A t test compared older males (M = 7.00, SD = 3.00) and females (M = 4.00, SD = 3.00) on the number of blood clots, Mdiff = 3.00 [95% CI: -2.00 to 8.00]." Example 2 (p < .05): "A t test compared the treatment (M = 5.00, SD = 2.00) and control group (M = 2.00, SD = 3.00) on number of days exercising, Mdiff = 3.00 [95% CI: 2.00 to 4.00]." Example 3 (SPSS): An independent t test found that females (M = 6.00, SD = 3.16) scored significantly higher than males (M = 2.00, SD = 1.58) on the outcome measure, Mdiff = 4.00 [95% CI: 0.35 to 7.65].

Effect Size Guidelines (Cohen, 1988 & 1992)

Cohen's d - comparing 2 means - small = 0.20 - medium = 0.50 - large = 0.80 definition: mean diff. in # of standard deviations (Range: -infinity to +infinity) Technical: one group scored 0.33 standard deviation units higher than the other group. APA: The independent t test found that the treatment group (M = 105, SD = 15) scored higher than the control group (M = 100, SD = 15) on intelligence, t(250) = 3.52, p < .05, d = 0.33. Eta-squared - comparing 2 or more means - small = .01 - medium = .06 - large = .14 definition: proportion of variance in DV explained by the IV (Range: .00 to 1.00) Technical: the three groups of the IV explained 25% of the variance in the DV. APA: the one-way ANOVA revealed that the three group means were significantly different overall on intelligence, F(2, 275) = 8.20, p < .05, η² = .25. Post hoc (LSD) tests indicated that group 1 (M = 105, SD = 100) scored higher than group 2 (M = 100, SD = 100) on intelligence, p < .05, d = 0.33. Additionally, group 2...
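A hedged Python sketch (hypothetical data) computing Cohen's d from the pooled SD and eta-squared as SS_between / SS_total for a one-way design:

```python
# Hedged sketch: Cohen's d for two groups and eta-squared for three groups.
import numpy as np
from scipy.stats import f_oneway

g1 = np.array([105., 110, 98, 112, 107, 101])
g2 = np.array([100.,  96, 103, 99,  95, 102])
g3 = np.array([ 93.,  97,  90, 95,  99,  92])

# Cohen's d for groups 1 vs. 2, using the pooled standard deviation
sp = np.sqrt(((len(g1) - 1) * g1.var(ddof=1) + (len(g2) - 1) * g2.var(ddof=1))
             / (len(g1) + len(g2) - 2))
d = (g1.mean() - g2.mean()) / sp

# eta-squared = SS_between / SS_total across the three groups
allscores = np.concatenate([g1, g2, g3])
grand = allscores.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in (g1, g2, g3))
ss_total = ((allscores - grand) ** 2).sum()
eta_sq = ss_between / ss_total

F, p = f_oneway(g1, g2, g3)
print(f"d = {d:.2f}, eta^2 = {eta_sq:.2f}, F = {F:.2f}, p = {p:.4f}")
```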

Lac - Content validity

Content validity is concerned with the extent that the items representing the scale are comprehensively assessing the entire theoretical bandwidth of the construct. Adherence to the attributes, facets, and nuances of the conceptual definition is recommended to enhance content validity. (a) opening a dictionary or encyclopedia and reviewing the standard definitions; (b) scouring previous research to foster insights about how the phenomenon has been conceptualized; (c) conducting interviews or focus groups with experts to solicit information about the construct; or (d) conducting an exploratory factor analysis on preliminary data to determine the number of underlying dimensions, how the dimensions are related, and how the items are related to dimensions. brainstorm with the aim of generating an array of pertinent items to capture fully the idiosyncrasies of the construct Multiple operationalism is the principle behind the creation of multiple items in an effort to establish construct breadth and serves the purpose of triangulating on the concept so that different indicators possess distinct and therefore nonoverlapping sources of measurement error.

Classical Measurement Theory

Example: self-esteem measure (0 to 150 point scale) remember: researcher only sees O (T and E are unknowns to researcher) Random error (chance event) examples - daily mood - sometimes mood higher or lower - sleep quality - sometimes sleep quality worse/better - researcher not paying attention during data entry Systematic Error (biasing event) examples - push scores one direction or the other (positive or negative) Increase mean scores - excellent weather - bright lighting in lab - researcher is smiling a lot at participants Decrease mean scores - snow storm - dark lighting in lab - researcher is frowning a lot at participants - with random error, observed score mean same as true score mean - don't see mean of observed score changing but SD is wider - random errors cancel each other out so that average random error = 0 - systematic error - erroneous perception of reality - what is the real score on the self-esteem scale - observed score mean diff from true score mean - systematic error mean not 0

Lac - measurement validity

Measurement validity refers to the degree of correspondence between an item, measure, scale, or instrument and the underlying theoretical concept. Measurement validity, also known as construct validity or measurement construct validity, is concerned with the extent that an item, measure, scale, or instrument reflects its underlying concept. Desirable measurement validity is evidenced if the measure accurately and appropriately assesses the construct that it is purported to capture. continuum from poor to excellent. deep understanding of the ideas of measurement validity is crucial because not all concepts are directly observable. Researchers, however, could endeavor to quantify these abstract concepts by devising measures based on theoretical and conceptual definitions and administering them to participants Measurement validity is comprised of four main types: content validity, criterion validity, convergent validity, and discriminant validity.

Nonexperimental Methods: Schematic View

Nonexperiment - researcher does not manipulate variables/factors (also includes correlational methods) - often rely on theory from prior research - predictor --> outcome - attitudes --> behavior Quantitative - results described with numbers and statistics. - examples: questionnaire construction, scale validation, psychometrics, quantitative content analysis - statistics: analysis of mean differences (t test, ANOVA, repeated measures ANOVA) - statistics: analysis of associations (correlation, multiple regression, factor analysis) Qualitative - results described with words and text - examples: interviewing, focus groups, grounded theory, qualitative content analysis - no statistics used (quoting text, summarizing themes using words and paragraphs) Mixed Method - qualitative and quantitative - examples: interview that asks qualitative and quantitative questions, mixed content analysis - statistics: analysis of mean differences (t test, ANOVA, repeated measures ANOVA) - statistics: analysis of associations (correlation, multiple regression, factor analysis) - no statistics used (quoting text, summarizing themes using words and paragraphs)

Reliability: Classical Measurement Theory

O = T + Er + Es O (observed scores) - score obtained on a measurement instrument (test, questionnaire, scale, device, etc.) - what we see when calculating scores, entering in SPSS T (true score) - replicability or consistency of measurement instrument - proportion of measurement reliable across infinite "hypothetical" administrations - correct interpretation: extent that a measure is a consistent, reliable assessment of the construct - incorrect interpretation: extent that a measure is a correct (valid) assessment of the construct E (measurement error) - Er (random error): attributed to an unexpected "chance event" that artificially inflates and deflates the observed scores - thereby widening the variability (variance) of the observed scores - random errors cancel out: mean (or total) of observed scores remains unchanged (some participants scoring higher and others lower) - Es (systematic error): attributed to an unexpected "biasing event" that artificially inflates or deflates (not both simultaneously) the observed scores - systematic errors accumulate to either increase or decrease mean (or total) of observed scores - biasing = pushes scores of all participants in one direction - higher or lower - not simultaneously
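A small simulation sketch (hypothetical numbers, not from the readings) showing that adding random error leaves the observed mean near the true-score mean but widens the SD, while adding a systematic error shifts the mean:

```python
# Hedged sketch: O = T + Er + Es on simulated self-esteem-style scores.
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(100, 15, 10_000)     # T

random_error = rng.normal(0, 10, 10_000)      # Er: mean ~ 0, cancels out on average
systematic_error = 5.0                        # Es: e.g., everyone rated a bit higher

o_random = true_scores + random_error
o_biased = true_scores + random_error + systematic_error

print(f"T:            M = {true_scores.mean():6.1f}, SD = {true_scores.std():5.1f}")
print(f"T + Er:       M = {o_random.mean():6.1f}, SD = {o_random.std():5.1f}")   # same M, wider SD
print(f"T + Er + Es:  M = {o_biased.mean():6.1f}, SD = {o_biased.std():5.1f}")   # shifted M
```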

Questionnaires vs. Rating Scales

Questionnaire (e.g., national polls) - a single item to assess each construct/factor - shorter in length: due to limited time availability of participants - each construct of interest tapped by one item Rating Scale (e.g., psychological inventories) - multiple items to assess each construct/factor - longer in length - more time availability of participants - diverse set of items gives a greater sense of content validity to assess construct - triangulating - not overlapping sources of measurement error - multiple items assess each construct (e.g., Big Five trait dimensions)

Statistical Significance vs. Effect Size

Statistical significance (p-value of analysis) - based on NHST (null hypothesis significance testing) - only gives us info that mean difference or relation is stat sig. beyond chance (standard/sampling error) - black and white view of results (choose one): stat sig or not - if N is high enough, any analysis could attain stat. sig! - e.g., N = 100, r = .02, p = .841; N = 10,000, r = .02, p = .046 - does not denote practical importance Effect size (ES) - denotes practical importance (or strength, potency, magnitude, explanatory value) - effect sizes stable regardless of N - effect sizes are the main metric in meta-analysis - increasing sample size helps rule out chance but doesn't impact effect size - stat sig = probability of value of analysis - probability of obtaining diff given null hypothesis - when diff or relation stat sig beyond chance - larger sample size, more likely to rule out chance - narrower confidence interval
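A minimal Python sketch reproducing the spirit of the r = .02 example: with the usual test statistic t = r * sqrt(N - 2) / sqrt(1 - r^2), the same tiny correlation is non-significant at N = 100 but significant at N = 10,000:

```python
# Hedged sketch: p-value for a fixed r at different sample sizes.
import numpy as np
from scipy.stats import t as t_dist

def p_for_r(r, n):
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * t_dist.sf(abs(t), df=n - 2)   # two-tailed p

for n in (100, 10_000):
    print(f"N = {n:>6}: r = .02, p = {p_for_r(0.02, n):.3f}")
```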

