Measurement

Ambiguity

Acquiescence and social desirability to affective items

Attenuation

The correction for attenuation is the observed validity coefficient divided by the square root of the product of the reliability of x and the reliability of y. -It gives the predicted validity if the two measures were perfectly reliable. -Correlations may be attenuated due to unreliability: a correlation coefficient is attenuated, or reduced in strength, by any unreliability present in the two measures being correlated. So, if your test and the criterion test are unreliable, a low validity coefficient (the correlation between the two tests) may not necessarily represent a lack of relationship between the two tests.
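A minimal Python sketch of this correction (not from the original notes; the function name and example values are hypothetical):

```python
import numpy as np

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the validity coefficient if both measures were perfectly reliable.

    r_xy  : observed correlation between test x and criterion y
    rel_x : reliability of scores on x
    rel_y : reliability of scores on y
    """
    return r_xy / np.sqrt(rel_x * rel_y)

# An observed validity of .35 with reliabilities of .70 and .80
print(round(correct_for_attenuation(0.35, 0.70, 0.80), 2))  # ~0.47
```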

Actual Zone of Development

"What I can do". Where the learner is right now. Current level of cognitive development.

Judgment vs Sentiment

Judgment items (cognitive, objective) have correct answers (e.g., in what year was abortion legalized?); achievement and ability tests should almost always use judgment items. Sentiment items (affective, subjective) involve preferences, attitudes, etc. (e.g., should abortion remain legal in the US?). Attitudinal or personality measures can use either judgment or sentiment items.

Interval

(order with equal intervals) With an interval scale, the first three conditions for a ratio scale are still met: all values on the scale are unique, they can be ordered from low to high, and an interval at any point on the scale means the same thing as at other points on the scale (e.g., test scores, temperature in degrees). To convert a ratio scale to an interval scale, we simply rescale it so that it no longer has a meaningful zero point.

Ordinal

(rank order) Tells us the order, but not how much more of a certain trait is present; the values can be ranked from low to high (e.g., Likert scales, letter grades on exams).

Bloom's taxonomy

-A set of three hierarchical models used to classify educational learning objectives into levels of complexity and specificity. Educators have typically used Bloom's taxonomy to inform or guide the development of assessments (tests and other evaluations of student learning), curriculum (units, lessons, projects, and other learning activities), and instructional methods such as questioning strategies. -The goal of an educator using Bloom's taxonomy is to encourage higher-order thought in their students by building up from lower-level cognitive skills.

Test Use

-As a final general consideration, we need to examine the recommended uses for the test we are evaluating. Here are the questions we should ask. -What are the recommended uses or test score inferences or interpretations? Do these uses match the test purpose? Does the test development process support the intended use? And are there appropriate cautions against unsupported test uses?

The Basic Principles for the MTMM are

-Coefficients in the reliability diagonal should consistently be the highest in the matrix. That is, a trait should be more highly correlated with itself than with anything else! This is uniformly true in our example. -Coefficients in the validity diagonals should be significantly different from zero and high enough to warrant further investigation. This is essentially evidence of convergent validity. All of the correlations in our example meet this criterion. -A validity coefficient should be higher than values lying in its column and row in the same heteromethod block. -A validity coefficient should be higher than all coefficients in the heterotrait-monomethod triangles. This essentially emphasizes that trait factors should be stronger than methods factors. -The same pattern of trait interrelationship should be seen in all triangles.

Convergent validity

-Convergent validity takes two measures that are supposed to be measuring the same construct and shows that they are related. -It is a subtype of construct validity -Convergent validity is usually accomplished by demonstrating a correlation between the two measures, although it's rare that any two measures will be perfectly convergent. -Convergent validity is sometimes claimed if the correlation coefficient is above .50, although it's usually recommended at above .70.

Interrater Reliability Correlation

-Correlation (r) can be used with interval scale ratings to estimate interrater reliability. r defines consistency as the extent to which the rank ordering of individuals is consistent across raters. r can be high, despite systematic differences in ratings. The correlation coefficient is often used when ratings are given on an ordinal scale. This may not be entirely appropriate, since the correlation involves calculations of means and standard deviations which should only be used with interval scales. However, it can still be useful for describing interrater consistency. -Correlation coefficients ignore systematic differences b/c it is interested in the ORDER -Correlation can be used with continuous ratings to estimate interrater reliability -If one rater consistently scores an item higher than another; correlation ignores that (means can be way different)

Criterion-Referenced Tests

-Criterion-referenced tests and assessments are designed to measure student performance against a fixed set of predetermined criteria or learning standards—i.e., concise, written descriptions of what students are expected to know and be able to do at a specific stage of their education -If students perform at or above the established expectations—for example, by answering a certain percentage of questions correctly—they will pass the test, meet the expected standards, or be deemed "proficient." On a criterion-referenced test, every student taking the exam could theoretically fail if they don't meet the expected standard; alternatively, every student could earn the highest possible score. Unlike norm-referenced tests, criterion-referenced tests measure performance against a fixed set of criteria

Validation Process

-Define the purpose --For example, early literacy: to measure early literacy and identify students in need of additional reading support -Define the domain --In education (e.g., end-of-year testing), the domain is typically based on a curriculum; in psychology, it is typically based on research and practice. To confirm that this content domain is adequate, or representative of the construct, it should include the items/tasks/behaviors/facets/dimensions that represent the construct, that is, everything that represents the construct. -Write items that fit/sample from the domain (content area, response type, cognitive type, difficulty level, etc.) -Organize items into an outline/blueprint

Validation process

-Develop a theory that defines what the construct is and isn't -To what extent do scores mean what they're supposed to mean? -To what extent do scores represent levels on the construct? -All types of validity fall under construct validity

Factor Analysis

-Factor analysis is a multivariate statistical method used to evaluate relationships among a set of observed variables. -The purpose of factor analysis is to reduce many individual items into a fewer number of dimensions. Factor analysis can be used to simplify data, such as reducing the number of variables in regression models.

Reliability

-Here are some of the questions that you need to ask when evaluating the relevance of reliability evidence for your test purpose. -What types of reliability are estimated? Are these types appropriate given the test purpose? Are the study designs appropriate given the chosen types? Are the resulting reliability estimates strong or supportive?

Validity

-Here are some questions that you need to ask when evaluating the relevance of validity evidence for your test purpose. These are essentially the same as the questions for reliability. -What types of validity evidence are examined? Are these types appropriate given the test purpose? Are the study designs appropriate given the chosen types of validity evidence? Does the resulting evidence support the intended uses and inferences of the test?

Bloom's taxonomy components

-Knowledge: indicates recall of previously learned material. Key verbs include: define terms, identify facts, matching, recognizing, and selecting. When you see verbs like these, you might be looking at a knowledge task. -Understanding: or comprehension indicates understanding of the meaning of material. Key verbs include distinguish between, explain, describe, or summarize, provide examples of. -Application: indicates application of learned material to new situations. Keywords include solve, demonstrate, modify, change, convert, predict. -Analysis: indicates breaking down of material into component parts and understand organizational structure. Key words include: identify parts, analyze relationships, compare and contrast, differentiate, discriminate, distinguish. -Synthesis: indicates ability to combine components or pieces to create new material. Key phrases include: write a story or summary, propose an experiment, compile, compose, rearrange, reconstruct, reorganize. -Evaluation: indicates ability to judge the value of material for a given purpose based on a set of criteria. Examples include: evaluate the quality of an argument, describe the extent to which conclusions are supported by data, appraise, critique, interpret.

Aptitude Tests

-Measure a person's present performance on selected tasks to provide information that can be used to estimate how the person will perform at some time in the future or in a somewhat different situation -Referred to as intelligence testing -"What a person can learn to do"

Achievement Tests

-Measure a person's present performance on tasks to provide information that can be used to identify current level of performance -Used to describe learning or development based on past instruction -"What a person has learned to do"

Interrater Agreement Percentage of Agreement

-Percentage of agreement quantifies the number of times two raters (or one rater on two occasions) agree. -A major flaw with this type of inter-rater reliability is that it doesn't take chance agreement into account and therefore overestimates the level of agreement. This is the main reason why percent agreement shouldn't be used for academic work. -Major benefit: it can be used with any type of measurement scale.
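A minimal Python sketch of percent agreement between two raters (illustrative only; the ratings below are made up):

```python
import numpy as np

def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters give the identical rating."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    return np.mean(rater_a == rater_b) * 100

# Hypothetical ratings from two raters on the same ten responses
a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print(percent_agreement(a, b))  # 80.0
```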

Nomological network

-The entire set of relationships between our construct and other available constructs is sometimes referred to as a nomological network. This network outlines what the construct is, based on what it relates positively with, and what it is not, based on what it relates negatively with. -Validity evidence based on nomological validity is a general form of construct validity. It is the degree to which a construct behaves as it should within a system of related constructs (the nomological network) -Example: As age increases, memory loss increases (constructs are age and memory loss; both can be measured empirically)

Midgley recommends the following steps when reviewing tests:

-Theory, purpose, and construct -Administration, standardization, norm group -Internal consistency reliability -Stability -Convergent validity -Discriminant validity -Construct validity

Issues with Criterion validity

-There are two other challenges associated with criterion validity. First, finding a suitable criterion can be difficult, especially if your test measures a new or not well defined construct. Second, a correlation coefficient is attenuated, or reduced in strength, by any unreliability present in the two measures being correlated. -Criterion validity is limited because it does not actually require that our test be a reasonable measure of the construct, only that it relate strongly with another measure of the construct. The take-home message is that you should never use a criterion relationship as your sole source of validity evidence. -Assumes that the target construct can be represented in another measure, variable, outcome, or process (the criterion)

Study Design

-To interpret and evaluate reliability and validity, we should first consider the strength of the reliability or validity study designs. -Here are some basic questions to ask when evaluating reliability and validity study designs for a test. --Is the study sample representative? --Is the sample randomly or intentionally selected? --Are appropriate age, gender, ethnic, and other groups included? -Regardless of strength or magnitude, reliability and validity coefficients may be irrelevant if they are based on a weak (e.g., non-random or biased) study design or the wrong population.

Divergent validity

-Discriminant (divergent) validity shows that two measures that are not supposed to be related are, in fact, unrelated. Both convergent and divergent validity are a requirement for excellent construct validity. -To show discriminant validity, you could show that there is no correlation at all.

Scoring

-There are a few key questions to ask when evaluating the scoring that is implemented with a test. -What types of scores are produced? That is, what type of measurement scale is used? What is the score scale range? How is meaning given to scores? What type of score referencing is used, and does this seem reasonable? Finally, what kinds of score reporting guidelines are provided, and do they seem appropriate?

Mastery Test

A criterion-referenced test designed to indicate the extent to which the test taker has mastered a given unit of instruction or a single domain of knowledge or skill. Mastery is considered exemplified by those students attaining a score above a particular cut score (i.e., passing score).

Distribution Assumptions

A distribution is called a normal distribution when it is symmetric around a central midpoint and known proportions of scores fall within given distances of the mean. For example, in a normal distribution, roughly 68% of scores should be found within 1 standard deviation of the mean, and frequencies should decrease and taper off as they get further away from the center. A distribution that tapers off to the left but not the right (a long left tail) is described as negatively skewed, whereas tapering to the right but not the left is positive skew. Finally, a distribution that is more peaked than normal is called leptokurtic, with high kurtosis, and a distribution that is less peaked than normal is platykurtic, with low kurtosis.

Double Barreled Questions

A double-barreled question is a question that touches on two or more separate issues or topics but allows only one answer. A double-barreled question is also known as a compound question or double-direct question. In research, they are often created by accident: surveyors often want to explain or clarify certain aspects of their question by adding synonyms or additional information. Although this is often done with good intentions, it tends to make the question confusing and, of course, double-barreled. There is no way of discovering the true intentions of the respondent from the data afterward, which basically renders it useless for analysis. Example: Is this tool interesting and useful?

Speeded Test

A test in which performance is primarily measured by the time to perform a specified task or by the number of tasks performed in an allotted amount of time. A speed test also refers to a test scored for accuracy, while the test taker works under time pressure. Typing tests and tests of reading speed (e.g., number of words per minute) are two examples of speed tests. In an educational testing context, the item difficulties of a speed test are generally such that given no specified time limit, all test takers should be able to complete all test items correctly. Contrast to Power Test.

True/False

A true or false question consists of a statement that requires a true or false response. Effective true or false questions are fact-based and are designed to quickly and efficiently test learner knowledge about a particular idea or concept. It is considered a selected-response format and is best suited for categorical knowledge. -Advantages: quick and easy to score; objectivity in scoring; allows more items to be included -Issues/Disadvantages: considered to be "one of the most unreliable forms of assessment" because it is hard to write good items; items are often written so that most of the statement is true except for one small, often trivial, bit of information that makes the whole statement untrue; encourages guessing and rewards correct guesses

Normal curve equivalent

A way of standardizing scores received on a test into a 0-100 scale similar to percentile rank, but preserving valuable equal interval properties of a z-score NCE scores have a major advantage over percentile rank scores in that they can be averaged. That is an important characteristic when studying overall school performance, and in particular, in measuring school‐wide gains and losses in student achievement
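A rough Python sketch of the conversion, assuming the conventional NCE scale (mean 50, SD 21.06, reported 1-99); computing z-scores directly from the sample at hand is a simplification, since operational NCEs are derived from national percentile ranks:

```python
import numpy as np

def normal_curve_equivalent(scores):
    """Rescale z-scores to the NCE metric: mean 50, SD 21.06, reported 1-99."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return np.clip(50 + 21.06 * z, 1, 99)

print(np.round(normal_curve_equivalent([60, 70, 80, 90, 100]), 1))
```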

Acquiescence

Acquiescence refers to a tendency for examinees to agree with items regardless of their content. The pattern may result from an underlying examinee disinterest and lack of involvement, or from a desire simply to respond in the affirmative. Whatever the cause, the result is consistent endorsement of items. One way to identify and potentially reduce acquiescence is to use both positively and negatively worded items. Examinees who acquiesce may notice the shift in emphasis from positive to negative and respond more accurately. Examinees who endorse both positively and negatively worded items will have inconsistent scores that can be used to identify them as invalid.

Advantages and Disadvantages of MTMM

Advantages The MTMM idea provided an operational methodology for assessing construct validity. In the one matrix it was possible to examine both convergent and discriminant validity simultaneously. By its inclusion of methods on an equal footing with traits, Campbell and Fiske stressed the importance of looking for the effects of how we measure in addition to what we measure. And, MTMM provided a rigorous framework for assessing construct validity. Disadvantages -MTMM requires that you have a fully-crossed measurement design - each of several traits is measured by each of several methods. -Second, the judgmental nature of the MTMM may have worked against its wider adoption (although it should actually be perceived as a strength) -Finally, the judgmental nature of MTMM meant that different researchers could legitimately arrive at different conclusions.

Internal Consistency

After conducting the scoring procedures and examining item difficulty and item discrimination, we then consider the internal consistency reliability for our test with coefficient alpha, which is a measure of internal consistency. A high coefficient alpha tells us that people tend to respond in similar ways from one item to the next. If coefficient alpha were perfectly 1, we would know that each person responded in exactly the same way across all items. We use alpha in an item analysis to identify items that contribute to the internal consistency of the item set. Items that detract from the internal consistency should be removed.

Item Bank

An item bank is a term for a repository of test items that belong to a testing program, as well as all information pertaining to those items. Item banking refers to the process of storing items for use in future potentially undeveloped forms of a test.

Central Tendency

Central tendency provides statistics that describe the middle, most common, or most normal value in a distribution. The mean, which is technically only appropriate for interval or ratio scaled variables, is the score that is closest to all other scores. The mean also represents the balancing point in a distribution, so that the further a score is from the center, the more pull it will have on the mean in a given direction. It DOES NOT FULLY DESCRIBE A DISTRIBUTION OF SCORES (2 sets of scores could both have a mean of 100 but the variability could be completely different)

Absolute vs Comparative Responses

Comparative responses relate two or more stimuli (e.g., which of the following best describes...social interactions, advance career opportunities), whereas absolute responses concern only a single stimulus; absolute items ask respondents to rate some attribute using response options such as Likert scales.

Synthesis

Compile information together in a different way by combining elements in a new pattern or proposing alternative solutions. "Combine..."

Confirmatory Factor Analysis

Confirmatory factor analysis or CFA is used to confirm an appropriate factor structure for an instrument. Whereas EFA provides tools for exploring factor structure, it does not allow us to modify specific features of our model beyond the number of factors. Furthermore, EFA does not formally support the testing of model fit or statistical significance in our results. CFA extends EFA by providing a framework for proposing a specific measurement model, fitting the model, and then testing statistically for the appropriateness or accuracy of the model given our instrument and data.

Correlation vs covariation

Covariability, similar to variability, describes how much scores are spread out or differ from one another, but it takes into account how similar these changes are for each person from one variable to the other. As changes are more consistent across people from one variable to the other, covariability increases. Covariability is most often estimated using the covariance and correlation. Covariability is calculated using two score distributions, which are referred to as a bivariate score distribution. The covariance is the bivariate equivalent of the variance for a univariate distribution, and it is calculated in much the same way. Correlations are commonly used to index the strength and direction of the linear relationship between two variables. Both describe the degree of similarity between two variables or sets of variables. Correlation is dimensionless (it is the covariance standardized by the two standard deviations), whereas covariance is expressed in units obtained by multiplying the units of the two variables; the covariance of a variable with itself is the variance (X = Y). The covariance is the expected value of the product of the deviations of two random variables from their expected values.
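A small numpy illustration of the relationship (the data are made up): the correlation is just the covariance rescaled by the two standard deviations.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cov_xy = np.cov(x, y)[0, 1]        # covariance, in units of x times y
r_xy = np.corrcoef(x, y)[0, 1]     # correlation, dimensionless
print(cov_xy, r_xy)                # approximately 4.0 and 0.8

# Standardizing the covariance by the two standard deviations gives the correlation
print(cov_xy / (x.std(ddof=1) * y.std(ddof=1)))
```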

Criterion-Related Validity

Criterion validity (or criterion-related validity) measures how well one measure predicts an outcome for another measure. A test has this type of validity if it is useful for predicting performance or behavior in another situation (past, present, or future). There are three types of criterion validity: predictive, concurrent, and postdictive.

Cronbach's Alpha

Cronbach's alpha is a measure of internal consistency, that is, how closely related a set of items are as a group. It is considered to be a measure of scale reliability. -It is used under the assumption that you have multiple items measuring the same underlying construct -Cronbach's alpha does come with some limitations: scores that have a low number of items associated with them tend to have lower reliability, and sample size can also influence your results for better or worse. -Coefficient alpha is popular because it can be calculated directly for a single test administration and it does not require that a test be split into two equal halves. Splitting a test equally can be difficult, because the split-half reliability will be impacted by how similar the two chosen half-tests are. -Coefficient alpha avoids this problem by estimating reliability using the item responses themselves as many miniature versions of the total test. -Alpha assumes that the items are unidimensional, but it is not an index of one-dimensionality.
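A minimal sketch of how coefficient alpha can be computed from a persons-by-items score matrix (the function name and data are illustrative, not from the original notes):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a persons-by-items matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 5 people x 4 items on a 1-5 scale
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 4, 4, 5],
          [1, 2, 1, 2],
          [3, 3, 4, 3]]
print(round(cronbach_alpha(scores), 2))
```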

Interaction of Testing and Treatment

Does testing or measurement itself make the groups more sensitive or receptive to the treatment? If it does, then the testing is in effect a part of the treatment, it's inseparable from the effect of the treatment. This is a labeling issue (and, hence, a concern of construct validity) because you want to use the label "program" to refer to the program alone, but in fact it includes the testing.

Trait-method unit

Each task or test used in measuring a construct is considered a trait-method unit; in that the variance contained in the measure is part trait, and part method. Generally, researchers desire low method specific variance and high trait variance.

Systematic Error

Error that is consistent across trials; the error occurs consistently across testings. A systematic error is one that influences a person's score in the same way at every repeated administration of a test. -Variation from trial to trial in the individual's response to the task at a particular moment -Variation in the individual from one time to another -Not "true error" -Systematic errors are consistently in the same direction -Systematic error is difficult to detect, and therefore difficult to prevent. To avoid these types of error, know the limitations of your equipment and understand how the measurement works; this can help you identify areas that may be prone to systematic errors.

Random Error

Error that is not consistent across trials. A random error is one that could be positive or negative for a person, one that changes randomly by administration. -You can't predict random error and these errors are usually unavoidable. -Random errors produce different values in random directions. -Random error can be reduced by: --Using an average measurement from a set of measurements, or --Increasing sample size.

Six major considerations when examining a construct's validity

Evaluation of convergent validity Evaluation of discriminant (divergent) validity Trait-method unit Multitrait-multimethod Truly different methodology Trait characteristics

Exploratory Factor Analysis

Exploratory factor analysis or EFA is used to explore the factor structure of a test or instrument. We may not know or understand well the number and types of factors that explain correlations among our items and how these factors relate to one another. EFA is a statistical technique used to reduce data to a smaller set of summary variables and to explore the underlying theoretical structure of the phenomena. It is used to identify the structure of the relationship between the variables and the respondents.

Item Discrimination

How well an item distinguishes between people low versus high on the construct or criterion -item discrimination tells us how item difficulty changes for individuals of different abilities. Discrimination extends item difficulty by describing mean item performance in relation to individuals' levels of the construct. -Highly discriminating cognitive items are easier for high ability students, but more difficult for low ability students. -Determined by slope

Intelligence quotients

IQ scores, 100 is average, SD of 15

IRT Overview

IRT addresses the limitations of CTT: the limitations of sample and test dependence and a single constant SEM. As in CTT, IRT also provides a model of test performance. However, the model is defined at the item level, meaning there is, essentially, a separate model equation for each item in the test. So, IRT involves multiple item score models, as opposed to a single total score model.

When the assumptions of the model are met, IRT parameters are, in theory, sample and item independent. This means that a person should have the same ability estimate no matter which set of items she or he takes, assuming the items pertain to the same test. And in IRT, a given item should have the same difficulty and discrimination no matter who is taking the test.

IRT also takes into account the difficulty of the items that a person responds to when estimating the person's ability or trait level. Although the construct estimate itself, in theory, does not depend on the items, the precision with which we estimate it does depend on the items taken. Estimates of the ability or trait are more precise when they're based on items that are close to a person's construct level. Precision decreases when there are mismatches between person construct and item difficulty. Thus, SEM in IRT can vary by the ability of the person and the characteristics of the items given.

The main limitation of IRT is that it is a complex model requiring much larger samples of people than would be needed to utilize CTT. Whereas in CTT the recommended minimum is 100 examinees for conducting an item analysis (see Chapter 6), in IRT as many as 500 or 1,000 examinees may be needed to obtain stable results, depending on the complexity of the chosen model.

Another key difference between IRT and CTT has to do with the shape of the relationship that we estimate between item score and construct score. The CTT discrimination models a simple linear relationship between the two, whereas IRT models a curvilinear relationship between them.

Confounding Constructs and Levels of Constructs

Imagine a study to test the effect of a new drug treatment for cancer. A fixed dose of the drug is given to a randomly assigned treatment group and a placebo to the other group. No treatment effects are detected. Perhaps the result that's observed is only true for that dosage level. Slight increases or decreases of the dosage may radically change the results. In this context, it is not "fair" for you to use the label for the drug as a description for your treatment because you only looked at a narrow range of doses. Like the other construct validity threats, this is essentially a labeling issue: your label is not a good description for what you implemented. This threat to construct validity occurs when other constructs, or other levels of the construct, mask the effects of the measured construct.

Differential Item Functioning

In an option analysis, we examine categorical frequency distributions for each response option by ability groups. In DIF, we examine these same categorical frequency distributions, but by different demographic groups, where all test takers in the analysis have the same level on the construct. The presence of DIF in a test item is evidence of potential bias, as, after controlling for the construct, demographic variables should not produce significant differences in test taker responses. DIF occurs when groups (such as those defined by gender, ethnicity, age, or education) have different probabilities of endorsing a given item on a multi-item scale after controlling for overall scale scores.

Interrater Reliability Intraclass correlation Coefficient (G-Coefficient)

In most cases, systematic differences in ratings do matter, and we want to know how much of an impact they have on our ratings. A simple analysis of mean ratings across raters could be used to supplement a correlation coefficient if this is a concern. However, it's best to use another reliability index in place of the correlation coefficient. The intraclass correlation coefficient (ICC), also known as the G-coefficient from generalizability (G) theory, is the most flexible consistency index because it can be calculated in different ways to either account for or ignore systematic score differences, and it also works with ordered score categories. Thus, the ICC addresses the limitations of both the agreement indices and the correlation coefficient. It does so by breaking down the observed score into different components, much like CTT reliability. Examples: how closely relatives resemble each other with regard to a certain characteristic or trait, or the reproducibility of numerical measurements made by different people measuring the same thing. In G theory terms, the G-coefficient reflects the "relative" amount of variation associated with a given facet or its associated interactions.

Split-Half reliability

In split-half reliability, a test for a single knowledge area is split into two parts and then both parts are given to one group of students at the same time. The scores from both parts of the test are correlated. A reliable test will have high correlation, indicating that a student would perform equally well (or as poorly) on both halves of the test. -Split-half testing is a measure of internal consistency, that is, how well the test components contribute to the construct that's being measured. It is most commonly used for multiple-choice tests, though you can theoretically use it for any type of test, even tests with essay questions. -One drawback with this method: it only works for a large set of questions (a 100-point test is recommended) which all measure the same construct or area of knowledge. For example, a personality inventory that measures introversion, extroversion, depression, and a variety of other personality traits is not a good candidate for split-half testing.
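A minimal sketch of a split-half estimate in Python, splitting into odd- and even-numbered items and then stepping the half-test correlation up with the Spearman-Brown formula (the Spearman-Brown step and the data are not from the original notes):

```python
import numpy as np

def split_half_reliability(items):
    """Correlate odd- and even-item half-test scores, then apply Spearman-Brown."""
    items = np.asarray(items, dtype=float)
    odd_half = items[:, 0::2].sum(axis=1)
    even_half = items[:, 1::2].sum(axis=1)
    r_halves = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_halves / (1 + r_halves)   # step up to full test length

# Hypothetical 0/1 scores: 5 people x 6 items
scores = [[1, 1, 0, 1, 1, 0],
          [0, 1, 0, 0, 1, 0],
          [1, 1, 1, 1, 1, 1],
          [0, 0, 0, 1, 0, 0],
          [1, 0, 1, 1, 1, 1]]
print(round(split_half_reliability(scores), 2))
```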

Messick's Unified View

In the early 1980s, the three types of validity were reconceptualized as a single construct validity (which came to be known as Messick's unified view). This reconceptualization clarifies how content and criterion evidence do not, on their own, establish validity. Instead, both contribute to an overarching evaluation of construct validity. -According to this view, test validity is something of a misnomer, since what is validated are interpretations of test scores and the uses of those scores for particular purposes, not the test itself. -Focusing on content or criterion validity is not enough; instead, a comprehensive evaluation of validity is required. -Validation is an ongoing process where evidence supporting test use is accumulated over time from multiple sources. -In this view, validity is a matter of degree instead of being evaluated as a simple and absolute yes or no.

Inter-rater Reliability

Inter-rater reliability is the level of agreement between raters or judges. If everyone agrees, IRR is 1 (or 100%) and if everyone disagrees, IRR is 0 (0%). There are four commonly used indices of interrater agreement and reliability. The purpose is to estimate the consistency (reliability or unreliability of SEM) due to the raters.

Point-Biserial

Is used when one variable is dichotomous and one variable is continuous (gender and height). An approach for examining item discrimination, it is the correlation between item responses and construct scores or total scores (total scores are typically used). The resulting correlation is referred to as an item-total correlation or sometimes called a point-biserial correlation.
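A small Python sketch of an item-total (point-biserial) correlation; note that some item analyses use a corrected version that excludes the item from the total, which this simple version does not (data are made up):

```python
import numpy as np

def item_total_correlation(responses, item):
    """Point-biserial: correlation between one 0/1 item and total test scores."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)               # total score for each person
    return np.corrcoef(responses[:, item], total)[0, 1]

# Hypothetical scored responses: 5 people x 4 items
scores = [[1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0],
          [1, 0, 1, 1]]
print(round(item_total_correlation(scores, item=0), 2))
```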

Measurement error

Measurement error lowers reliability so it also lowers validity because it is not consistently measuring the construct, thus it is MISREPRESENTING THE CONSTRUCT.

True Score Theory/Classical Test Theory

It is a theory of testing based on the idea that a person's observed or obtained score on a test is the sum of a true score (error-free score) and an error score Classical test theory assumes that each person has a true score,T, that would be obtained if there were no errors in measurement. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test. Unfortunately, test users never observe a person's true score, only an observed score, X. It is assumed that observed score = true score plus some error: X = T + E (observed score = true score + error)
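A tiny simulation of the X = T + E model, assuming independent, normally distributed errors (the numbers are arbitrary); it also shows that reliability can be framed as true-score variance over observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

true_scores = rng.normal(loc=50, scale=10, size=1000)   # T: error-free scores
errors = rng.normal(loc=0, scale=5, size=1000)          # E: random error, mean zero
observed = true_scores + errors                         # X = T + E

# Reliability as true-score variance over observed-score variance (expected ~0.80 here)
print(true_scores.var(ddof=1) / observed.var(ddof=1))
```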

Desirable Properties

It should have a clear purpose. The study design should be strong: -Is the sample representative? -Is the sample randomly or intentionally selected? -Is there clear standardization? Evaluate the reliability evidence for your test purpose: -What types of reliability are estimated? -Are these types appropriate given the test purpose? -Are the study designs appropriate given the chosen types? Evaluate the validity evidence for your test purpose: -What types of validity evidence are examined? -Are these types appropriate given the test purpose? -Are the study designs appropriate given the chosen types of validity evidence? -Does the resulting evidence support the intended uses and inferences of the test? Evaluate the scoring that is implemented with a test: -What types of scores are produced? -What measurement scale type and score scale range are used?

Piloting

Item analysis is typically conducted with pilot data. Create an initial pool of items, twice as large as the final number needed and obtain a sample of at least 100 examinees from the population.

Item Response vs Scale Score

The item response is the raw score; the scale score allows the test to be compared to others.

Interrater Agreement Kappa

Kappa is an adjusted percentage agreement index that takes chance agreement into account. It is expressed as a proportion, rather than a percentage, so we never multiply by 100 as with percentage agreement. Remember that kappa involves removing chance agreement from the observed agreement, and then dividing this observed non-chance agreement by the total possible non-chance agreement. Percentage agreement and kappa ignore ratings that are at different levels of agreement, for example, nearly in agreement versus completely in disagreement. This issue has to do with the fact that percentage agreement and kappa ignore any ordinal properties of the rating scales that are being used. Kappa utilizes only categorical data (other data can be turned into categories). It reflects the extent of the match between raters as they assign scores to performances, behaviors, or essays.
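A minimal Python sketch of Cohen's kappa for two raters, computing chance agreement from each rater's marginal proportions (function name and ratings are illustrative):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, adjusted for chance agreement."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print(round(cohens_kappa(a, b), 2))  # observed 0.80, chance 0.34, kappa ~0.70
```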

KuderRichardson Formula 20

Kuder-Richardson Formula 20, or KR-20, is a measure of reliability for a test with binary items (i.e., answers that are right or wrong). Reliability refers to how consistent the results from the test are, or how well the test is actually measuring what you want it to measure. -The KR-20 is used for items that have varying difficulty. For example, some items might be very easy, others more challenging. It should only be used if there is a correct answer for each question; it shouldn't be used for questions where partial credit is possible or for scales like the Likert scale. -If you have a test with more than two answer possibilities (or opportunities for partial credit), use Cronbach's alpha instead. -The scores for KR-20 range from 0 to 1, where 0 is no reliability and 1 is perfect reliability. The closer the score is to 1, the more reliable the test (generally anything .5 or above is acceptable). -KR-20 = [n/(n-1)] * [1 - (Σpq)/Var], where n = number of items, p = proportion of people passing the item, q = proportion of people failing the item, and Var = variance of the total scores.
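A minimal sketch of the KR-20 formula in Python for a persons-by-items matrix of 0/1 scores (function name and data are hypothetical):

```python
import numpy as np

def kr20(items):
    """KR-20 for a persons-by-items matrix of 0/1 (incorrect/correct) scores."""
    items = np.asarray(items, dtype=float)
    n = items.shape[1]                         # number of items
    p = items.mean(axis=0)                     # proportion passing each item
    q = 1 - p                                  # proportion failing each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n / (n - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical scored responses: 5 people x 5 items
scores = [[1, 1, 0, 1, 1],
          [0, 1, 0, 0, 1],
          [1, 1, 1, 1, 1],
          [0, 0, 0, 1, 0],
          [1, 0, 1, 1, 1]]
print(round(kr20(scores), 2))
```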

Evaluation Apprehension

Many people are anxious about being evaluated. Some are even phobic about testing and measurement situations. If their apprehension makes them perform poorly (and not your program conditions) then you certainly can't label that as a treatment effect. Another form of evaluation apprehension concerns the human tendency to want to "look good" or "look smart" and so on. If, in their desire to look good, participants perform better (and not as a result of your program!) then you would be wrong to label this as a treatment effect. In both cases, the apprehension becomes confounded with the treatment itself and you have to be careful about how you label the outcomes.

The Kendall Rank Correlation

Measures the strength of dependence between two random variables. Kendall's rank correlation can be used for further statistical analysis when a Spearman correlation rejects the null hypothesis. Pairs in which one variable's value increases while the other's decreases are referred to as discordant pairs; pairs in which both variables move in the same direction are referred to as concordant pairs. The statistic reflects the balance of concordant versus discordant pairs.

Knowledge:

Memory of learned materials—recalling facts, terms, basic concepts, and answers. "What are..."

Mono-Method Bias

Mono-method bias refers to your measures or observations, not to your programs or causes. Otherwise, it's essentially the same issue as mono-operation bias. With only a single version of a self esteem measure, you can't provide much evidence that you're really measuring self esteem. Your critics will suggest that you aren't measuring self esteem - that you're only measuring part of it, for instance. Solution: try to implement multiple measures of key constructs and try to demonstrate (perhaps through a pilot or side study) that the measures you use behave as you theoretically expect them to.

Mono-Operation Bias

Mono-operation bias pertains to the independent variable, cause, program or treatment in your study - it does not pertain to measures or outcomes. If you only use a single version of a program in a single place at a single point in time, you may not be capturing the full breadth of the concept of the program. If you conclude that your program reflects the construct of the program, your critics are likely to argue that the results of your study only reflect the peculiar version of the program that you implemented, and not the actual construct you had in mind. Solutions: Try to implement multiple versions of your program.

Multitrait-multimethod

More than one trait and more than one method must be used to establish (a) discriminant validity and (b) the relative contributions of the trait or method specific variance.

Hypothesis Guessing

Most people don't just participate passively in a research project. They are trying to figure out what the study is about. They are "guessing" at what the real purpose of the study is. And, they are likely to base their behavior on what they guess, not just on your treatment. In an educational study conducted in a classroom, students might guess that the key dependent variable has to do with class participation levels. If they increase their participation not because of your program but because they think that's what you're studying, then you cannot label the outcome as an effect of the program. It is this labeling issue that makes this a construct validity threat.

Construct Validity

Evidence includes the nomological network, convergent validity, and divergent validity. Construct validity is the extent to which the measure 'behaves' in a way consistent with theoretical hypotheses and represents how well scores on the instrument are indicative of the theoretical construct. Construct validity is about ensuring that the method of measurement matches the construct you want to measure.

Norm-Referenced Tests

Norm-referenced tests report whether test takers performed better or worse than a hypothetical average student, which is determined by comparing scores against the performance results of a statistically selected group of test takers, typically of the same age or grade level, who have already taken the exam.

Observed (Measured) Variable

Observed variables (sometimes called observable variables or measured variables) are actually measured by the researcher. They are data that actually exists in your data files—data that has been measured and recorded. They can be discrete variables or continuous variables. Example: Let's say you were analyzing results from a major depression inventory. Feelings of sadness, lack of interest in daily activities and lack of self-confidence are all measured by the inventory and are therefore observed variables

Scoring - Item Difficulty

Once we have established a scoring scheme for each item in our test, and we have collected response data from a sample of individuals, we can start talking about the first statistic in an item analysis: item difficulty.
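Under CTT, item difficulty is usually summarized as the proportion of examinees answering the item correctly (the p-value). A small numpy sketch with made-up data:

```python
import numpy as np

# Hypothetical scored responses (1 = correct, 0 = incorrect): 6 people x 4 items
scores = np.array([[1, 0, 1, 1],
                   [1, 0, 0, 1],
                   [1, 1, 1, 1],
                   [0, 0, 0, 1],
                   [1, 0, 1, 1],
                   [1, 1, 1, 1]])

# Classical item difficulty: proportion correct per item (higher = easier item)
p_values = scores.mean(axis=0)
print(np.round(p_values, 2))
```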

Reverse Coding

One common validation technique for survey items is to rephrase a "positive" item in a "negative" way. When done properly, this can be used to check if respondents are giving consistent answers. Tends to work best when items are dichotomously scored
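A minimal sketch of reverse scoring on a rating scale; the scale bounds below are an assumption for illustration:

```python
import numpy as np

def reverse_score(responses, scale_min=1, scale_max=5):
    """Reverse-score a negatively worded item on a fixed rating scale."""
    return (scale_max + scale_min) - np.asarray(responses)

# On a 1-5 scale, 5 becomes 1, 4 becomes 2, and so on
print(reverse_score([1, 2, 3, 4, 5]))   # [5 4 3 2 1]
```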

Open Response

Open-ended questions are free-form survey questions that allow respondents to answer in open text format so that they can answer based on their complete knowledge, feeling, and understanding. The response to this type of question is not limited to a set of options. -Advantages: students can demonstrate knowledge, skills, and abilities in various ways; they invite longer responses that demonstrate understanding. -Disadvantages: difficult to grade; need to create a rubric.

Evaluation

Present and defend opinions by making judgments about information, validity of ideas or quality of work based on a set of criteria "Do you feel/believe that..."

Affective vs Cognitive

Properties of good affective items: they reflect the construct of interest and should have a positive (monotonic) relationship with the construct; a positive relationship indicates that higher levels of the attribute are associated with higher item responses. Cognitive items are dichotomously scored, so more items will be needed to obtain a given reliability value; correlations among cognitive items are lower than among multi-category items, so reliability will tend to be lower.

The common sources of unreliability.

Random error and systematic error

The distinction between the reliability and validity of tests versus scores.

Reliability refers to the accuracy or precision of the measurement procedure. Reliability gives an indication of the extent to which the scores produced by a particular measurement procedure are consistent and reproducible. Test scores can be reliable but not valid, because validity asks about "the degree to which a test measures the construct it purports to measure."

What affects reliability and what reliability tells us about scores from a test.

Reliability refers to the accuracy or precision of the measurement procedure. Reliability gives an indication of the extent to which the scores produced by a particular measurement procedure are consistent and reproducible.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

Repetition of the measurement over time with the same subjects; appropriate only if the first measurement does not affect the second. -An estimate of the stability of our measurement over time. Given that the same test is given twice, any measurement error will be due to the passage of time, rather than differences between the test forms. -Satisfactory and applicable for simple aspects of behavior or weight -Counterbalancing can help with the test-retest effect -Reliability is judged by the correlation between the two administrations

Response Sets

Response sets describe patterns of response that introduce bias into the process of measuring noncognitive constructs via self-reporting. Bias refers to systematic error that has a consistent and predictable impact on responses. The main response sets include social desirability, acquiescence, extremity, and neutrality. Extremity and neutrality refer to tendencies to over-use the extreme response options or the neutral middle options, respectively. In both cases, the underlying problem is an inconsistent interpretation and use of a rating scale across examinees. To reduce extremity and neutrality, use dichotomous response options, for example, yes/no, where only the extremes of the scale are available. A response set is a stylistic pattern of behavior enacted in one's replies to items on a psychological test or inventory.

Standardized scores

Scores that are transformed to be compared to one another on a normal distribution. Scores that reference distance, in standard deviation units, above or below the mean

Scoring - Item Discrimination

Scoring item responses also requires that some direction be given to the correctness or amount of trait involved. Thus, at the item level, we are at least using an ordinal scale. In educational testing, the direction is simple - increases in correctness correspond to increases in points. In psychological testing, reverse scoring may be necessary. (item discrimination)

Survey vs Scale Items

Survey items - represent a single dimension or attribute of interest whereas Scale items - capture some aspect of the attribute, but only represent the attribute when aggregated with other items (psychological traits, educational achievement)

T-scores

Tells us how far a score is from the mean, similar to z-scores, and allows us to make comparisons within groups or within the population. Computed from sample scores.
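A small numpy sketch converting raw scores to z-scores and then to T-scores, assuming the conventional T-score scale of mean 50 and SD 10 (not stated in the card above; the data are made up):

```python
import numpy as np

scores = np.array([55.0, 60.0, 65.0, 70.0, 85.0])

z = (scores - scores.mean()) / scores.std(ddof=1)   # z-scores: mean 0, SD 1
t = 50 + 10 * z                                     # T-scores: mean 50, SD 10
print(np.round(t, 1))
```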

Evaluation of convergent validity

Tests designed to measure the same construct should correlate highly amongst themselves.

Angoff Scales

The Angoff score is the lowest pass/fail cutoff score that a minimally qualified candidate is likely to achieve on a test. The score is calculated by using a panel of experts to determine the difficulty of each test question included in an assessment. The sum of the predicted difficulty values for each item, averaged across the judges and items on a test, is the recommended Angoff cut score. -Advantages: using the Angoff method ensures that the passing grade of a test is determined empirically; simplicity, as the method is easy to understand and compute. -Disadvantages: the difficulty of uniformly conceptualizing a minimally qualified candidate; the difficulty judges have in predicting the performance of a minimally qualified candidate; a lack of reliability in the standards across multiple settings and judges; and the potential of arriving at an unrealistic standard.

Item response function

The IRT model for a given item has a special name in IRT: it's called the item response function (IRF), because it can be used to predict an item response. Each item has its own IRF. We can add up all the IRFs in a test to obtain a test response function (TRF) that predicts not item scores but total scores on the test. The related item information function shows the amount of information each item provides; in the simplest case it is calculated by multiplying the probability of a correct response by the probability of an incorrect response. The maximum amount of information is provided when the probabilities of answering correctly and incorrectly are equal, i.e., 50%. Items are most informative among respondents across the latent continuum, especially among those who have a 50% chance of answering either way.
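A minimal sketch of a two-parameter logistic (2PL) IRF and its item information function; the card's description (probability correct times probability incorrect) corresponds to the one-parameter case where the discrimination a equals 1. The parameter values below are made up:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: probability of a correct response.

    theta : person ability
    a     : item discrimination
    b     : item difficulty
    """
    return 1 / (1 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Item information for the 2PL: a^2 * P * (1 - P); maximized where P = 0.5."""
    p = irf_2pl(theta, a, b)
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)                      # a range of ability levels
print(np.round(irf_2pl(theta, a=1.2, b=0.0), 2))
print(np.round(item_information_2pl(theta, a=1.2, b=0.0), 2))
```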

Phi Coefficient

The phi coefficient is a measure of association between two binary variables (e.g., living/dead, black/white, success/failure). It is also called the Yule phi or mean square contingency coefficient and is used for contingency tables when at least one variable is nominal and both variables are dichotomous.

Attenuation Paradox

The attenuation paradox refers to the increase in test validity that accompanies increasing test reliability up to a point beyond which validity decreases with further increases in reliability In other words, Up to a point, reliability and validity increase together, but then any further increase in reliability decreases validity. The attenuation paradox appears most clearly in the context of item selection and test construction. In practice, the problem is how to select those items that will simultaneously increase both the reliability and validity of the total test scores.

Evaluation of discriminant (divergent) validity

The construct being measured by a test should not correlate highly with different constructs.

Reliability coefficient

The correlation coefficient that provides a statistical index of the extent to which two measurements tend to agree or to place each person in the same relative position. When the correlation is between two applications of the same measure, it gives us an indicator of reliability.

Content Validity

The degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose -Assumes the target construct can be broken down into elements, and that we can take a representative sample of these elements -In other words, content validity is the extent to which a test "correlates with" (i.e., corresponds to) the content domain. So, in content validity we compare our test to the content domain, and hope for a strong relationship -For example, a depression scale may lack content validity if it only assesses the affective dimension of depression but fails to take into account the behavioral dimension.

Differences between IRT and CTT

The main limitation of IRT is that it is a complex model requiring much larger samples of people and items than would be needed to utilize CTT. One difference between IRT and CTT: CTT looks at the entire test, whereas modern test theory such as IRT looks at specific items. Another difference: IRT models a curvilinear relationship between item score and construct score, whereas CTT models a simple linear relationship between the two. Another difference: in CTT we focus only on person ability in the model itself, whereas in IRT we include person ability and item parameters (difficulty, discrimination, and lower asymptote).

Construct (Latent) Variable

The opposite of an observed variable is a latent variable, also referred to as a factor or construct. A latent variable is hidden, and therefore can't be observed. An important difference between the two types of variables is that an observed variable usually has a measurement error associated with it, while a latent variable does not.

Experimenter/Researcher Expectancies

The researcher can bias the results of a study in countless ways, both consciously or unconsciously. Sometimes the researcher can communicate what the desired outcome for a study might be (and participant desire to "look good" leads them to react that way). For instance, the researcher might look pleased when participants give a desired answer. If this is what causes the response, it would be wrong to label the response as a treatment effect.

Item Pool

This consists of items that are written for our instrument during the piloting phase. The initial pool of items should be at least twice as large as the final number of items needed. The purpose of an item pool is to see how well the items work in the piloting phase.

Option Analysis/Distractor Analysis

This is an analysis of response option choices by ability groups. Calculate the frequency distributions for unscored items as categorical variables. Frequencies for certain response options should follow certain trends for certain ability groups. Relationships between the construct and response patterns over keyed and unkeyed options can give us insights into whether or not response options are functioning as intended. Our main goal in distractor analysis is to identify dysfunctional and/or useless distractors, ones which do not provide us with any useful information about examinees. In practice, this can be paired with a reliability analysis (e.g., SPSS output), dropping items one by one while being cautious about the attenuation paradox. The distractor analysis provides a measure of how well each of the incorrect options contributes to the quality of a multiple-choice item.

Trait characteristics

Traits should be different enough to be distinct, but similar enough to be worth examining in the MTMM.

Construct representation

Two types of threats to content validity: content underrepresentation and content misrepresentation. These can both be extended to more broadly refer to construct underrepresentation and construct misrepresentation. In the first, we fail to include all aspects of the construct in our test. In the second, our test is impacted by variables or constructs other than our target construct, including systematic and random error. And in both, we introduce construct irrelevant variance into our scores.

Comprehension

Understanding of facts and ideas by organizing, comparing, translating, interpreting, describing, and stating main ideas. "Translate, Interpret, extrapolate"

Application

Using new knowledge to solve problems in new situations (applying acquired knowledge, facts, techniques, and rules in a different way). "Applies changes"

Variability

Variability describes how much scores are spread out or differ from one another in a distribution. Some simple measures of variability are the minimum and maximum, which together capture the range of scores for a variable. Variance and standard deviation are much more useful measures of variability as they tell us how much scores vary; both are defined based on variability around the mean. The amount of variability can help determine how representative a measure of central tendency is. Three common measures of variability are the range, the variance, and the standard deviation.

Proximal Zone of Development

Vygotsky consistently defines the zone of proximal development as the difference between the current level of cognitive development and the potential level of cognitive development. He maintains that a student is able to reach their learning goal by completing problem-solving tasks with their teacher or engaging with more competent peers. Vygotsky believed that a student would not be able to reach the same level of learning by working alone. As a student leaves his zone of current development, he travels through the zone of proximal development towards his learning goal. "What I can do with help" The zone of proximal development consists of two important components: the student's potential development and the role of interaction with others. Learning occurs in the zone of proximal development after the identification of current knowledge. The potential development is simply what the student is capable of learning.

Test Purpose

What constructs are you hoping to measure? For what population/reason?

Truly different methodology

When using multiple methods, one must consider how different the actual measures are. For instance, two self-report measures are not truly different methods, whereas pairing a self-report with an interview scale or a psychosomatic reading would be.

Additional Analysis

Whereas item analysis is useful for evaluating the quality of items and their contribution to a test or scale, other analyses are available for digging deeper into the strengths and weaknesses of items. Two commonly used analyses are:
-Option analysis (also called distractor analysis)
-Differential item functioning (DIF) analysis

Trace Graph

With a trace graph, you get a visual representation of the relationship between an individual item and the scale score; you can see whether there is a positive monotonic relationship between the item and the scale score.
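One hedged sketch of the data behind such a graph, assuming dichotomously scored item responses and a total scale score (all values invented): group examinees by scale score and take the mean item score per group; plotting those means gives the trace line.

```python
# Build the points of a trace line: mean item score within scale-score groups.
# If the item behaves well, these means should rise with the scale score.
import pandas as pd

data = pd.DataFrame({
    "item":  [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1],          # scored 0/1
    "scale": [5, 7, 9, 10, 12, 14, 16, 8, 18, 20, 22, 24],  # total scale score
})

data["score_group"] = pd.qcut(data["scale"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
trace = data.groupby("score_group", observed=True)["item"].mean()
print(trace)   # plot these points against the groups to get the trace line
```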

Inadequate preoperational explication of constructs

You didn't do a good enough job of defining (operationally) what you mean by the construct. Solutions: -think through your concepts better -use methods (e.g., concept mapping) to articulate your concepts -get experts to critique your operationalizations

Restricted Generalizability Across Constructs

You do a study and conclude that Treatment X is effective. In fact, Treatment X does cause a reduction in symptoms, but what you failed to anticipate were the drastic negative consequences of the treatment's side effects. When you say that Treatment X is effective, you have defined "effective" only in terms of the directly targeted symptom. This threat reminds us that we have to be careful about whether our observed effects (Treatment X is effective) would generalize to other potential outcomes: unanticipated effects of your program may make it difficult to say the program was effective.

Stanines

a method of scaling test scores on a 9-point standard scale with a mean of 5 and an SD of 2; used in many statewide testing programs (e.g., CogAT, NeSA)

Standard error of measurement

addresses the amount of variation to be expected within a set of repeated measures of a single individual; it can also be described as the standard deviation of the variation in our measurements.
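Under CTT the usual formula is SEM = SD × sqrt(1 − reliability). A small numeric sketch, with purely illustrative values:

```python
# Standard error of measurement from the observed-score SD and a reliability
# estimate (values below are just illustrative assumptions).
sd_x = 10.0          # standard deviation of observed scores
reliability = 0.91   # e.g., a coefficient alpha estimate

sem = sd_x * (1 - reliability) ** 0.5
print(f"SEM = {sem:.2f}")   # about 3.0 score points

# A rough 95% band around an observed score of 75:
observed = 75
print(f"{observed - 1.96 * sem:.1f} to {observed + 1.96 * sem:.1f}")
```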

Multi-Trait Multi-Method

an approach to examining construct validity. It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures. Developed in part as an attempt to provide a practical methodology that researchers could actually use. Multiple traits are used in this approach to examine (a) similar or (b) dissimilar traits ( constructs), as to establish convergent and discriminant validity between traits. Similarly, multiple methods are used in this approach to examine the differential effects (or lack thereof) caused by method specific variance.

Internal Consistency

assesses how reliably multiple items that are intended to measure the same construct actually do so. A high degree of internal consistency indicates that items meant to assess the same construct yield similar scores. There are a variety of internal consistency measures; usually they involve determining how highly the items are correlated and how well they predict each other. -If we administer one test only once, we no longer have an estimate of stability, and we no longer have a reliability estimate based on a correlation between administrations. Instead, we have an estimate of what is referred to as the internal consistency of the measurement. -This is based on the relationships among the test items themselves, which we treat as miniature alternate forms of the test. The resulting reliability estimate is impacted by error that comes from the items themselves being unstable estimates of the construct of interest.
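As one concrete example, coefficient alpha (a common internal-consistency estimate) can be computed from an item-score matrix; the data below are invented and the calculation is only a bare-bones sketch:

```python
# Coefficient alpha = k/(k-1) * (1 - sum of item variances / total-score variance).
# Rows are examinees, columns are items (0/1 scored here, but any scoring works).
import numpy as np

scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)        # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"alpha = {alpha:.2f}")
```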

Z-scores

based on a standard scale with a mean of 0 and a standard deviation of 1, computed as z = (raw score − mean) / SD; this allows us to describe whether the student performed above or below average (make comparisons)
▪ z-scores can be converted to any scale with a given mean and standard deviation.
▪ Three common examples are T-scores, which have a mean of 50 and a standard deviation of 10; IQ scores, which have a mean of 100 and a standard deviation of 15; and SAT scores, which used to have a mean of 500 and a standard deviation of 110.
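A short sketch of these conversions, using an invented raw score and the means and SDs listed above:

```python
# Convert a raw score to a z-score, then to other common standard-score scales.
raw, group_mean, group_sd = 82, 75, 10

z = (raw - group_mean) / group_sd          # mean 0, SD 1
t_score = 50 + 10 * z                      # T-score: mean 50, SD 10
iq_style = 100 + 15 * z                    # IQ-style: mean 100, SD 15
sat_style = 500 + 110 * z                  # old SAT scale described above

print(z, t_score, iq_style, sat_style)     # 0.7, 57.0, 110.5, 577.0
```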

Continuous

can take on an unlimited number of values between the lowest and highest points of measurement, so values can fall anywhere along a number line (e.g. speed, distance, height, circumference)

Categorical

categories like gender, grade, etc.

Grade equivalents

compared to students in that grade

Age equivalents

compared to students that age

Variance

equals the squared standard deviation, that is the average squared distance from the mean

Analysis

examine and break information into parts by identifying motives or causes. Make inferences and find evidence to support generalizations. "List and break down"

Item Difficulty

how easy or difficult each item is for our sample. -In cognitive testing, we talk about easiness and difficulty, where test takers can get an item correct to different degrees, depending on their ability or achievement. In noncognitive testing, we talk instead about endorsement, or the likelihood of choosing the keyed response, where test takers are more or less likely to endorse an item depending on their level on the trait. -In CTT, item difficulty is simply the mean score for an item. In item response theory (IRT), item difficulty is the predicted ability level at which a test taker has a 50% chance of getting the item correct or endorsing it.
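In CTT this is easy to compute directly: the p-value for each item is just its mean score. A minimal sketch with invented 0/1 responses:

```python
# CTT item difficulty (p-values): the proportion of examinees answering each
# item correctly (or endorsing the keyed response).
import numpy as np

# Rows = examinees, columns = items, scored 0/1.
scored = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
])

p_values = scored.mean(axis=0)
print(p_values)   # [0.8 0.4 0.6] -- higher p means an easier item
```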

Predictive Validity

if the test accurately predicts what it is supposed to predict. For example, the SAT exhibits predictive validity for performance in college. It can also refer to when scores from the predictor measure are taken first and then the criterion data is collected later.

Postdictive

if the test is a valid measure of something that happened before. For example, does a test for adult memories of childhood events work?

Ratio

(intervals with a true zero point) Whatever the variable, to be measured on a ratio scale, four conditions must be met. First, each unique possible value on the scale must mean something different from the remaining values. Second, there must be an order to the scale values, where lower scores indicate lower levels of the variable and higher scores indicate higher levels. Third, the differences between values must mean the same thing across the scale. Fourth, there must be a zero point on the scale, and it must denote an absence of the variable being measured. A ratio scale has a meaningful zero (you can actually weigh 0 pounds), which allows values to be meaningfully multiplied and divided (e.g., the amount of pounds of sugar consumed; money is another example).

Spearman-Brown Prophecy Formula

is a measure of test reliability. It's usually used when the length of a test is changed and you want to see if reliability has increased. -Ratio of the new test length over the old length -Does not assume homogenous content across all items (like coefficient alpha does) -Only assumes homogenous content across both halves -Called the prophecy formula because it predicts the reliability of a measurement from a single administration -For the formula to work properly, the two tests must be equivalent in difficulty. If you double a test and add only easy/poor questions, the results from the Spearman-Brown Formula will be invalid. -Although increasing test items is one way to increase reliability, it's not always possible to do so. For example, doubling the (already lengthy) GRE would lead to examinee fatigue.
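The formula itself is r_new = k * r_old / (1 + (k - 1) * r_old), where k is the ratio of new test length to old test length. A small sketch with illustrative numbers:

```python
# Spearman-Brown prophecy formula.
def spearman_brown(r_old: float, k: float) -> float:
    """Predicted reliability after changing test length by a factor of k."""
    return k * r_old / (1 + (k - 1) * r_old)

# Doubling a test with reliability .70 (purely illustrative values):
print(round(spearman_brown(0.70, k=2.0), 3))   # about 0.824
```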

Standard deviation

is roughly the average distance of scores from the mean (the square root of the variance); the building blocks for measures of variability are the sums of squares (SS); the standard deviation must be interpreted relative to the scale of the variable being considered

Nominal

(no order) Each number takes on the meaning of a verbal label (e.g., the names of people, groups, sex, religious affiliation, binary responses like true/false)

Likert-Type Extra

o Assumptions
▪ A Likert scale assumes that the strength/intensity of an attitude is linear, i.e., on a continuum from strongly agree to strongly disagree, and makes the assumption that attitudes can be measured.
o Response Options
▪ Two main issues arise in the construction of a rating scale for gathering responses. First, the length or number of scale anchors must be determined. An effective rating scale is only as long as necessary to capture meaningful variability in responses.
▪ A second issue in rating scale construction is whether or not to include a central, neutral option. Although examinees may prefer to have it, especially when responding to controversial topics where commitment one way or the other is difficult, the neutral option rarely provides useful information regarding the construct being measured and should be avoided whenever possible.
o Response Format
▪ Issues arise over how many scale points (e.g., 5, 7, or 10) to include. Additional scale points can potentially lead to increases in score variability, but this variability could reflect inconsistent use of the scale (measurement error) rather than meaningful differences between people.
▪ Numerical and visual cues may be more helpful than numbers alone.

Multiple Choice

objective assessment in which respondents are asked to select only correct answers from the choices offered as a list.
-Most flexible of the objective item types
-Stem = the problem
-Has a correct answer and foils/distractors
-Must have three options
o Advantages
▪ Reliability has been shown to improve with larger numbers of items on a test, and with good sampling and care over case specificity, overall test reliability can be further increased.
▪ Multiple choice tests often require less time to administer for a given amount of material than tests requiring written responses.
▪ Because this style of test does not require a teacher to interpret answers, test takers are graded purely on their selections, creating a lower likelihood of teacher bias in the results.
o Issues/Disadvantages
▪ The most serious disadvantage is the limited types of knowledge that can be assessed by multiple choice tests. Multiple choice tests are best adapted for testing well-defined or lower-order skills.
▪ Possible ambiguity in the examinee's interpretation of the item.
▪ Even if students have some knowledge of a question, they receive no credit for knowing that information if they select the wrong answer and the item is scored dichotomously.
▪ A student who is incapable of answering a particular question can simply select a random answer and still have a chance of receiving a mark for it.

Alternate Forms Reliability

occurs when an individual participating in a research or testing scenario is given two different versions (forms) of the same test, either in one sitting or at different times. The scores on the two forms are correlated, and a reliability coefficient is calculated; a test is deemed reliable if differences in one form's observed scores correspond to differences in the equivalent form's scores. -To eliminate issues, the two forms should have identical instructions, numbers of items, and other core elements; the only difference should be the items themselves. -The forms should be administered close together in time. -Practice and transfer effects can be addressed by counterbalancing: half the subjects take form A followed by form B, and half take form B followed by form A. Although it may seem strange for subjects to take two different tests instead of one, remember that you are assessing reliability here, not subject performance. Once you've determined the forms are reliable, you can administer form A or form B to a subject, knowing the two forms are equivalent in every way.

Pearson Product Moment

Pearson's r is attenuated (i.e., made smaller) by unreliable measures. The correction for attenuation gives the correlation that would be observed had the two measures been perfectly reliable; since perfect reliability may not be attainable, it is also possible to correct the correlation using reliability values that are "more reasonable," but not perfect. For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Other assumptions include linearity and homoscedasticity: linearity assumes a straight-line relationship between the two variables, and homoscedasticity assumes that the data are equally distributed about the regression line.
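A quick numeric sketch of the correction for attenuation, r_corrected = r_xy / sqrt(r_xx * r_yy), using assumed (not estimated) values:

```python
# Correction for attenuation with illustrative numbers.
r_xy = 0.45                  # observed correlation between test and criterion
rel_x, rel_y = 0.81, 0.64    # assumed reliability estimates for the two measures

r_corrected = r_xy / (rel_x * rel_y) ** 0.5
print(f"disattenuated r = {r_corrected:.3f}")   # 0.45 / 0.72 = 0.625
```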

Social desirability

refers to a tendency for examinees to respond in what appears to be a socially desirable or favorable way. Examinees tend to under-report or de-emphasize constructs that carry negative connotations, and over-report or overemphasize constructs that carry positive connotations. Social desirability can be reduced by reducing insight, encouraging immediate response, and limiting the use of contexts that have obvious negative or positive connotations.

Normalized scores

scores that are related to the norm group used when the test was standardized (normed)

Item Analysis

test items make up the most basic building blocks of an assessment instrument. Item analysis lets us investigate the quality of these individual building blocks, including how well they contribute to the whole and improve the validity of our measurement. It covers:
-Piloting
-Scoring: item difficulty
-Scoring: item discrimination
-Internal consistency
-Additional analysis

Likert-Type

the Likert scale is a five- (or seven-) point scale used to allow the individual to express how much they agree or disagree with a particular statement. A Likert scale assumes that the strength/intensity of an attitude is linear, i.e., on a continuum from strongly agree to strongly disagree, and makes the assumption that attitudes can be measured.
o Advantages
▪ Does not expect a simple yes/no answer from the respondent, but rather allows for degrees of opinion, and even no opinion at all.
▪ Quantitative data are obtained, which means the data can be analyzed with relative ease.
o Issues/Disadvantages
▪ The validity of Likert-scale attitude measurement can be compromised by social desirability.
▪ The Likert scale is unidimensional and only gives 5-7 options of choice, and the space between each choice cannot possibly be equidistant; therefore, it fails to measure the true attitudes of respondents.

Interaction of Different Treatments

the interaction of treatments is responsible for the outcome, so the outcome is misattributed to one treatment instead of the combination. In this case you can't say the outcomes accurately represent the construct.

Percentile ranks

the percentage of a norm group below a given score

Range

the distance between the highest and lowest scores in a set of data (maximum minus minimum); it is not the best indication of variability, because knowing the range alone does not tell us how the scores are distributed within it

Raw scores

the raw score of number correct without any changes made

Spearman Rank-Order

when the assumptions of Pearson's r cannot be met, the Spearman rs can be used. This method uses ranks of scores instead of the scores themselves. A rank is a number given to a score that represents its order in a distribution. For example, in a set of 10 scores, the highest score receives a rank of 1, the fifth score from the top receives a rank of 5, and the lowest receives a rank of 10. The assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable must be monotonically related to the other variable.
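A brief illustration, assuming SciPy is available: on data that are perfectly monotonic but not linear, Spearman's rho reaches 1.0 while Pearson's r does not.

```python
# Compare Pearson's r and Spearman's rho on the same invented data.
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 4, 9, 16, 25, 36, 49, 64]   # monotonically related to x, but not linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")   # rho = 1.0 exactly
```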

Concurrent Validity

when the predictor and criterion data are collected at the same time. It can also refer to when a test replaces another test (i.e. because it's cheaper). For example, a written driver's test replaces an in-person test with an instructor. indicates the amount of agreement between two different assessments

CTT Review

X = T + E. CTT gives us a model for the observed total score X. This model decomposes X into two parts: truth T and error E. The true score T is the construct we're intending to measure, and we assume it plays some systematic role in causing people to obtain observed scores on X. The error E is everything randomly unrelated to the construct we're intending to measure; error also has a direct impact on X.
It should be apparent that CTT is a relatively simple model of test performance. The simplicity of the model brings up its main limitation: the score scale is dependent on the items in the test and the people taking the test. The results of CTT are said to be sample dependent because (1) any X, T, or E that you obtain for a particular test taker only has meaning within the test she or he took, and (2) any item difficulty or discrimination statistics you estimate only have meaning within a particular sample of test takers. So, the person parameters are dependent on the test we use, and the item parameters are dependent on the test takers.
A second major limitation of CTT results from the fact that the model is specified using total scores. Because we rely on total scores in CTT, a given test only produces one estimate of reliability and, thus, one estimate of SEM, and these are assumed to be unchanging for all people taking the test. The measurement error we expect to see in scores would be the same regardless of level on the construct. This limitation is especially problematic when test items do not match the ability level of a particular group of people. For example, consider a comprehensive vocabulary test covering all of the words taught in a fourth grade classroom. The test is given to a group of students just starting fourth grade, and another group who just completed fourth grade and is starting fifth. Students who have had the opportunity to learn the test content should respond more reliably than students who have not. Yet the test itself has a single reliability and SEM that would be used to estimate measurement error for any score. Thus, the second major limitation of CTT is that reliability and SEM are constant and do not depend on the level of the construct.
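A toy simulation of the model, with invented parameters, shows how reliability can be read as the ratio of true-score variance to observed-score variance:

```python
# Simulate X = T + E: fixed true scores plus random error unrelated to T.
import numpy as np

rng = np.random.default_rng(0)

n_people = 1000
true_scores = rng.normal(loc=50, scale=10, size=n_people)   # T
error = rng.normal(loc=0, scale=5, size=n_people)           # E
observed = true_scores + error                               # X = T + E

reliability = true_scores.var() / observed.var()
print(f"reliability ~ {reliability:.2f}")   # close to 100 / (100 + 25) = 0.80
```

Because the error SD is the same for everyone in this simulation, the single reliability and SEM it implies apply to all examinees alike, which is exactly the limitation described above.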

