psych testing exam 2 (ch. 4)


EXAM ALERT: Practice effects are mainly a concern for which type of reliability determination? A. test-retest B. split-half C. Kuder-Richardson D. alternate form

A. test-retest

EXAM ALERT: A numerical summary of the relationship depicted in a bivariate distribution is called a ___. A. scattergram B. linear regression C. standard error of estimate D. correlation coefficient

D. correlation coefficient

EXAM ALERT: Which of these correlation coefficients indicates the weakest relationship? A. r = -.199 B. r = +.60 C. r = -.79 D. r = +.007

D. r = +.007

factors affecting correlation coefficients (r)

definition: a number between −1 and +1 calculated to represent the linear dependence of two variables or sets of data. Four factors affect r:
1. linearity (curvilinearity)
2. heteroscedasticity
3. relative (not absolute) position
4. group heterogeneity (a big issue)
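Since r is used throughout this chapter, a minimal sketch of how a Pearson r is actually computed may help; the data below are made up for illustration. Note that the calculation runs on z-scores, which is exactly why r reflects relative position rather than absolute score.

```python
import numpy as np

# Hypothetical scores for 10 examinees on two measures.
x = np.array([12, 15, 9, 20, 17, 11, 14, 19, 8, 16])
y = np.array([58, 63, 51, 72, 66, 55, 60, 70, 49, 62])

# Pearson r is the mean product of standardized (z) scores,
# so only each person's standing relative to the group matters.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r = (zx * zy).mean()

print(round(r, 3))                        # hand-rolled r
print(round(np.corrcoef(x, y)[0, 1], 3))  # same value from numpy
```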

validity vs reliability deals with what...

validity - deals with what a test measures (does the test measure what it's supposed to measure?) reliability - deals only with the consistency of the measure, regardless of what it is measuring (reliability suggests trustworthiness: are the scores trustworthy?)

* A measure may be reliable without being valid!!! * A test cannot be valid unless it's reliable


3. correlation is a matter of *relative position*, not absolute score - r is computed from standardized (z) scores, so only each person's standing relative to the group matters.


correlations

- most of the correlations encountered in psych testing are Pearson correlations

reliability part 2

-*A distinction must be made between real change in the trait being measured and fluctuations in scores attributable to fleeting changes in personal circumstances, the "luck of the draw" in which form of a test is taken, or differences due to who scores the test.* -Real changes in the trait being measured aren't considered a source of unreliability; the other factors usually are.

reliability part 3

-A *constant error* is one that leads to a person's score being systematically high or low. -This is different from constancy in the person's status on the trait being measured. (Example: testing a child in English who is ESL. Intelligence will be underestimated, and the underestimate will be constant no matter when the child is tested.) -Test-savvy Jessica knows how to detect clues to the right answer but doesn't know much about the subject she is being tested on. She will get a higher score, and will do so regardless of when she is tested. -Reliability doesn't account for these constant errors; it deals only with *unsystematic errors*. -Constant errors aren't literally constants, but tendencies to move scores in a certain direction.

1. test scoring (part 2)

-A lack of agreement among scorers may result in unsystematic variation in a person's test scores. -Machine scoring of "choice" items generally eliminates such variation, although even machine scoring may not be completely error free. -The more judgment required in scoring, the more worrisome this potential source of unreliability becomes. -When judgment is required, the goal is to have scoring directions that are sufficiently clear and explicit that scorer variation is minimized.

3. test administration conditions

-A test should have standardized procedures for its administration, including such factors as the directions, time limits, and physical arrangements. -However, controlling every variable during administration is impossible, and uncontrolled variables will influence test scores. (Noises in the hallway outside the testing room; less-than-ideal lighting conditions; with a 20-second time limit, one administrator may give 21 seconds while another gives 19.)

what's even more important than reliability? validity.

-Although a test with no reliability cannot have any validity, it is possible to have a highly reliable test that is not valid for the purpose we have in mind. -A test with moderate reliability and moderate validity is preferable to a test with high reliability and low validity.

3. coefficient alpha (Cronbach's α)

-Does not require dichotomously scored items; items can have any type of continuous score. For example, items on an attitude scale might be on a 5-point scale, ranging from strongly disagree (1) to strongly agree (5). -Very widely used in contemporary testing. -Each item may be thought of as a mini-form of the test. -All of this item information is summed into a single measure of internal consistency reliability.
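A minimal sketch of the coefficient alpha computation on made-up 5-point attitude ratings, using the standard formula α = (k/(k−1))(1 − Σ item variances / variance of total scores):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = examinees, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point attitude ratings: 6 examinees x 4 items.
ratings = [[4, 5, 4, 5],
           [2, 1, 2, 1],
           [3, 3, 4, 3],
           [5, 5, 5, 4],
           [1, 2, 1, 2],
           [3, 4, 3, 4]]
print(round(cronbach_alpha(ratings), 3))
```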

2. inter-rater reliability (part 2)

-In some studies, more than 2 scorers are used; it's possible to compute correlations for all possible pairs of raters and then average them. -The more appropriate way to analyze this is the *intraclass correlation coefficient (ICC)*, computed from the mean squares of an ANOVA and interpreted like the familiar Pearson correlation coefficient (r). -Addresses only unsystematic error arising from variation in scorers; it doesn't provide info on any other source of error. -Particularly important when judgment enters into the scoring process.
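The card doesn't say which ICC variant is intended, so here is a sketch of one common choice, ICC(2,1) from Shrout & Fleiss's two-way random-effects model, built directly from the ANOVA mean squares; the rating data are invented.

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1), two-way random effects, single rater (Shrout & Fleiss).
    x: 2-D array, rows = subjects (n), columns = raters (k)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical: 6 essays each scored by 3 raters.
scores = [[7, 8, 7], [5, 5, 6], [9, 9, 8], [4, 5, 4], [6, 7, 6], [8, 8, 9]]
print(round(icc_2_1(scores), 3))
```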

special issues in reliability:

-Interpretive (narrative) reports: need to express limited reliability; they may give the impression reliability is not an issue, but it's ALWAYS an issue. -Subscores and individual items: often very limited reliability; one cannot assume that item clusters or individual items have the same reliability as total scores on the test. -Profiles: often the basis for test interpretation; complex, related to the SE of the difference, and the problem is even greater when a profile of 3 or more scores is the basis.

conceptual framework (graphic 2) (slide 39)

-It's convenient to think of a person's true score as the average of many observed scores. Test A: the test is very reliable; observed scores cluster tightly around the true score T. Test B: the test is not very reliable; observed scores scatter widely around the average or true score T. The difference b/w any observed score O and T in these distributions is the error E in the measurement. It is usually assumed that the observed scores are normally distributed around the true score. -We never know a person's true score, we only have an observed score.

how high is high enough?

-No easy answer; it depends on the purpose and the sources of unreliability. -Some guidelines: for important individual decisions, .95 is the goal (e.g., EPPP; forensic classification of ID); when combined with other info, .80 minimum; use with considerable caution, .70 minimum, and definitely with other info; for research on group differences, .60 minimum. -Cautions: reliability is always important, and info that's not reliable shouldn't be used; a short test is not an excuse to allow .60 to .70; statistical significance is not sufficient - we have higher standards than simple statistical significance.

conceptual framework (graphic 1) (slide 38)

-Panel A shows a test for which the true variance represents only about half of the observed variance; the remainder is error variance. -Panel B shows a test for which the error variance is a relatively small fraction of the total observed variance; most of the variance is true variance...in other words... The test in Panel B has MUCH BETTER RELIABILITY than the test in Panel A

1. split-half reliability

-Recall the alternate forms method: 2 forms administered in immediate succession. -Now we do a single test administration but score it in halves, as if each half were an alternate form, and correlate the scores on the 2 halves like mini alternate forms. -The test is usually NOT split into 1st and 2nd halves: the 2nd half may be more difficult, examinees are more fatigued by then, and any effect of timing would fall on the 2nd half. -How, then? Divide the test into odd-numbered and even-numbered items = odd-even reliability. -The correlation b/w the two halves gives the reliability of only a half-length test, so a correction must be applied to yield the reliability of the full-length test (*the Spearman-Brown correction*), as sketched below.
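A minimal odd-even split with the Spearman-Brown correction for doubling test length (r_full = 2·r_half / (1 + r_half)); the 0/1 item matrix is fabricated.

```python
import numpy as np

# Hypothetical item matrix: rows = examinees, columns = items (1 = correct).
items = np.array([[1, 1, 0, 1, 1, 0, 1, 1],
                  [0, 1, 0, 0, 1, 0, 1, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1],
                  [0, 0, 0, 1, 0, 0, 0, 1],
                  [1, 0, 1, 1, 1, 0, 1, 1],
                  [0, 1, 0, 0, 0, 1, 0, 0]])

odd = items[:, 0::2].sum(axis=1)   # odd-numbered items (1st, 3rd, 5th, ...)
even = items[:, 1::2].sum(axis=1)  # even-numbered items (2nd, 4th, 6th, ...)

r_half = np.corrcoef(odd, even)[0, 1]  # correlation b/w the two half-tests
r_full = 2 * r_half / (1 + r_half)     # Spearman-Brown corrected reliability
print(round(r_half, 3), round(r_full, 3))
```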

conceptual framework: true score theory (part 2)

-Relationships among observed, true, and error scores: O = T + E, T = O − E, E = O − T. -Notice that the error score can be positive (+) or negative (−).
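The O = T + E idea is easy to see in a tiny simulation: give one person a fixed (hypothetical) true score, add random unsystematic error, and watch many observed scores average out to T, as the conceptual framework card describes.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100.0                          # hypothetical true score
E = rng.normal(0, 5, size=10_000)  # unsystematic errors, mean 0, can be + or -
O = T + E                          # observed scores: O = T + E

print(round(O.mean(), 2))  # ~100: errors cancel out over many testings
print(round(O.std(), 2))   # ~5: spread of observed scores around T
```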

two additional topics

-Reliability in IRT: the standard error SE(θ) in IRT is often referred to as an index of the precision of measurement; it is determined specifically (and can differ) for each score level. -Generalizability theory: an attempt to assess many sources of unreliability simultaneously; ANOVA provides the basic framework.

conceptual framework: true score theory (part 1)

-Reliability of tests may be formulated within 3 somewhat different theoretical contexts: classical test theory (CTT), item response theory (IRT), and generalizability theory (GT). -The great majority of the reliability information currently encountered in test manuals, professional journals, and test score reports relies on CTT, so this chapter concentrates on that framework. Key terms: -True score (what we want to know): the score a person would get if all sources of unreliability were removed or cancelled. -Obtained/observed score (actual raw score): may be affected + or − by various sources of unreliability (e.g., lucky guesses, tired test takers). -Error score (junk, noise): the difference b/w true score and observed score; may be + or −. It is the summation of all the unsystematic influences on a person's score that were considered under sources of unreliability.

3. Alternate Forms Reliability (parallel form or equivalent form reliability)

-Requires that there be 2 forms of a test, which should be the same or similar in terms of # of items, time limits, content specifications, etc. -Administer both forms to the same examinees. Reliability is the correlation, usually Pearson, b/w scores obtained from the 2 forms. The forms may be administered in immediate succession; otherwise intervals are the same as for test-retest, and for longer tests the interval can be a few days to a few weeks. (This method measures reliability d/t content sampling and changes in personal & admin conditions.) -Not used very much b/c most tests don't have 2 forms. It's hard enough to build 1 good test, let alone 2. Primarily used for widely used tests.

4. personal conditions

-Temporary conditions of examinees can have unsystematic influences on test scores. -Feeling under the weather? You may get a somewhat lower score; tested later after feeling better, you could garner a few extra points. -In a bad mood? Your score on a personality measure could be somewhat different than if you were mellowed out. -There is no difference in the person's status on the trait being measured, but the temporary personal condition has influenced the scores - a source of unreliability.

3 important conclusions

-Test length is important!! In general, the longer the test, the more reliable it will be. Very short tests are often unreliable; to increase reliability, increase test length. -Reliability is maximized when the percentage of examinees responding correctly (in a cognitive ability test) or responding in a certain direction ("yes," in a noncognitive test) is near .50: pq is at a maximum when p = .50 and decreases as p moves away from .50. This is why cognitive tests seem hard - the developer is trying to maximize reliability. -Correlation among items is important: to get good internal consistency reliability, use items measuring a well-defined trait. (See the sketch below for the pq point and the test-length point.)
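Two quick numerical checks of the points above (the values are arbitrary): pq peaking at p = .50, and the general Spearman-Brown formula showing reliability rising as a test is lengthened.

```python
# Item variance pq is maximized at p = .50.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p, round(p * (1 - p), 2))   # .09 .21 .25 .21 .09

# Lengthening a test: general Spearman-Brown, r_new = n*r / (1 + (n-1)*r).
r = 0.60                # reliability of the original test (hypothetical)
for n in [1, 2, 3, 4]:  # n = factor by which the test is lengthened
    print(n, round(n * r / (1 + (n - 1) * r), 2))   # .60 .75 .82 .86
```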

reliability part 1

-Test reliability has a more technical and quantitative meaning. -Synonyms: consistency, replicability, stability, and dependability. -Reliable tests are ones that consistently yield the same or similar score for an individual. -The score can be replicated, at least within a certain margin of error. -We can depend on the test to yield much the same score for an individual. (Lack thereof implies inconsistency and imprecision, both of which are equated with measurement error!)

1. curvilinearity (slide 24)

-The Pearson correlation coefficient, which is by far the most widely used type of correlation, accounts only for the degree of linear relationship b/w 2 variables. If there is some degree of nonlinearity, the Pearson correlation will *underestimate* the true degree of relationship. A Pearson correlation will account for the linear part of the relationship, as shown by the straight line, but not for the nonlinear trend which is shown by the curved line.
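A quick demonstration of the underestimate: data that follow a perfect inverted-U have a Pearson r near zero even though the relationship is perfectly predictable.

```python
import numpy as np

x = np.linspace(0, 10, 101)
y = -(x - 5) ** 2 + 25   # perfectly curvilinear (inverted U), no noise at all

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # ~0: Pearson r misses the strong nonlinear relationship
```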

standard error of measurement (SEM) ch.4 slide 55

-The standard deviation of a hypothetically infinite number of obtained scores around the person's true score. -The SEM can be used to create a confidence interval, which in testing parlance is sometimes called a "confidence band" around the observed score.
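A minimal sketch of the SEM and a 95% confidence band, using invented values (SD = 15, reliability = .91) and the standard formula SEM = SD·√(1 − reliability):

```python
import numpy as np

sd = 15.0           # test's standard deviation (hypothetical, IQ-style scale)
reliability = 0.91  # hypothetical reliability coefficient

sem = sd * np.sqrt(1 - reliability)  # standard error of measurement = 4.5
observed = 106

# 95% confidence band around the observed score.
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 2), (round(low, 1), round(high, 1)))
# Note: as reliability -> 0, SEM -> sd; as reliability -> 1, SEM -> 0.
```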

4. (group heterogeneity) effect of range restriction on r (slide 27)

-The standard deviation or variance defines a group's variability. In this context, variability is often called heterogeneity (difference); its opposite is homogeneity (sameness). -If we calculate r for the very *heterogeneous group* included within frame A, we get a very high r. If we calculate r for the more homogeneous group included within frame C, we get a much lower r. For cases in frame B, we get an intermediate value of r. -The example is somewhat artificial b/c it suggests restricting range simultaneously on both X and Y variables. In practice, *range is usually restricted on just one variable*, but that has the same kind of effect as illustrated here.
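A simulation of the range-restriction effect: the same underlying relationship yields a clearly lower r once the group is made more homogeneous by keeping only the top third on X. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a heterogeneous group with a true correlation near .70.
x = rng.normal(0, 1, 5_000)
y = 0.7 * x + rng.normal(0, np.sqrt(1 - 0.7 ** 2), 5_000)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict range on X only (keep the top third of the group).
keep = x > np.quantile(x, 2 / 3)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))  # restricted r is clearly lower
```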

(Look at graphs on slide 16 of ch.4 notes)

-The value of r can range from -1.00 to +1.00. PANEL A: An r of +1.00 represents a perfect positive linear relationship b/w 2 variables. PANEL C: An r of -1.00 represents a perfect negative linear relationship. PANEL B: An r of .00 represents lack of relationship b/w the 2 variables

1. test scoring (part 1)

-Variation in test scoring as a source of unreliability is one of the easiest to understand. -Concern for differences in scores from one scorer to another - even on simple tests like spelling or arithmetic computation- was a major force in the development of multiple-choice items in achievement and ability testing. -Consider how much more variation there may be in scoring responses to open ended questions in an individually administered intelligence test, a scale for rating creativity, or a projective test of personality. -Many intelligence tests have vocabulary items and the administrator says a word and the examinee is to provide an acceptable definition.

2. test content

-Variations in the sampling of items in a test may result in unsystematic error in test scores. -Example: 2 students prepare for a four-question essay exam covering 6 chapters of history. Student 1 concentrates on the first 4 chapters and skims the last 2; Student 2 skims the first 2 chapters and concentrates on the last 4. If 3 of the 4 essay questions end up being pulled from the last 4 chapters, how does this affect the students' scores? What if they were pulled from the first 4 chapters? -The individuals' scores differ not b/c of differences in the trait being measured, but b/c of random variation in the particular set of items presented in the test.

actual Y scores around Y′ (slide 22)

-We assume a normal distribution of the Y's for each value of X all along the prediction line. -The standard deviation of these Y's is called the standard error of estimate (or standard error of prediction).

2. heteroscedasticity (slide 25)

-We assume that Y scores are normally distributed around any predicted Y score (Y′) and that the degree of scatter is equal at any point along the prediction line; this is known as *homoscedasticity* (equal scatter). -It is also possible for the bivariate distribution to display *heteroscedasticity* (different scatter). -Notice the data points cluster tightly around the trend line in the lower part of the distribution, but the points scatter more widely toward the top. In this case, the *standard error is not equal throughout the range of the variables*, although we calculate it as if it were.

regression graphic (slide 21)

-We use the line to predict status on Y from status on X. -It's important to note that the higher the r is, the less the scatter. The lower the r is, the more the scatter.

Review of statistics

-bivariate distribution - relationship b/w 2 variables (also known as a scattergram) -correlation coefficient - provides a numerical summary of the relationship depicted in a bivariate distribution

2. inter-rater reliability (part 1)

-Inter-scorer reliability (also called inter-observer or inter-rater reliability) assesses unsystematic variation d/t who scores the test; "who" usually means 2 different people. -A test taken by a group of examinees is scored twice. The inter-scorer reliability is simply the correlation b/w the scores obtained by the 1st and 2nd scorers. -It's important that the 2 scorers work independently, so that neither is influenced or inclined to assign the same or similar scores, which would inflate the resulting reliability coefficient.

2. Kuder-Richardson (KR-20)

-A measure of internal consistency reliability for measures with dichotomous items (e.g., yes/no, right/wrong). -Similar to a split-half correlation, but it can be thought of as the mean split-half correlation when the test is divided in two in all possible ways. (This is not exactly true but makes for a good mental image of what is going on.)
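A minimal KR-20 sketch on a fabricated 0/1 item matrix, using the standard formula KR-20 = (k/(k−1))(1 − Σpq / variance of total scores); p·q and the total variance are both computed in population form so they stay consistent.

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomously scored (0/1) items.
    items: 2-D array, rows = examinees, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)               # proportion passing each item
    pq = (p * (1 - p)).sum()             # sum of item variances
    total_var = items.sum(axis=1).var()  # variance of total scores (population form)
    return (k / (k - 1)) * (1 - pq / total_var)

items = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1],
                  [0, 0, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0]])
print(round(kr20(items), 3))
```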

4. internal consistency reliability

-One of the most frequently used methods of expressing test reliability. Three varieties: 1. split-half 2. Kuder-Richardson 3. coefficient alpha -All of these methods attempt to measure a common characteristic: the test's internal consistency.

standard errors: 3 types ch.4 slide 58

1. Standard error of measurement: the SD of a hypothetical population of observed scores distributed around the true score for an individual person. 2. Standard error of the mean: the SD of a hypothetical population of sample means (for samples of a given size) distributed around the population mean. Used for tests of significance (t-tests, z-tests) and for confidence intervals for means of samples. 3. Standard error of the estimate: the SD of actual Y scores around the predicted Y scores when Y is predicted from X.
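For contrast, a compact sketch computing all three standard errors from invented values; each answers a different question even though all are called "standard errors."

```python
import numpy as np

sd, n, reliability, r_xy = 15.0, 100, 0.91, 0.60  # hypothetical values

se_measurement = sd * np.sqrt(1 - reliability)  # observed scores around a true score
se_mean        = sd / np.sqrt(n)                # sample means around the population mean
se_estimate    = sd * np.sqrt(1 - r_xy ** 2)    # actual Y around predicted Y (sd here = SD of Y)

print(round(se_measurement, 2), round(se_mean, 2), round(se_estimate, 2))  # 4.5 1.5 12.0
```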

Methods for Determining Reliability 1. test-retest reliability (part 1)

1. Test-retest reliability: -Obtained by administering the same test to the same individuals on two separate occasions, typically one day to one month apart. -The reliability coefficient is simply the correlation (usually Pearson) b/w scores on the 1st and 2nd administrations; often called a stability coefficient. -Helps assess the influence of changes in personal conditions. -Does not address the influence of changes in test content, since exactly the same test is used.

major sources of unreliability (four)

1. test scoring 2. test content 3. test administration conditions 4. personal conditions

EXAM ALERT: Which of the following is NOT a major source of unreliability? A. test selection B. test scoring C. test administration conditions D. test content

A. test selection

EXAM ALERT: What is the effect of "constant errors" on test reliability? A.They increase reliability. B. They decrease reliability. C. They have no effect on reliability. D. They increase reliability for low scores but decrease it for high scores.

C. They have no effect on reliability.

EXAM ALERT: Which is NOT one of the internal consistency methods of determining reliability? a. coefficient alpha b. split-half c. odd-even d. inter-scorer

d. inter-scorer

EXAM ALERT: a test must be ____ in order to be ____ a. short...reliable b. machine-scored...reliable c. published...useful d. reliable...valid

d. reliable...valid

EXAM ALERT: As test reliability approaches 0 (zero), the standard error of measurement approaches ___. a. infinity b. 0 (zero) c. the test's mean d. the test's standard deviation

d. the test's standard deviation

Methods for Determining Reliability 1. test-retest reliability (part 2)

drawbacks: -Doesn't take into account unsystematic error d/t variations in test content. -For longer tests it's a nuisance to obtain test-retest reliability. -Concerns about the effects of the first test on the second: examinees recall responses, look up answers in b/w administrations, and generally know more about "what to expect." -Timing is a concern: the interval should be long enough that the first test has minimal influence, but not so long that the measured trait undergoes real change.

