Psy 370 Exam 2
Heterotrait heteromethod
Different trait, different method; discriminant validity: a measure should correlate weakly with measures of different traits collected by different methods (these should be the lowest correlations in the matrix)
What will decrease the SEM?
Higher reliability and a smaller standard deviation (both follow from SEM = σ√(1 − rxx))
Multitrait-multimethod correlation matrix:
presents all the correlations between a group of different traits, attributes, or constructs, each measured by two or more different measurement methods or tests.
Effect sizes r and r^2
r: .1 = small, .3 = medium, .5 = large; r²: 1% = small, 9% = medium, 25% = large
When do we use the predictive method?
-Used when it is important to show a relationship between test scores and a future behavior, i.e., when the test scores and the criterion scores have a strong relationship -Appropriate for validating employment tests
Heterotrait-monomethod
(different trait, same method) method variance: variance due to method can be detected by seeing if the different-trait, same-method correlations are stronger than the different-trait, different-method correlations.
monotrait-heteromethod
(same trait, different method) convergent validity: measures the same trait but in different ways. The validity diagonal (monotrait-heteromethod correlations) contains the correlations of the same traits measured by different methods
In what situations, do you use evidence based on content v evidence based on relations with external criteria v evidence based on relations with other constructs?
-Evidence based on content: for tests that measure concrete attributes (observable and measurable behaviors) -Evidence based on relations with external criteria: for tests that predict outcomes -Evidence based on relations with other constructs: for tests that measure abstract constructs
Test-retest and formula
-Extent to which an individual obtains a similar score on different administrations
Process: administer the test to the same group on two different occasions
Objective: assess the stability of the test over time
Analysis: scores from the first and second administrations are compared/correlated
-Assumes test takers have not changed between the 1st and 2nd administrations in terms of the skill/quality being measured
Formula: Pearson product-moment correlation
Coefficient Alpha
-For tests that have more than two possible answers (rating scales) -May also be used for scales made up of questions with only one right answer
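A minimal sketch of coefficient (Cronbach's) alpha, α = (k/(k−1))(1 − Σ item variances ÷ total-score variance), using a made-up matrix of rating-scale responses (rows = test takers, columns = items):

```python
# Coefficient alpha from a small, hypothetical response matrix.
def cronbach_alpha(scores):
    k = len(scores[0])                      # number of items
    totals = [sum(row) for row in scores]   # each test taker's total score

    def variance(xs):                       # sample variance (n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))

# Hypothetical 5-point rating-scale responses from four test takers:
responses = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 4, 3]]
print(round(cronbach_alpha(responses), 3))
```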
Coefficient of determination
-Helps us determine the amount of variance shared by the test and the criterion -Obtained by squaring the validity coefficient, so r = .3 → r² = .09 -Larger coefficients represent a stronger relationship, or greater overlap, between test and criterion -"What amount of variance do the test and criterion share?"
Face validity
-How test takers perceive the attractiveness and appropriateness of a test -Tells us nothing about what a test really measures -Might influence the test taker's approach to the test -Though non-statistical, it may be what motivates a respondent to take the survey and take it seriously
Construct validity
-Involves accumulating evidence that scores relate to observable behaviors in the ways predicted by the underlying theory -Because all tests measure one or more constructs, the traditional term "construct validity" is synonymous with "validity"
Demonstrating evidence of validity after test development
-Involves examining the extent to which experts agree on the relevance of the content of test items -Experts review and rate how essential test items are to the attribute measured -We calculate Lawshe's (1975) Content Validity Ratio to determine agreement among experts. The more experts, the lower the minimum value
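A minimal sketch of the CVR computation, using Lawshe's formula CVR = (nₑ − N/2)/(N/2), where nₑ = the number of experts rating an item "essential" and N = the total number of experts (the panel sizes below are hypothetical):

```python
# Lawshe's (1975) Content Validity Ratio for a single test item.
def content_validity_ratio(n_essential, n_total):
    return (n_essential - n_total / 2) / (n_total / 2)

print(content_validity_ratio(9, 10))   # 0.8 -> strong expert agreement
print(content_validity_ratio(5, 10))   # 0.0 -> only half call it essential
```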
Classical Test Theory
-No instrument is perfectly reliable or consistent -All test scores contain some error
X = T + E, where X = observed score, T = true score, E = random error
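A minimal simulation of X = T + E with hypothetical true scores and random error; the resulting reliability equals true-score variance over observed-score variance (see "Reliability (statistically speaking)" later in the deck):

```python
# Simulate X = T + E and recover reliability as var(T) / var(X).
import random

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]   # T
errors      = [random.gauss(0, 5) for _ in range(10_000)]      # E, mean 0
observed    = [t + e for t, e in zip(true_scores, errors)]     # X = T + E

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(true_scores) / var(observed))   # ~ 15^2 / (15^2 + 5^2) = 0.9
```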
Establishing a test format
-Written test: a paper-and-pencil test in which a test taker must answer a series of questions. -Practical test: requires a test taker to actively demonstrate skills in specific situations. -Instructional objectives: guides students' learning of the terms and concepts of reliability/precision and validity.
Test users must
-be aware of test bias and how it affects test validity -ensure tests are valid for minority populations -ensure tests are free of questions requiring a specific cultural background -use appropriate norm groups for minorities
Test takers
-fatigue -illness -exposure to test questions before the test -not providing truthful and honest answers
Standard error of measurement? What does it allow us to do?
-An index of the amount of uncertainty or error expected in an individual's observed test score, i.e., how much the individual's observed test score might differ from the individual's true test score
-It allows us to quantify the amount of variation in a person's observed score that measurement error would most likely cause
SEM = σ√(1 − rxx), where σ = the standard deviation of one administration of the test scores and rxx = the reliability coefficient
What are the 5 sources of validity evidence?
1) Evidence based on test content: the extent to which items are representative of the construct the test measures (generally non-statistical; based on judgment by SMEs)
2) Evidence based on response processes: involves observing test takers as they respond to the test, or interviewing them when they complete the test, to understand the mental processes they use to respond
3) Evidence based on internal structure (previously construct validity): involves examining whether the conceptual framework used in test development can be demonstrated using appropriate analytical techniques
4) Evidence based on relations with other variables (previously criterion-related/construct validity): correlating test scores with other measures to see if they are related; the extent to which test scores are systematically related to relevant criteria; involves accumulating evidence that the test is based on sound psychological theory (a construct is any concept or characteristic that a test is designed to measure)
5) Evidence based on the consequences of testing: involves determining whether the correct inferences are being made based on the interpretation of the test scores
Test administration
-not following administration instructions -disturbances during the test period -answering test takers' questions inappropriately -extremely cold or hot room temperature
Test scoring
-not scoring according to instructions -inaccurate scoring -errors in judgment -errors calculating test scores
Test itself
-poorly designed -trick questions -ambiguous questions -poorly written questions -reading level higher than the reading level of target population
Test publishers should
-prevent test misuse by making test manuals and validity information available / accessible before purchase -refuse to provide test materials to persons who do not have testing credentials or who are likely to misuse the tests
Linear regression
-statistical procedure for predicting performance on a criterion using one set of test scores Y′ = a + bX where Y′ = the predicted score on the criterion a = the intercept b = the slope X = the score the individual made on the predictor test
Validity coefficient
-The resulting correlation coefficient when two sets of scores are correlated, usually test scores and criterion scores -A statistic used to infer the strength of the evidence of validity that the test scores might demonstrate in predicting job performance -Denoted rxy
Gathering Theoretical Evidence
1) List associations the construct has with other constructs 2) Propose one or more hypotheses using the test as an instrument for measuring the construct -Establish a nomological network -Propose an experimental hypothesis
What are the four sources of error?
1) test itself 2) test administration 3) test scoring 4) test takers
Factors that Influence Reliability
1) test length 2) homogeneity of questions 3) test-retest interval 4) administration 5) scoring 6) cooperation of test takers
Concurrent method process
1) Administer the test and the criterion measure at the same time to the same group of people 2) Correlate test scores with the criterion
The predictive method process
1. A group of people takes the test (the predictor) 2. Scores are held for a pre-established time interval 3. When the time has elapsed, researchers collect a measure of some behavior (the criterion) 4. Correlate test scores with the criterion
Questions w high internal consistency v low internal consistency?
High: 7 + 8 measures the same skill as 8 + 3; Low: 3 + 8 does not measure the same skill as 150 × 300
For example: rxx = .8 and σ = 20 and test score = 76
SEM = 20 × √(1 − .8) ≈ 8.94, so 95% CI = 76 ± 1.96 × 8.94 ≈ [58.47, 93.53]
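A sketch of this worked example in Python:

```python
# SEM = sd * sqrt(1 - rxx), then a 95% CI around the observed score of 76.
import math

def sem(sd, rxx):
    return sd * math.sqrt(1 - rxx)

s = sem(20, 0.8)                               # ~8.94
lo, hi = 76 - 1.96 * s, 76 + 1.96 * s
print(round(s, 2), round(lo, 2), round(hi, 2))  # 8.94 58.47 93.53
```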
Internal consistency and formula
A measure of how related items or groups of items on a test are to each other
-Process: the test (or each homogeneous subset of a heterogeneous test) is split in half, and scores on the first half are compared with scores on the second half
-Objective: assess how related items or groups of items are to one another
-Analysis: scores on both halves are correlated
-Administer the test to a single group to see if the test items all share something in common
-Appropriate for homogeneous tests
-Pearson product-moment correlation, corrected for length by the Spearman-Brown formula; also: coefficient alpha or KR-20
Confidence interval
A range of scores that we feel confident will include the test taker's true score
95% CI = X ± 1.96(SEM), where 1.96 marks the 2 points on the normal curve that include 95% of the scores
Alternate-Forms and formula
Alternate forms: the test developer creates two different forms of the test
Two ways to administer: (1) everyone takes Form 1 and then Form 2; (2) the first half of the class takes Form 1 while the second half takes Form 2 at the same time
-Pearson product-moment correlation
What is alternative to norm and criterion referenced test?
The alternative is authentic assessment: assesses a student's ability to perform real-world tasks by applying the knowledge and skills he or she has learned
Scorer reliability aka interrater reliability and formula
Amount of consistency among scorers' judgments
-Process: two or more people score the same tests
-Objective: assess consistency among scorers' judgments
-Analysis: scores from both scorers are correlated; intraclass correlations are used if there are more than 2 raters
Formula: Pearson product-moment correlation
Factor analysis
An advanced procedure that helps investigators explain why items within a test are correlated or why two different tests are correlated
Cohen's Kappa
An index for calculating scorer reliability/interrater agreement when scorers make judgments that result in nominal or ordinal data (pass/fail essay questions, rating scales on personality inventories)
-Ranges from −1 to +1; 1 = complete agreement
κ = (observed agreements − expected agreements) / (total count N − expected agreements), where expected agreements are summed across categories
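A minimal sketch of kappa for two raters making hypothetical pass/fail judgments, following the formula above:

```python
# Cohen's kappa for two raters on nominal judgments (made-up ratings).
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2))
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected agreements by chance, summed over each category:
    expected = sum(c1[cat] * c2[cat] / n for cat in c1.keys() | c2.keys())
    return (observed - expected) / (n - expected)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(r1, r2), 2))   # 0.67
```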
Reliability coefficient
An index of the strength of the relationship between two sets of test scores r=correlation coefficient rxx= reliability coefficient
What is generalizability theory?
Another approach for estimating reliability -Concerned with how well and under what conditions we can generalize an estimation of reliability of test scores from one test administration to another -Proposes separating sources of systematic error from random error in order to eliminate systematic error -breaks down variation and error into multiple sources (raters, occasions, items)
Construct
Constructs are measured through behaviors: actions that are observable and measurable.
Norm referenced test v criterion referenced test
Norm-referenced: compares a test taker's score with the scores of a group of test takers who took the test previously (or a current cohort) (SAT, ACT, GRE)
-Used to determine how well an individual's achievement compares with others'
-Test score is compared with others; reported as percentiles
Criterion-referenced: compares a test taker's score with an objectively stated standard of achievement (e.g., learning objectives for the chapters in this text)
-Used to learn whether an individual has learned specific knowledge or can demonstrate a skill
-Test score is compared with the highest possible score; reported as percentages
Concrete v Abstract attributes
Concrete attributes: can be clearly described in terms of observable and measurable behaviors. Ex: baking a cake, playing the piano, math knowledge
Abstract attributes: more difficult to describe in terms of behaviors because people may disagree on what the behaviors represent. Ex: intelligence, personality, creativity
What are the traditional views of validity? V the current one?
Content validity: evidence based on content
Criterion-related validity: evidence based on relations with other variables
Construct validity: evidence based on internal structure
-This view suggested that there were different "types" of validity
-The current view is that validity is a single concept with multiple sources of evidence to demonstrate it
Evidence of Construct validity? (2 types)
Convergent validity: evidence that test scores correlated with scores on other tests that measure the same construct Discriminant validity: evidence that test scores are not correlated with unrelated constructs
Correlation coefficient v validity coefficients
Correlation coefficient: a quantitative estimate of the relationship between two variables
Validity coefficients: correlation coefficients for predictive evidence and concurrent evidence
Correlation coefficient v reliability coefficient
Correlation coefficient: −1 to +1; Reliability coefficient: 0 to +1 (a negative reliability would probably indicate a computational problem)
Multitrait-multimethod (MTMM) design:
Donald Campbell and Donald Fiske cleverly combined the need to collect evidence of reliability/precision, convergent evidence of validity, and discriminant evidence of validity into one study. The matrix contains monotrait-heteromethod, heterotrait-monomethod, and heterotrait-heteromethod correlations.
What are the two methods for obtaining evidence of validity based on test content?
During test development; after test development
Effect sizes for Eta squared
η²: .01 = small, .059 = medium, .138 = large
Ways to establish quantitative evidence
Experimental interventions
Evidence of validity based on content
Evidence of validity based on relations with external criteria
Multiple studies
Gathering evidence of construct validity (triangles)
Full triangle: heterotrait-monomethod (in the multitrait-multimethod correlation matrix): want low correlations; if they are not low, you have method variance
Dashed triangle: heterotrait-heteromethod → discriminant validity: want low correlations
Not in a triangle: monotrait-heteromethod → convergent validity: want high correlations
Test of significance
Helps us determine the relationship between the test and the criterion and the likelihood that the relationship would have been found by chance alone: "How likely is it that the correlation between the test and the criterion resulted from chance or sampling error?" If p < .05, we can be confident that the test and criterion are related and that the relationship is not due to chance or sampling error. Significant means there is evidence of a true relationship; not significant means there is no evidence of a relationship.
Calculating internal reliability for heterogenous tests
Heterogeneous tests have multiple subtests or factors (e.g., accounting skills, calculation skills, use of spreadsheets); internal consistency is calculated separately for each homogeneous subtest
What is the difference between interrater reliability and interrater agreement?
Interrater reliability: give the test once and have it scored (interval- or ratio-level data) by two scorers or two methods
Interrater agreement: create a rating instrument and have it completed by two judges (nominal- or ordinal-level data)
Interrater agreement v intrarater agreement and formula
Interrater: an index of how consistently the scorers rate or make decisions on the same test. Formula: Cohen's kappa
Intrarater: when one scorer makes judgments, the researcher also wants assurance that that scorer makes consistent judgments across all tests. Formula: intraclass correlation coefficient
Interscorer v intrascorer
Interscorer (aka scorer reliability): the amount of consistency among scorers' judgments
Intrascorer reliability: whether each scorer was consistent in the way he or she assigned scores from test to test
During Test Development
Involves performing a series of systematic steps: 1. Defining the testing universe 2. Developing the testing specifications 3. Establishing a test format 4. Constructing test questions
Objective criterion v subjective criterion
Objective criterion: is observable and measurable. E.g., number of accidents on the job, number of days absent, number of disciplinary problems in a month, behavioral observation, GPA, withdrawal or dismissal
Subjective criterion: based on a person's judgment. E.g., supervisor and peer ratings, teacher recommendations, diagnosis
For the test-retest graph on the slide. Do you think self-efficacy changes over time?
It does not, because r = .752 is high, so self-efficacy is stable over time (sig. = .000)
Job analysis
Job analysis: a process that identifies the knowledge, skills, abilities, and other characteristics required to perform a job.
Concurrent method
A large group takes the test (the predictor). The same group takes another measure (the criterion) with evidence of reliability and validity. Both are taken at the same point in time. Scores are correlated.
Predictive method
A large group takes the test (the predictor) and their scores are held (e.g., for 6 months). In the future, the same group is administered a second measure with evidence of reliability and validity (the criterion). Scores are correlated.
In general, what kinds of tests are more reliable?
Longer tests
How to create y= mx + b formula from statistical output?
Look at the coefficients table. The value under B for the predictor (e.g., hours studied) is the slope for x; the value under B for the constant is the intercept. So y′ = 9.732 + .622x
Where do you look in the statistical output for how adding a 2nd predictor affected incremental validity?
Look at the R-square change (ΔR²)
Low incremental validity v high incremental validity
Low: large overlap between test 1 and test 2 High: small overlap between test 1 and 2 denoted as r1,2
What is reliability?
Measures consistency -A reliable test is one we can trust to measure each person in approximately the same way every time it is used
Other definitions: reliability refers to the degree to which test scores are free from errors of measurement ... the same result on repeated trials
Homogeneous tests v heterogeneous tests
Measuring only one trait/characteristic v measuring more than one trait/characteristic
multicollinearity
Multicollinearity is a statistical concept in which several independent variables in a model are correlated with each other. Multicollinearity among independent variables results in less reliable statistical inferences. In the class example, the motor cortex measure and alcohol intake were highly correlated with each other (.797). When two predictors are highly correlated, it is harder to tease out each predictor's unique contribution; we can't tell whether each truly predicts the criterion or whether they merely overlap.
What is the challenge of test-retest reliability?
Practice effects and fatigue. Make the interval long enough to minimize practice effects, and make sure test takers have not been permanently changed by taking the test.
Evidence based on relations with other variables: previously criterion-related validity: What are the two types of evidence?
Predictive validity: extent to which scores predict future behavior Concurrent validity: extent to which scores correlate with current performance
Random error v Systematic error
Random error: increases or decreases a person's score unpredictably; with infinite testing it cancels itself out; lowers the reliability of a test
Systematic error: occurs when a source of error always increases or decreases the true score by exactly the same amount; does not lower the reliability of a test, since the test is reliably inaccurate by the same amount each time
What does reliability depend on and what does validity depend on?
Reliability depends on characteristics of the test itself Validity depends on the inferences that are going to be made from test scores
Reliability (statistically speaking)
Reliability is the ratio of true score variance over observed score variance % of observed score variance attributable to true score variance
Reliability v Precision per text
"Reliability/precision" is the general term; for statistical evaluation, "reliability coefficient" is preferred
Exploratory v Confirmatory Analysis
Exploratory: researchers do not propose a formal hypothesis about the factors that underlie a set of test scores, but instead use the procedure broadly to help identify underlying components (not based on theory/hypothesis)
Confirmatory: researchers specify in advance what they believe the factor structure of the data should look like and then statistically test how well that model actually fits the data; a statistical procedure to confirm whether the factors in a theory actually exist
What programs calculate coefficient alpha and KR-20?
SAS and SPSS
Even though it is underreported, suicide is what number cause of death among adolescents?
Second or third
Random error
Unexplained difference between a person's actual score on a test (the obtained score) and that person's true score lowers reliability
What can we assume about SEM if an individual took a test an infinite number of times:
Standard deviation is the measure of spread in your sample. Standard error is more of an estimate for the population: it allows us to compute a confidence interval to estimate the location of the true population mean. So if you compute a 95% confidence interval (±1.96 × SE), you can claim that if you computed the sample mean an infinite number of times, the true population mean would fall within the confidence interval 95% of the time.
-Approximately 68% of the observed test scores (X) would occur within ±1 SEM of the true score (T)
-Approximately 95% of the observed test scores (X) would occur within ±2 SEM of the true score (T)
-Approximately 99.7% of the observed test scores (X) would occur within ±3 SEM of the true score (T)
-Use this info to create confidence intervals: a range of scores that we feel confident will include the test taker's true score
Spearman and Brown formula
Statistically adjusts the reliability coefficient when test length is reduced to estimate what the reliability would have been if the test were longer -Used for split half reliability: which correlates first half of measure with second half of measure
"If a student scores 65 on the ASE test, what course grade would we expect the student to receive?" PROCESS
Step 1: Calculate the means and standard deviations of X and Y. Step 2: Calculate the correlation coefficient (rxy) for X and Y. Step 3: Calculate the slope and intercept. Step 4: Calculate Y′ when X = 65. Step 5: Translate the number calculated for Y′ back into a letter grade.
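A sketch of these steps in Python, using hypothetical ASE scores (X) and numeric course grades (Y); the actual class data are not shown in the card:

```python
# The five regression steps with made-up (X, Y) pairs.
import math

X = [55, 60, 65, 70, 75, 80]           # hypothetical ASE test scores
Y = [2.0, 2.3, 2.7, 3.0, 3.3, 3.7]     # hypothetical grades (4.0 scale)

n = len(X)
mx, my = sum(X) / n, sum(Y) / n                            # Step 1: means...
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / (n - 1))    # ...and SDs
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / (n - 1))
r = sum((x - mx) * (y - my)
        for x, y in zip(X, Y)) / ((n - 1) * sx * sy)       # Step 2: rxy
b = r * (sy / sx)                                          # Step 3: slope...
a = my - b * mx                                            # ...and intercept
y_pred = a + b * 65                                        # Step 4: Y' at X = 65
print(round(r, 3), round(y_pred, 2))                       # Step 5: map Y' to a grade
```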
How is split halves done?
Half of the test questions in the original test are randomly assigned to split half 1 and the other half to split half 2. This procedure results in two tests, each one half as long as the original test. Because the more questions on a test, the higher the reliability, we must adjust the reliability coefficient with the Spearman-Brown formula.
What are the four types of reliability coefficients?
Test-retest, alternate forms, internal consistency, scorer
What are the two methods for evaluating validity coefficients?
Tests of significance Coefficient of determination
KR-20
Use when questions can be scored either right or wrong, i.e., true-false and multiple-choice questions
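A minimal sketch of the KR-20 computation, KR-20 = (k/(k−1))(1 − Σpq/σ²), where p is the proportion answering an item right, q = 1 − p, and σ² is the variance of total scores; the response matrix below is made up:

```python
# KR-20 for dichotomously scored items (rows = test takers, 1 = right).
def kr20(scores):
    k = len(scores[0])                      # number of items
    n = len(scores)                         # number of test takers
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n   # proportion answering right
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

items = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(round(kr20(items), 2))   # ~0.9
```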
Criterion
The measure of performance (independent behaviors, attitudes, events) that we correlate with test scores
True score
The score that would be obtained if an individual took a test an infinite number of times and then the average scores of all testings were computed
monotrait monomethod
These are the correlations between the same (mono) traits measured using the same (mono) method. This is equivalent to correlating a test with itself, so it is really a reliability coefficient
What is the mathematical relationship of validity and reliability?
The validity coefficient can't be greater than the square root of the reliability: rxy ≤ √rxx. If rxx = .64, then rxy can be at most .8. A test needs to be reliable to also be valid.
Gathering Psychometric Evidence
Ways to establish quantitative evidence Reliability Convergent evidence of validity Discriminant evidence of validity Multitrait-multimethod design
What does cross loading mean
When an item is correlated equally to two components
The concurrent method
When test administration and criterion measurement happen at the same time -Appropriate for validating clinical tests that diagnose behavioral, emotional, or mental disorders and selection tests -Often used for selection tests because employers do not want to hire applicants with low test scores or wait a long time to get criteria data
Evidence of validity based on relationships with external criteria
When test scores correlate with independent behaviors, attitudes, or events
Multiple regression
Y′ = a + b1X1 + b2X2 + ... Y′ = the predicted score on the criterion a = the intercept b = the slope of the regression line and amount of variance the predictor contributes to the equation, also known as beta (β) X = the predictor
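A minimal sketch of fitting this equation by ordinary least squares with two hypothetical predictors (all numbers are illustrative only):

```python
# Multiple regression: solve Y' = a + b1*X1 + b2*X2 via least squares.
import numpy as np

X1 = np.array([10, 12, 8, 15, 11, 9])    # hypothetical predictor 1
X2 = np.array([3, 5, 2, 6, 4, 3])        # hypothetical predictor 2
Y  = np.array([20, 26, 16, 32, 24, 19])  # hypothetical criterion

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
print(f"Y' = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```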
Developing the testing specifications
a documented plan containing details about a test's content, similar to a blueprint. Includes content areas: the subject matter that the test will measure.
Nomological network
a method for defining a construct by illustrating its relation to as many other constructs and behaviors as possible
The one that looks like a box plot but isn't is called
a scatter bar
Coefficient of multiple determination
a statistic for interpreting the results of a multiple regression.
Validity (traditional v current view)
accuracy Traditional: Does the test measure what it was designed to measure? Current: Does the test have evidence of validity for its intended use? A test can measure what it was designed to measure, but not be valid for a particular purpose (in class example of scale (kg) and intelligence)
An alpha of .70 may be fine for exploratory studies but not for...
basing critical, individual decisions (hire/fire, medicate/don't medicate)
Order effects
changes in test scores resulting from the order in which the tests were taken differences in participant responses as a result of the order in which treatments are presented to them.
Construct explication
defining or explaining a psychological construct 1) Identify the behaviors that relate to the construct. 2) Identify other constructs that may be related to the construct being explained. 3) Identify behaviors related to similar constructs, and determine whether these behaviors are related to the original construct.
Parallel forms
describes different forms of the same test
Goodness-of-fit test
evidence that the factors obtained empirically are similar to those proposed theoretically.
Criterion contamination
occurs when the criterion measures more dimensions than those measured by the test
Convergent evidence of validity:
if the test is measuring a particular construct, we expect the scores on the test to correlate strongly with scores on other tests that measure the same or similar constructs.
Competency modeling
is a procedure that identifies the knowledge, skills, abilities, and other characteristics most critical for success on some or all of the jobs in an organization
Linear regression v multiple regression
Linear regression uses one predictor/test to predict a criterion; multiple regression uses more than one predictor/test to predict the criterion
Correction for attenuation
rxtyt = rxoyo/sqrt(RxxRyy)
Correction for attenuation (CA) is a method that allows researchers to estimate the relationship between two constructs as if they were measured perfectly reliably and free from the random errors that occur in all observed measures.
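A sketch of the correction with illustrative numbers:

```python
# Estimate the true-score correlation from an observed correlation and
# the reliabilities of the two measures (values are illustrative only).
import math

def correct_for_attenuation(r_observed, rxx, ryy):
    return r_observed / math.sqrt(rxx * ryy)

# Observed r = .40 with reliabilities .80 and .70:
print(round(correct_for_attenuation(0.40, 0.80, 0.70), 2))   # ~0.53
```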
Spearman brown formula
rxx = nr / (1 + (n − 1)r)
rxx = estimated reliability of the test
n = number of questions in the revised version divided by the number of questions in the original version of the test
r = the calculated correlation coefficient for the two short forms of the test
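A minimal sketch of the formula; r is the correlation between the two half-tests and n is the revised length divided by the original length:

```python
# Spearman-Brown prophecy formula.
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

# Split-half correlation of .70, projected to full (double) length:
print(round(spearman_brown(0.70, 2), 3))   # 0.824
```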
Content validity is heavily reliant on the assumption
that the test constructor has truly tapped the domain of interest
Defining the testing universe
the body of knowledge or behaviors that a test represents.....review other instruments...interview experts...research the construct
slope (b) of the regression line
the expected change in Y for every one-unit change in X
b = r(Sy/Sx), where r = the correlation coefficient, Sx = the standard deviation of the distribution of X, and Sy = the standard deviation of the distribution of Y
Intercept
the place where the regression line crosses the y-axis, i.e., the predicted value of Y when the predictor is 0
-Sometimes doesn't mean anything, e.g., if no one scores a zero or if that score is not possible
a = Ȳ − bX̄, where Ȳ = the mean of the distribution of Y, b = the slope, and X̄ = the mean of the distribution of X
Attenuation
the reduction in the validity coefficient caused by unreliability
rxoyo = rxtyt × sqrt(RxxRyy)
Construct validity process involves gathering two types of data?
Theoretical evidence and psychometric evidence
When a relationship can be established between test scores and a criterion, we can...
use test scores from other individuals to predict how well those individuals will perform on the criterion measure.
Although the general incidence of suicide has decreased during the past two decades, the rate for people between 15 and 24 years of age has
tripled
Measurement error
variations in measurement using a reliable instrument
Restriction of range
when a range of test scores or criterion scores is reduced or restricted; when the range is restricted, r usually becomes lower
Discriminant evidence of validity:
when the test scores do not correlate with unrelated constructs.